[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/921 My suggestion is to go ahead and commit this PR with Jyothsna's change to disable the "graceful" part of the graceful shutdown. The feature will be in the product and people can turn it on to play with it. Then, @dvjyothsna should file a JIRA for the improvements we discussed (enabling by default for true servers, disabling for embedded.) ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 Since we don't want to hold the release any more. @paul-rogers should we address this issue in the next release or move this PR to the next release? I am in the favor of the second. ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/921 Interesting issue! Let's think about this a bit. In a production system, we do want the grace period; it is an essential part of the graceful shutdown procedure. However, if we are doing a non-graceful shutdown, the grace is unneeded. Also, if the cluster contains only one node (as in most unit tests), there is nothing to wait for, so the grace period is not needed. The same is true in an embedded Drillbit for Sqlline. So, can we provide a solution that handles these cases rather than simply turning off the grace period always? If using the local cluster coordinator, say, then no grace is needed. If using ZK, but there is only one Drillbit, no grace is needed. (There is a race condition, but may be OK.) Or, if we detect we are embedded, no grace period. Then, also, if we are doing a graceful shutdown, we need the grace. But, if we are doing a "classic" shutdown, no grace is needed. The result should be that the grace period is used only in production servers, only when doing a graceful shutdown. There are probably some details to simplify, but I hope the above communicates the idea. Can we make this work? ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user dvjyothsna commented on the issue: https://github.com/apache/drill/pull/921 This was due to grace_period during shutdown of drillbits. Fixed it by changing grace_period = 0 in .conf ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 @dvjyothsna all tests has passed. Though there is significant performance degradation. Before your changes unit test run takes ~ 33 minutes, after ~ 1 hour 18 minutes. For example, exec package: before `[INFO] exec/Java Execution Engine . SUCCESS [17:14 min]`, after `[INFO] exec/Java Execution Engine . SUCCESS [56:45 min]`. Since you did not add that many unit tests, could you please explain the reason of such performance degradation and ideally fix it. Attached two unit tests outputs in Jira. ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 There are unit test failures with this fix. Full output is attached into Jira. Please check and fix. ``` TestMetadataProvider.columns:251 expected:<92> but was:<93> DrillSeparatePlanningTest.testMultiMinorFragmentComplexQuery:122->getResultsHelper:219 expected:<1> but was:<0> DrillSeparatePlanningTest.testSingleFragmentQuery:88->getResultsHelper:219 expected:<1> but was:<0> DrillSeparatePlanningTest.testMultiMinorFragmentSimpleQuery:105->getResultsHelper:219 expected:<1> but was:<0> ``` ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 +1, LGTM. ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 Is there any ETA for the fix? ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 @dvjyothsna also please remove `e.printStackTrace();` from the code, I see 4 occurrences. ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 @paul-rogers is right, there is something wrong with the protos @dvjyothsna has generated. I took master branch and generated protos. No changes so these extra changes we see are wrong. Also I have created branch where I have added changes in protos from this PR [1] and from the diff it can be seen that no extra changes as well [2]. @dvjyothsna please address this issue. If it helps: I have installed protoc: ``` protoc --version libprotoc 2.5.0 ``` and generated using the following: ``` cd ~/git_repo/drill/protocol mvn process-sources -P proto-compile ``` [1] https://github.com/arina-ielchiieva/drill/tree/DRILL-4386-proto [2] https://github.com/arina-ielchiieva/drill/commit/07af545ab27b77e5f86002d4ce0f54b3dd371343 ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/921 @dvjyothsna is there any update on the status for this PR? Should it be targeted for the 1.12.0 release? ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user dvjyothsna commented on the issue: https://github.com/apache/drill/pull/921 @weijietong Please refer to comments under JIRA - https://issues.apache.org/jira/browse/DRILL-4286 ---
[GitHub] drill issue #921: DRILL-4286 Graceful shutdown of drillbit
Github user weijietong commented on the issue: https://github.com/apache/drill/pull/921 @dvjyothsna thanks for your contribution. Some objections please consider, the shutdown process should be quick and allow fault occurred. Google calls it lame duck model. If your cluster is large and has high concurrent queries,the grace model will let the shutdown process and the service unavailable time both too long. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---