[ https://issues.apache.org/jira/browse/IMPALA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736116#comment-16736116 ]
ASF subversion and git services commented on IMPALA-7931:
---------------------------------------------------------

Commit a91b24cb7962200f330c4887f38f4704a52f7c7e in impala's branch refs/heads/master from Tim Armstrong
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=a91b24c ]

IMPALA-7931: fix executor shutdown races

There were two races:
* queries were terminated because of an impalad being detected as
  failed by the statestore even if the query had finished executing on
  that impalad.
* NUM_FRAGMENTS_IN_FLIGHT was used to detect the backend being idle,
  but it was decremented before the final status report was sent.

The fixes are:
* keep track of the backends that triggered the potential cancellation,
  and only proceed with the cancellation if the coordinator has fragments
  still executing on the backend.
* add a new metric that keeps track of the number of executing queries,
  which isn't decremented until the final status report is sent.

Also do some cleanup/improvements in this code:
* use proper error codes for some errors
* more overloads for Status::Expected()
* also add a metric for the total number of queries executed on
  the backend

Testing:
Add a new version of test_shutdown_executor with delays that
trigger both races. This test only runs in exhaustive to avoid
adding ~20s to core build time.

Ran exhaustive tests.

Looped test_restart_services overnight.

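
For illustration only, the two fixes described in the commit message can be modelled roughly as follows. This is a minimal Python sketch, not the actual Impala C++ change: the class names, the counter names other than NUM_FRAGMENTS_IN_FLIGHT, and the host:port strings are all hypothetical and were chosen just to show the ordering that matters.

```python
from dataclasses import dataclass, field


@dataclass
class QueryState:
    """Coordinator-side view of one query (hypothetical model)."""
    query_id: str
    # Backends on which this query still has fragments executing.
    executing_backends: set = field(default_factory=set)
    cancelled: bool = False

    def backend_finished(self, backend):
        # Called once the query has finished executing on 'backend'.
        self.executing_backends.discard(backend)


def cancel_for_failed_backends(queries, failed_backends):
    """Cancel only queries that still have fragments on a failed backend.

    Returns the ids of the queries that were cancelled.
    """
    cancelled = []
    for q in queries:
        if q.executing_backends & failed_backends:
            q.cancelled = True
            cancelled.append(q.query_id)
    return cancelled


class BackendMetrics:
    """Executor-side counters (hypothetical names, except the first)."""

    def __init__(self):
        self.num_fragments_in_flight = 0  # models NUM_FRAGMENTS_IN_FLIGHT
        self.num_queries_executing = 0    # assumed name for the new metric
        self.num_queries_executed = 0     # assumed name for the total count

    def start_query(self, num_fragments):
        self.num_queries_executing += 1
        self.num_fragments_in_flight += num_fragments

    def fragments_done(self, num_fragments):
        # Fragments finish before the final status report goes out.
        self.num_fragments_in_flight -= num_fragments

    def final_report_sent(self):
        # Only now does the query stop counting as executing.
        self.num_queries_executing -= 1
        self.num_queries_executed += 1

    def is_idle(self):
        # Idle detection uses the query counter, not the fragment counter.
        return self.num_queries_executing == 0
```

In this model a backend whose fragments have all completed, but whose final status report is still pending, is not reported as idle, and a query that has already finished on a backend is not cancelled when that backend later drops out of the statestore's membership.
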
Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
Reviewed-on: http://gerrit.cloudera.org:8080/12082
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> test_shutdown_executor fails with timeout waiting for query target state
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-7931
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7931
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.2.0
>            Reporter: Lars Volker
>            Assignee: Tim Armstrong
>            Priority: Critical
>              Labels: broken-build
>         Attachments: impala-7931-impalad-logs.tar.gz
>
>
> On a recent S3 test run test_shutdown_executor hit a timeout waiting for a
> query to reach state FINISHED. Instead the query stays at state 5 (EXCEPTION).
> {noformat}
> 12:51:11 __________________ TestShutdownCommand.test_shutdown_executor __________________
> 12:51:11 custom_cluster/test_restart_services.py:209: in test_shutdown_executor
> 12:51:11     assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> 12:51:11 custom_cluster/test_restart_services.py:356: in __fetch_and_get_num_backends
> 12:51:11     self.client.QUERY_STATES['FINISHED'], timeout=20)
> 12:51:11 common/impala_service.py:267: in wait_for_query_state
> 12:51:11     target_state, query_state)
> 12:51:11 E   AssertionError: Did not reach query state in time target=4 actual=5
> {noformat}
> From the logs I can see that the query fails because one of the executors
> becomes unreachable:
> {noformat}
> I1204 12:31:39.954125  5609 impala-server.cc:1792] Query a34c3a84775e5599:b2b25eb900000000: Failed due to unreachable impalad(s): jenkins-worker:22001
> {noformat}
> The query was {{select count\(*) from functional_parquet.alltypes where
> sleep(1) = bool_col}}.
> It seems that the query took longer than expected and was still running when
> the executor shut down.
> I can reproduce by adding a sleep to the test:
> {noformat}
> diff --git a/tests/custom_cluster/test_restart_services.py b/tests/custom_cluster/test_restart_services.py
> index e441cbc..32bc8a1 100644
> --- a/tests/custom_cluster/test_restart_services.py
> +++ b/tests/custom_cluster/test_restart_services.py
> @@ -206,7 +206,7 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
>      after_shutdown_handle = self.__exec_and_wait_until_running(QUERY)
>  
>      # Finish executing the first query before the backend exits.
> -    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> +    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle, delay=5) == 3
>  
>      # Wait for the impalad to exit, then start it back up and run another query, which
>      # should be scheduled on it again.
> @@ -349,11 +349,14 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
>          self.client.QUERY_STATES['RUNNING'], timeout=20)
>      return handle
>  
> -  def __fetch_and_get_num_backends(self, query, handle):
> +  def __fetch_and_get_num_backends(self, query, handle, delay=0):
>      """Fetch the results of 'query' from the beeswax handle 'handle', close the
>      query and return the number of backends obtained from the profile."""
>      self.impalad_test_service.wait_for_query_state(self.client, handle,
>          self.client.QUERY_STATES['FINISHED'], timeout=20)
> +    if delay > 0:
> +      LOG.info("sleeping for {0}".format(delay))
> +      time.sleep(delay)
>      self.client.fetch(query, handle)
>      profile = self.client.get_runtime_profile(handle)
>      self.client.close_query(handle)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)