[
https://issues.apache.org/jira/browse/IMPALA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716039#comment-16716039
]
Michael Ho commented on IMPALA-7931:
------------------------------------
[~tarmstrong], this sounds reasonable to me too.
I took a quick look at the existing shutdown code. It currently relies on
the number of fragments running on an executor to determine whether it can be
shut down after the grace period. This may not be ideal, as there may still be
ReportExecStatus() RPCs in flight around that time, even after all fragment
instances of a query have completed. It may help to use a new metric tracked in
QueryExecMgr to count the actual number of outstanding queries. That said, this
most likely falls into the second category of race, in which "a backend finishes
right after the check with coordinator".
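
To illustrate the distinction above, here is a minimal Python sketch of the two idleness checks. It is not Impala's actual code: the class ExecutorState and all of its methods are hypothetical names, and it assumes a query stays "outstanding" until its final status report has been delivered.

```python
import threading


class ExecutorState:
    """Hypothetical per-executor bookkeeping; not Impala's real classes."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running_fragments = 0    # fragment instances still executing
        self.outstanding_queries = 0  # queries whose final report is not yet sent

    def start_query(self, num_fragments):
        with self._lock:
            self.outstanding_queries += 1
            self.running_fragments += num_fragments

    def fragment_done(self):
        with self._lock:
            self.running_fragments -= 1

    def final_report_sent(self):
        # Assumption: called only once the last status report RPC has completed.
        with self._lock:
            self.outstanding_queries -= 1

    def idle_by_fragments(self):
        with self._lock:
            return self.running_fragments == 0

    def idle_by_queries(self):
        with self._lock:
            return self.outstanding_queries == 0


state = ExecutorState()
state.start_query(num_fragments=2)
state.fragment_done()
state.fragment_done()
# All fragments are done, but the final report is still in flight.
fragments_idle = state.idle_by_fragments()
queries_idle_before_report = state.idle_by_queries()
state.final_report_sent()
queries_idle_after_report = state.idle_by_queries()
```

In this sketch the fragment-based check reports the executor as idle while the query-based check does not, which is the window in which a shutdown could cut off a report still in flight.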
> test_shutdown_executor fails with timeout waiting for query target state
> ------------------------------------------------------------------------
>
> Key: IMPALA-7931
> URL: https://issues.apache.org/jira/browse/IMPALA-7931
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 3.2.0
> Reporter: Lars Volker
> Assignee: Tim Armstrong
> Priority: Critical
> Labels: broken-build
> Attachments: impala-7931-impalad-logs.tar.gz
>
>
> On a recent S3 test run test_shutdown_executor hit a timeout waiting for a
> query to reach state FINISHED. Instead the query stays at state 5 (EXCEPTION).
> {noformat}
> 12:51:11 __________________ TestShutdownCommand.test_shutdown_executor __________________
> 12:51:11 custom_cluster/test_restart_services.py:209: in test_shutdown_executor
> 12:51:11 assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> 12:51:11 custom_cluster/test_restart_services.py:356: in __fetch_and_get_num_backends
> 12:51:11 self.client.QUERY_STATES['FINISHED'], timeout=20)
> 12:51:11 common/impala_service.py:267: in wait_for_query_state
> 12:51:11 target_state, query_state)
> 12:51:11 E AssertionError: Did not reach query state in time target=4 actual=5
> {noformat}
> From the logs I can see that the query fails because one of the executors
> becomes unreachable:
> {noformat}
> I1204 12:31:39.954125 5609 impala-server.cc:1792] Query a34c3a84775e5599:b2b25eb900000000: Failed due to unreachable impalad(s): jenkins-worker:22001
> {noformat}
> The query was {{select count(*) from functional_parquet.alltypes where
> sleep(1) = bool_col}}.
> It seems that the query took longer than expected and was still running when
> the executor shut down.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)