[
https://issues.apache.org/jira/browse/IMPALA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712154#comment-16712154
]
Tim Armstrong commented on IMPALA-7931:
---------------------------------------
Here's an outline of a fix:
# Include the list of failed hosts that triggered the cancellation (represented
by TNetworkAddresses) in CancellationWork
# Plumb this through to ClientRequestState::Cancel()
# Only cancel the query if the query is still running on one of the hosts in
that list - we can check with the coordinator first about whether the
Note that we still might cancel queries that finished on the bad host *just* in
time, but this is sufficient to avoid the bad race where the host shuts down
and causes query cancellation before the query finishes successfully.
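
As a rough illustration of the third step, the check reduces to a set intersection between the failed hosts and the hosts the query is still running on. This is a minimal Python model of the logic only, not Impala's C++ code: NetworkAddress is a stand-in for TNetworkAddress, and should_cancel and hosts_still_running are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class NetworkAddress:
    """Hypothetical stand-in for TNetworkAddress (hostname and port only)."""
    hostname: str
    port: int


def should_cancel(failed_hosts: FrozenSet[NetworkAddress],
                  hosts_still_running: Iterable[NetworkAddress]) -> bool:
    """Return True only if the query still has work on a failed host.

    hosts_still_running is assumed to come from the coordinator's view of
    which backends have not yet finished; how that is obtained is out of
    scope for this sketch.
    """
    return not failed_hosts.isdisjoint(hosts_still_running)


# Example addresses: the first one appears in the log quoted below, the
# second is made up for illustration.
bad = NetworkAddress("jenkins-worker", 22001)
other = NetworkAddress("jenkins-worker", 22002)
failed = frozenset({bad})
```

With these values, should_cancel(failed, {bad, other}) is true, while should_cancel(failed, {other}) is false, so a query that has already finished on the failed host is left alone.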
Maybe this could be cleaned up a bit - the current interfaces, with their
opaque Status cause, are a bit confusing. We could probably have a proper
struct with more explicit information about the semantics of the cancellation.
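
One possible shape for such a struct, sketched in Python rather than Impala's C++: the cause is an explicit enum instead of an opaque status, and the failed hosts travel with the request. Every name here (CancellationCause, CancellationRequest, describe) and the message wording are assumptions made for illustration, not existing Impala interfaces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class CancellationCause(Enum):
    """Hypothetical explicit reasons for a cancellation."""
    CLIENT_REQUEST = "client_request"
    BACKEND_FAILURE = "backend_failure"


@dataclass(frozen=True)
class CancellationRequest:
    """Hypothetical struct carrying explicit cancellation semantics."""
    query_id: str
    cause: CancellationCause
    # (hostname, port) pairs; empty unless the cause is BACKEND_FAILURE.
    failed_hosts: FrozenSet[Tuple[str, int]] = frozenset()

    def describe(self) -> str:
        """Build a human-readable summary (wording is illustrative only)."""
        if self.cause is CancellationCause.BACKEND_FAILURE:
            hosts = ", ".join(f"{h}:{p}" for h, p in sorted(self.failed_hosts))
            return f"query {self.query_id}: failed hosts {hosts}"
        return f"query {self.query_id}: cancelled by client"


req = CancellationRequest(
    query_id="a34c3a84775e5599:b2b25eb900000000",
    cause=CancellationCause.BACKEND_FAILURE,
    failed_hosts=frozenset({("jenkins-worker", 22001)}),
)
```

Keeping the cause and the host list in one immutable value means the code deciding whether to cancel does not have to parse a status message to learn why the cancellation was requested.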
[~kwho] [~twmarshall] I know you've been looking at some of this cancellation
code - does something like the above sound like a reasonable approach to you?
> test_shutdown_executor fails with timeout waiting for query target state
> ------------------------------------------------------------------------
>
> Key: IMPALA-7931
> URL: https://issues.apache.org/jira/browse/IMPALA-7931
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 3.2.0
> Reporter: Lars Volker
> Assignee: Tim Armstrong
> Priority: Critical
> Labels: broken-build
> Attachments: impala-7931-impalad-logs.tar.gz
>
>
> On a recent S3 test run, test_shutdown_executor hit a timeout waiting for a
> query to reach state FINISHED. Instead, the query stays at state 5 (EXCEPTION).
> {noformat}
> 12:51:11 __________________ TestShutdownCommand.test_shutdown_executor __________________
> 12:51:11 custom_cluster/test_restart_services.py:209: in test_shutdown_executor
> 12:51:11 assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> 12:51:11 custom_cluster/test_restart_services.py:356: in __fetch_and_get_num_backends
> 12:51:11 self.client.QUERY_STATES['FINISHED'], timeout=20)
> 12:51:11 common/impala_service.py:267: in wait_for_query_state
> 12:51:11 target_state, query_state)
> 12:51:11 E AssertionError: Did not reach query state in time target=4 actual=5
> {noformat}
> From the logs I can see that the query fails because one of the executors
> becomes unreachable:
> {noformat}
> I1204 12:31:39.954125 5609 impala-server.cc:1792] Query a34c3a84775e5599:b2b25eb900000000: Failed due to unreachable impalad(s): jenkins-worker:22001
> {noformat}
> The query was {{select count(*) from functional_parquet.alltypes where sleep(1) = bool_col}}.
> It seems that the query took longer than expected and was still running when
> the executor shut down.