[ https://issues.apache.org/jira/browse/IMPALA-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504078#comment-16504078 ]
ASF subversion and git services commented on IMPALA-7101:
---------------------------------------------------------
Commit 04a6e805fb517967f60565e683db39fc7a17aa2c in impala's branch
refs/heads/2.x from [~dhecht]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=04a6e80 ]
IMPALA-7101: Fix race between Fetch and Close RPCs that can lead to hang
If we hit EOS, we'll wait for all the backends to report status (to try
to get a complete profile). But if the query is closed after this point,
then we can get stuck waiting since once the query is closed,
ImpalaServer won't know about this coordinator and so it will stop
forwarding on the ReportStatus RPCs.
The real fix for this is IMPALA-6984, but in the meantime, add another
special case for this JIRA (see the other TODO IMPALA-6984 in
coordinator.cc).
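
To illustrate the hang described above, here is a minimal Python sketch of a
toy model, not Impala's actual C++ code. The class ToyCoordinator and its
methods are hypothetical names invented for this example. It assumes the
special case amounts to letting the wait for backend reports end once the
query is closed; the real change in coordinator.cc may be shaped differently.

```python
import threading


class ToyCoordinator:
    """Hypothetical stand-in for a coordinator; not Impala's real code."""

    def __init__(self, num_backends):
        self._cv = threading.Condition()
        self._pending = num_backends  # backends that have not reported yet
        self._closed = False

    def report_status(self):
        # Called when a backend's final status report is forwarded to us.
        with self._cv:
            self._pending -= 1
            self._cv.notify_all()

    def close(self):
        # Once closed, no further reports will be forwarded, so wake any
        # waiter instead of leaving it blocked forever.
        with self._cv:
            self._closed = True
            self._cv.notify_all()

    def wait_for_reports(self, timeout=None):
        # Returns True if every backend reported, False if the wait ended
        # early because the query was closed (or the timeout expired).
        with self._cv:
            self._cv.wait_for(
                lambda: self._pending == 0 or self._closed, timeout)
            return self._pending == 0
```

Without the `self._closed` term in the wait predicate, a close that arrives
after EOS but before the last report would leave wait_for_reports() blocked,
which is the shape of the hang the commit message describes.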
The cancellation test only hits this race occasionally (after several
hours), and only indirectly, in a COMPUTE STATS query, because the
ChildQueryExecutor will do a CloseOperation() while the execution thread
is inside Fetch(). To make this more reproducible, modify the
cancellation test to allow the close and fetch RPCs to execute
concurrently (don't join the test's fetch thread until after
close). This makes the race reproducible within a few iterations and a few
minutes.
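
The test change described above can be sketched roughly as follows. This is
an illustrative pattern only, not the contents of test_cancellation.py:
FakeClient, its fetch() and close() methods, and run_close_during_fetch()
are hypothetical names, and the sleep stands in for a fetch RPC that is
still in flight.

```python
import threading
import time


class FakeClient:
    """Hypothetical client; fetch() and close() only record what happened."""

    def __init__(self):
        self.events = []
        self._lock = threading.Lock()
        self.fetch_started = threading.Event()

    def _log(self, name):
        with self._lock:
            self.events.append(name)

    def fetch(self):
        self._log("fetch_start")
        self.fetch_started.set()
        time.sleep(0.05)  # simulate a fetch RPC that takes a while
        self._log("fetch_end")

    def close(self):
        self._log("close")


def run_close_during_fetch(client):
    fetch_thread = threading.Thread(target=client.fetch)
    fetch_thread.start()
    client.fetch_started.wait()
    # Issue close while the fetch may still be running, rather than
    # joining the fetch thread first.
    client.close()
    # Join only after close; the timeout keeps a hang from blocking forever.
    fetch_thread.join(timeout=10)
    return not fetch_thread.is_alive()
```

Joining the fetch thread before calling close() would serialize the two
calls and hide the race; joining afterwards lets them overlap.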
Testing:
- Loop test_cancellation.py
Change-Id: I7c147550f86d81b818ecbdd34cf2919ced7ff8c5
Reviewed-on: http://gerrit.cloudera.org:8080/10601
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/10619
Tested-by: Tim Armstrong <[email protected]>
> Builds are timing out/hanging
> -----------------------------
>
> Key: IMPALA-7101
> URL: https://issues.apache.org/jira/browse/IMPALA-7101
> Project: IMPALA
> Issue Type: Bug
> Reporter: Thomas Tauber-Marshall
> Assignee: Dan Hecht
> Priority: Blocker
> Labels: broken-build
>
> We've seen a large number of builds in the last week or two that appear to
> have hung and gotten killed after a 24-hour timeout.
> Exactly where the hang is occurring is different in each build, but I
> suspect it has something to do with cancellation not working correctly.