[ https://issues.apache.org/jira/browse/IMPALA-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830404#comment-16830404 ]
ASF subversion and git services commented on IMPALA-2990:
---------------------------------------------------------

Commit a73ef68745da6542657892d256e5b1bd29c68803 in impala's branch refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a73ef68 ]

IMPALA-2990: timeout unresponsive queries in coordinator

The coordinator currently waits indefinitely if it does not receive a
status report from a backend. This could cause a query to hang
indefinitely in certain situations, for example if the backend decides
to cancel itself as a result of failed status report rpcs.

This patch adds a thread to ImpalaServer which periodically iterates
over all queries for which that server is the coordinator and cancels
any that haven't had a report from a backend in a certain amount of
time.

This patch adds two flags:
--status_report_max_retry_s: the maximum number of seconds a backend
  will attempt to send status reports before giving up. This is used
  in place of --status_report_max_retries which is now deprecated.
--status_report_cancellation_padding: the coordinator will wait
  --status_report_max_retry_s *
  (1 + --status_report_cancellation_padding / 100)
  before concluding a backend is not responding and cancelling the
  query.

Testing:
- Added a functional test that runs a query that is cancelled through
  the new mechanism.
- Passed a full set of exhaustive tests.

Ran tests on a 10 node cluster loaded with tpch 500:
- Ran the stress test for 1000 queries with the debug actions:
  'REPORT_EXEC_STATUS_DELAY:JITTER@1000'
  Prior to this patch, this setup results in hanging queries. With
  this patch, no hangs were observed.
- Ran perf tests with 4 concurrent streams, 3 iterations per query.
  Found no change in performance.
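
For readers unfamiliar with the mechanism, the timeout rule and the periodic scan described above can be illustrated with a small standalone sketch. This is not the Impala implementation (which is C++ inside ImpalaServer); the names QueryState, cancellation_timeout_s and cancel_unresponsive are hypothetical, and the flag values used in the usage lines are made up for illustration, not the real defaults. The sketch only assumes what the commit message states: the coordinator waits status_report_max_retry_s * (1 + status_report_cancellation_padding / 100) seconds without a report before cancelling.

```python
from dataclasses import dataclass


def cancellation_timeout_s(max_retry_s: float, padding_pct: float) -> float:
    """Seconds the coordinator waits for a report before giving up.

    Mirrors the formula in the commit message:
    max_retry_s * (1 + padding_pct / 100).
    """
    return max_retry_s * (1 + padding_pct / 100)


@dataclass
class QueryState:
    """Hypothetical per-query bookkeeping kept by the coordinator."""
    query_id: str
    # Backend name -> time (in seconds) of its most recent status report.
    last_report_s: dict
    cancelled: bool = False


def cancel_unresponsive(queries, now_s, max_retry_s, padding_pct):
    """One pass of the periodic scan over coordinated queries.

    Marks a query cancelled if any of its backends has been silent for
    longer than the timeout. Returns (query_id, stale_backends) pairs
    for the queries cancelled in this pass.
    """
    timeout = cancellation_timeout_s(max_retry_s, padding_pct)
    cancelled = []
    for query in queries:
        if query.cancelled:
            continue  # already handled in an earlier pass
        stale = sorted(
            backend
            for backend, reported_at in query.last_report_s.items()
            if now_s - reported_at > timeout
        )
        if stale:
            query.cancelled = True
            cancelled.append((query.query_id, stale))
    return cancelled


# Illustrative values only: 100 s retry window with 50% padding.
queries = [
    QueryState("q1", {"be1": 0.0, "be2": 140.0}),
    QueryState("q2", {"be1": 100.0}),
]
result = cancel_unresponsive(queries, now_s=200.0, max_retry_s=100, padding_pct=50)
```

With these made-up values the timeout is 150 seconds, so only the query whose backend "be1" last reported at time 0 is cancelled. The padding exists so that the coordinator outlasts the backend's own retry window: a backend that is still retrying its reports is not declared dead before it has itself given up.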

Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
Reviewed-on: http://gerrit.cloudera.org:8080/12299
Reviewed-by: Thomas Marshall <tmarsh...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> Coordinator should timeout and cancel queries with unresponsive / stuck executors
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-2990
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2990
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>   Affects Versions: Impala 2.3.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>              Labels: hang, observability, supportability
>
> The coordinator currently waits indefinitely if it does not hear back from a
> backend. This could cause a query to hang indefinitely in case of a network
> error, etc.
> We should add logic for determining when a backend is unresponsive and kill
> the query. The logic should mostly revolve around Coordinator::Wait() and
> Coordinator::UpdateFragmentExecStatus() based on whether it receives periodic
> updates from a backend (via FragmentExecState::ReportStatusCb()).