[jira] [Commented] (IMPALA-2990) Coordinator should timeout a connection for an unresponsive backend

ASF subversion and git services (JIRA) Mon, 03 Dec 2018 18:07:41 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708083#comment-16708083
 ]


ASF subversion and git services commented on IMPALA-2990:
---------------------------------------------------------

Commit df4ccf5ddf9340049d193d7e6244c8af88a2dd5c in impala's branch 
refs/heads/master from Michael Ho
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=df4ccf5 ]

IMPALA-6741: Add timestamp of fragment instance's status updates

Currently, the profile of a running query doesn't contain any
timestamps for the last updates from the fragment instances.
This makes it hard to distinguish a fragment instance that
failed to send status reports to the coordinator for various
reasons (e.g. IMPALA-2990) from one that is truly stuck.

This change adds a timestamp to a fragment instance's profile
to record the time when the coordinator last received a status
update from it. Note that there may be a delay between when the
status report is created on the executor and when it arrives at
the coordinator. Because the clocks are not necessarily
synchronized across all executors, the time at which the
coordinator receives the update is easier to interpret.

Sample output:

    Fragment F01:
      Instance 494d948d3235441a:23eae17900000001 (host=???):(Total: 15.099ms, non-child: 263.951us, % non-child: 1.75%)
        Last report received time: 2018-11-27 16:57:30.014
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/1.58 KB
        Fragment Instance Lifecycle Event Timeline: 15.622ms
           - Prepare Finished: 1.026ms (1.026ms)
           - Open Finished: 1.137ms (110.297us)
           - First Batch Produced: 15.010ms (13.873ms)
           - First Batch Sent: 15.080ms (70.715us)
           - ExecInternal Finished: 15.622ms (541.181us)
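
The receive-side timestamping described above can be illustrated with a
minimal sketch. It is not the code from the commit: LastReportTracker,
OnReportReceived, ProfileLine and FormatUnixMillis are hypothetical names,
and formatting in UTC is an assumption made for this example. The point it
shows is that the timestamp is read from the coordinator's clock when the
report arrives, rather than taken from the executor.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <map>
#include <string>

// Formats milliseconds since the Unix epoch as "YYYY-MM-DD HH:MM:SS.mmm".
// Assumption: UTC is used here; std::gmtime is not thread-safe, which is
// acceptable for a sketch but not for production code.
std::string FormatUnixMillis(int64_t unix_millis) {
  assert(unix_millis >= 0);
  std::time_t secs = static_cast<std::time_t>(unix_millis / 1000);
  int millis = static_cast<int>(unix_millis % 1000);
  std::tm tm_utc = *std::gmtime(&secs);
  char date[32];
  std::strftime(date, sizeof(date), "%Y-%m-%d %H:%M:%S", &tm_utc);
  char out[48];
  std::snprintf(out, sizeof(out), "%s.%03d", date, millis);
  return out;
}

// Hypothetical tracker: remembers, per fragment instance, when the
// coordinator last received a status report.
class LastReportTracker {
 public:
  // Called on the coordinator when a report arrives. now_unix_millis is
  // read from the coordinator's clock at receipt, not from the report.
  void OnReportReceived(const std::string& instance_id,
                        int64_t now_unix_millis) {
    last_received_ms_[instance_id] = now_unix_millis;
  }

  // Returns the profile line for an instance, or an empty string if no
  // report has been received from it yet.
  std::string ProfileLine(const std::string& instance_id) const {
    auto it = last_received_ms_.find(instance_id);
    if (it == last_received_ms_.end()) return "";
    return "Last report received time: " + FormatUnixMillis(it->second);
  }

 private:
  std::map<std::string, int64_t> last_received_ms_;
};
```

Because only the coordinator's clock is consulted, timestamps for different
instances stay comparable even when executor clocks drift.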

Change-Id: Iae3dcddc292d694d7003d10ed0caccfceed7d8fa
Reviewed-on: http://gerrit.cloudera.org:8080/12000
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Coordinator should timeout a connection for an unresponsive backend
> -------------------------------------------------------------------
>
>                 Key: IMPALA-2990
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2990
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.3.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>              Labels: hang, observability, supportability
>
> The coordinator currently waits indefinitely if it does not hear back from a 
> backend. This could cause a query to hang indefinitely in case of a network 
> error, etc.
> We should add logic for determining when a backend is unresponsive and kill 
> the query. The logic should mostly revolve around Coordinator::Wait() and 
> Coordinator::UpdateFragmentExecStatus() based on whether it receives periodic 
> updates from a backend (via FragmentExecState::ReportStatusCb()).
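
One way to picture the detection logic requested above is a monitor that
records when each backend last reported and flags any backend that has been
silent for longer than a timeout. The sketch below is only an illustration
under stated assumptions: BackendLivenessMonitor, RecordReport and
FindUnresponsive are hypothetical names, not Impala APIs, and the timeout
value is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical monitor kept by the coordinator for one query.
class BackendLivenessMonitor {
 public:
  explicit BackendLivenessMonitor(std::chrono::milliseconds timeout)
      : timeout_(timeout) {
    assert(timeout.count() > 0);
  }

  // Call once when a backend starts executing, and again each time a
  // status report from it arrives. Registering at start means a backend
  // that never reports at all is still detected.
  void RecordReport(const std::string& backend, Clock::time_point now) {
    last_report_[backend] = now;
  }

  // Returns, in sorted order, the backends whose last report is older
  // than the timeout. The caller would then cancel the query.
  std::vector<std::string> FindUnresponsive(Clock::time_point now) const {
    std::vector<std::string> result;
    for (const auto& entry : last_report_) {
      if (now - entry.second > timeout_) result.push_back(entry.first);
    }
    std::sort(result.begin(), result.end());
    return result;
  }

 private:
  std::chrono::milliseconds timeout_;
  std::unordered_map<std::string, Clock::time_point> last_report_;
};
```

A monotonic clock (std::chrono::steady_clock) is used so that wall-clock
adjustments on the coordinator cannot trigger or mask a timeout.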



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
