[ 
https://issues.apache.org/jira/browse/IMPALA-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-6025:
----------------------------------
    Component/s:     (was: Backend)

> Improve hang diagnostics
> ------------------------
>
>                 Key: IMPALA-6025
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6025
>             Project: IMPALA
>          Issue Type: Epic
>          Components: Distributed Exec
>    Affects Versions: Impala 2.9.0
>            Reporter: Michael Ho
>            Assignee: Lars Volker
>            Priority: Major
>              Labels: observability, supportability
>
> In the past, Impalad users have had a hard time getting diagnostic 
> information when a query hangs. Usually, that involves a rather manual 
> process of determining which fragment instances aren't making progress, 
> generating a stack trace or core dump from that Impalad, and examining it 
> under a debugger. Given the thousands of threads running when multiple 
> queries are active, this is quite time-consuming.
> This JIRA tracks improvement ideas that we can implement to make debugging 
> this kind of issue less painful. Some ideas include:
> - Implement a diagnostic button (analogous to the cancellation button in the 
> UI) to dump diagnostic information (e.g. threads' backtraces, executor 
> nodes' internals, states of data stream senders and receivers, lock 
> information such as the holder's pid) for fragment instances on some or all 
> hosts of a query.
> - Have a watchdog dump backtraces of threads that haven't made progress for 
> a while. This probably doesn't apply to all threads (e.g. idle threads 
> shouldn't trigger any alert).
> - A fragment instance can appear to be making no progress because its parent 
> operator / fragment may be hung (e.g. the probe side of a join will not be 
> able to make much progress until the build side is done, and the build side 
> itself could be another chain of joins). It'd be much easier to resolve this 
> dependency chain programmatically to find the root cause of the cascading 
> delay.
> Please feel free to add more ideas to this JIRA.
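
One way to picture the dependency-chain idea from the last bullet is the small Python sketch below. It is only an illustration, not Impala code: the `waits_on` map, the fragment names (F0, F1, F2) and the `find_stall_roots` helper are all hypothetical, and it assumes the wait-for relationships between fragment instances have already been collected somehow.

```python
from typing import Dict, List, Set


def find_stall_roots(waits_on: Dict[str, List[str]],
                     stalled: Set[str],
                     start: str) -> List[str]:
    """Follow wait-for edges from `start` and return the stalled instances
    that are not themselves waiting on another stalled instance."""
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        blocked_by = [d for d in waits_on.get(node, []) if d in stalled]
        if blocked_by:
            # This instance is only a victim; keep walking down the chain.
            stack.extend(blocked_by)
        elif node in stalled:
            # Stalled, but nothing it waits on is stalled: a root of the delay.
            roots.append(node)
    return sorted(roots)


# Hypothetical query: F0 probes a join built by F1, which is itself a join
# whose build side comes from F2.
waits_on = {
    "F0": ["F1"],
    "F1": ["F2"],
    "F2": [],
}
print(find_stall_roots(waits_on, {"F0", "F1", "F2"}, "F0"))  # prints ['F2']
```

If the wait-for graph contains a cycle of stalled instances, this sketch returns an empty list, which would itself be a hint that the hang is a deadlock rather than a slow leaf.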

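The watchdog idea could look roughly like the Python sketch below. Again this is an assumption-laden illustration rather than a proposal for the actual backend: `ProgressWatchdog` and its methods are hypothetical names, threads are expected to call `report_progress()` themselves, and idle threads opt out via `mark_idle()` so they never trigger an alert. Backtraces are taken with CPython's `sys._current_frames()`.

```python
import sys
import threading
import time
import traceback
from typing import Dict, List, Optional


class ProgressWatchdog:
    """Tracks the last time each registered thread reported progress."""

    def __init__(self, stall_seconds: float) -> None:
        self._stall_seconds = stall_seconds
        self._lock = threading.Lock()
        self._last_progress: Dict[int, float] = {}

    def report_progress(self, now: Optional[float] = None) -> None:
        """Called by a working thread whenever it makes progress."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._last_progress[threading.get_ident()] = ts

    def mark_idle(self) -> None:
        """Called by a thread going idle, so it is no longer watched."""
        with self._lock:
            self._last_progress.pop(threading.get_ident(), None)

    def stalled_threads(self, now: Optional[float] = None) -> List[int]:
        """Return ids of watched threads with no progress for too long."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            return sorted(tid for tid, last in self._last_progress.items()
                          if ts - last > self._stall_seconds)

    def dump_stalled(self, now: Optional[float] = None) -> Dict[int, str]:
        """Return a formatted backtrace for every stalled thread."""
        frames = sys._current_frames()
        return {tid: "".join(traceback.format_stack(frames[tid]))
                for tid in self.stalled_threads(now) if tid in frames}


# Explicit timestamps keep the example deterministic.
watchdog = ProgressWatchdog(stall_seconds=5.0)
watchdog.report_progress(now=100.0)
for tid, backtrace in watchdog.dump_stalled(now=110.0).items():
    print(f"thread {tid} has made no progress for over 5s:\n{backtrace}")
```

In a real deployment a background thread would call `dump_stalled()` periodically; that loop is left out here to keep the sketch short.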


--
This message was sent by Atlassian Jira
(v8.3.4#803005)