Michael Ho created IMPALA-6025:
----------------------------------
Summary: Improve hang diagnostics
Key: IMPALA-6025
URL: https://issues.apache.org/jira/browse/IMPALA-6025
Project: IMPALA
Issue Type: Improvement
Components: Backend, Distributed Exec
Affects Versions: Impala 2.9.0
Reporter: Michael Ho
In the past, users of Impalad have had a hard time getting diagnostic
information when a query hangs. Usually, that involves a rather manual process
of determining which fragment instances aren't making progress, generating a
stack trace or core dump from that Impalad, and examining it under a debugger.
Given the thousands of threads running when multiple queries are active, this
is quite time consuming.
This JIRA tracks improvement ideas which we can implement to make debugging
this kind of issue less painful. Some ideas include:
- implement a diagnostic button (analogous to the cancellation button in the
UI) to dump diagnostic information (e.g. threads' backtraces, executor nodes'
internals, states of data stream senders and receivers, lock information such
as the holder's pid) for fragment instances on some or all hosts of a query.
- have a watchdog dump backtraces of threads which haven't made progress for
a while. This probably shouldn't apply to all threads (e.g. idle threads
shouldn't trigger any alert).
- A fragment instance can appear to be making no progress because its parent
operator / fragment may be hung (e.g. the probe side of a join cannot make
much progress until the build side is done, and the build side itself could be
another chain of joins). It'd be much easier to resolve this dependency chain
programmatically to find the root of the cascade of delay.
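
As a rough illustration of the last idea, here is a minimal sketch in Python
(not Impala code) of walking a wait chain to find where the delay starts. The
InstanceStatus record, its waits_on field and find_root_stalls() are all
hypothetical names made up for this example; it assumes each fragment instance
can report whether it is making progress and which instances it is waiting on.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceStatus:
    """Hypothetical snapshot of one fragment instance (not an Impala type)."""
    instance_id: str
    making_progress: bool
    # IDs of the instances this one is waiting on, e.g. a join's build side.
    waits_on: List[str] = field(default_factory=list)


def find_root_stalls(statuses: Dict[str, InstanceStatus], start_id: str) -> List[str]:
    """Follow the wait chain from start_id and return the stalled instances
    that are not themselves waiting on any other stalled instance."""
    roots = []
    seen = set()
    stack = [start_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # visited already; this also stops us looping on a cycle
        seen.add(current)
        status = statuses[current]
        if status.making_progress:
            continue  # a healthy instance cannot be the cause of the delay
        stalled_deps = [dep for dep in status.waits_on
                        if not statuses[dep].making_progress]
        if stalled_deps:
            stack.extend(stalled_deps)  # the cause lies further down the chain
        else:
            roots.append(current)  # stalled, and nothing it waits on is stalled
    return sorted(roots)


# Made-up example: F0's probe side waits on F1, whose build side waits on
# F2 and F3. Only F3 is making progress.
statuses = {
    "F0": InstanceStatus("F0", False, ["F1"]),
    "F1": InstanceStatus("F1", False, ["F2", "F3"]),
    "F2": InstanceStatus("F2", False),
    "F3": InstanceStatus("F3", True),
}
print(find_root_stalls(statuses, "F0"))  # ['F2']
```

In this sketch only F2 would need a backtrace, since F0 and F1 are just
waiting on it. A cycle of stalled instances yields an empty result here, so a
real implementation would probably want to report such a cycle separately.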
Please feel free to add more ideas to this JIRA.