On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <[email protected]> wrote:
> Hey folks,
>
> I ran into some issues with a local core test run last night. This is on my
> own branch with some uncommitted work, but I haven't touched the backend
> and these issues seem to be backend-focused.
>
> Initially, I noticed the problem because a test run that I started last
> night was still "stuck" this morning. I had three queries which had been
> running for 10 hours and not making progress. The 'fragments' page for the
> query showed that one of the backends had not reported to the coordinator
> for many hours. In attempting to debug this, I managed to get the
> StateStore to declare that node dead, and those queries eventually were
> cancelled due to that.
>
> I resumed the node that I had been debugging with gdb, and it was declared
> live again. I didn't restart the process, though, which might have led to
> further issues:
>
> Next, I also noticed that /queries on my coordinator node showed "Number of
> running queries with fragments on this host" at 100+ on all three nodes,
> even though no queries were running anymore. These numbers are stable. I
> can continue to issue new queries and they complete normally. While a
> query is running, the count of fragments goes up appropriately, and when
> the query finishes, it goes back down. But the "base" numbers are stuck at
> {108, 391, 402} with nothing running.
>
> I also found that the node that had had the original problems has three
> stuck fragments, all waiting on HdfsScanNode::GetNext:
> https://gist.github.com/57a99bf4c00f1575810af924013c259d
> Looking in the logs, I see continuous spew that it's trying to cancel the
> fragment instances for this query, but apparently the cancellation is not
> working. I talked to Michael Ho about this and it sounds like this is a
> known issue with cancellation.
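
For what it's worth, here is a minimal sketch of one way a blocking
GetNext() can shrug off cancellation. The class and member names below are
hypothetical, not the actual HdfsScanNode code: if the wait predicate only
watches the row-batch queue, then setting a cancel flag (and even notifying
the condition variable) never unblocks a consumer that is already asleep.

  #include <condition_variable>
  #include <mutex>
  #include <queue>

  class ScanNodeSketch {
   public:
    // Blocks until a row batch is available. The bug pattern: the wait
    // predicate only looks at the queue, never at cancelled_, so even a
    // notify_all() from Cancel() just re-evaluates the predicate and goes
    // back to sleep while the queue stays empty.
    int GetNext() {
      std::unique_lock<std::mutex> l(lock_);
      ready_.wait(l, [this] { return !batches_.empty(); });
      int batch = batches_.front();
      batches_.pop();
      return batch;
    }

    // Sets the flag and wakes waiters, but GetNext() above never unblocks
    // because its predicate ignores cancelled_. A fix needs both the
    // notify_all() here and a predicate of (!batches_.empty() || cancelled_).
    void Cancel() {
      {
        std::lock_guard<std::mutex> l(lock_);
        cancelled_ = true;
      }
      ready_.notify_all();
    }

   private:
    std::mutex lock_;
    std::condition_variable ready_;
    std::queue<int> batches_;
    bool cancelled_ = false;
  };

Whether that is the actual mechanism here I can't say; it's just the shape
of bug that would produce fragments parked in GetNext while the log keeps
reporting cancellation attempts.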
>
> So, there are still two mysteries:
> - why did it get stuck in the first place?
> * - why are my "number of running queries" counters stuck at non-zero
> values?*
This is definitely a bug; I can reliably reproduce it by running a query
that is rejected by the admission controller. I have filed a JIRA for this
(IMPALA-7516 <https://issues.apache.org/jira/browse/IMPALA-7516>) and will
look into it further.
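
To illustrate the shape of the leak I suspect (hypothetical names below, not
the actual coordinator or admission-control code): if the per-host count is
incremented when the query registers but only decremented on the normal
completion path, a rejection that returns early never gives the count back,
which would match the stuck "base" numbers Todd describes.

  #include <atomic>
  #include <iostream>

  // Hypothetical per-host counter, analogous to "number of running queries
  // with fragments on this host" on the /queries page.
  static std::atomic<int> num_running_queries_on_host{0};

  bool RunQuery(bool admission_controller_rejects) {
    ++num_running_queries_on_host;  // bumped as soon as the query registers
    if (admission_controller_rejects) {
      // Early return on rejection skips the decrement below, so the "base"
      // count stays elevated even though nothing is running.
      return false;
    }
    // ... fragments execute and the query completes ...
    --num_running_queries_on_host;  // only the happy path gives the count back
    return true;
  }

  int main() {
    RunQuery(/*admission_controller_rejects=*/true);
    std::cout << "stuck count: " << num_running_queries_on_host << "\n";  // 1
    return 0;
  }

If that is indeed what's happening, the fix is presumably to release the
count on every exit path (e.g. with a scope guard) rather than only after a
successful run. I'll confirm against the actual code path in the JIRA.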
>
>
> Does anything above ring a bell for anyone?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>