Hey folks,

I ran into some issues with a local core test run last night. This is on my
own branch with some uncommitted work, but I haven't touched the backend
and these issues seem to be backend-focused.

I initially noticed the problem because a test run that I started last
night was still "stuck" this morning: three queries had been running
for 10 hours without making progress. The 'fragments' page for the
query showed that one of the backends had not reported to the
coordinator for many hours. While attempting to debug this, I managed
to get the StateStore to declare that node dead, and those queries were
eventually cancelled as a result.

I resumed the node that I had been debugging with gdb, and it was
declared live again. I didn't restart the process, though, which may
have led to the further issues below:

Next, I also noticed that /queries on my coordinator node showed
"Number of running queries with fragments on this host" at 100+ on all
three nodes, even though no queries were running anymore. These numbers
are stable: I can continue to issue new queries and they complete
normally. While a query is running, the count of fragments goes up
appropriately, and when the query finishes, it goes back down. But the
"base" numbers are stuck at {108, 391, 402} with nothing running.
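
I haven't dug into the coordinator code yet, but a counter leak on an
error or cancellation path would explain a stuck non-zero base count.
Something like this (a minimal sketch; the names and structure are
invented for illustration, not actual Impala code):

    #include <atomic>
    #include <stdexcept>

    // Hypothetical per-host count backing the /queries page.
    std::atomic<int> num_running_fragments{0};

    void DoScanWork() {  // stand-in for the real fragment work
      throw std::runtime_error("backend error");
    }

    void ExecFragmentInstance() {
      num_running_fragments.fetch_add(1);
      DoScanWork();  // if this throws, or the instance is abandoned
                     // during cancellation...
      num_running_fragments.fetch_sub(1);  // ...this decrement never
                                           // runs and the count sticks.
    }

If that's the shape of it, the usual fix is an RAII guard that
decrements in its destructor so every exit path is covered.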

I also found that the node that had the original problems has three
stuck fragments, all waiting in HdfsScanNode::GetNext:
https://gist.github.com/57a99bf4c00f1575810af924013c259d
Looking in the logs, I see continuous spew showing that it's trying to
cancel the fragment instances for this query, but apparently the
cancellation isn't working. I talked to Michael Ho about this, and it
sounds like it's a known issue with cancellation.
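
For what it's worth, here's the class of bug I'd expect given fragments
parked in HdfsScanNode::GetNext while the coordinator keeps sending
cancels: a condition-variable wait whose predicate never checks the
cancellation flag. Again a hypothetical sketch, not the actual backend
code:

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    std::mutex lock_;
    std::condition_variable batch_ready_;
    std::queue<int> batches_;  // stand-in for queued row batches
    bool cancelled_ = false;

    int GetNext() {  // consumer side of the scan node
      std::unique_lock<std::mutex> l(lock_);
      // BUG: the predicate ignores cancelled_, so Cancel() can set the
      // flag and notify, yet this thread re-checks only the queue and
      // goes back to sleep, blocking forever if no batch ever arrives.
      batch_ready_.wait(l, [] { return !batches_.empty(); });
      int batch = batches_.front();
      batches_.pop();
      return batch;
    }

    void Cancel() {
      std::lock_guard<std::mutex> l(lock_);
      cancelled_ = true;
      batch_ready_.notify_all();  // the wakeup happens, but the wait
                                  // predicate never observes the flag
    }

The obvious fix would be to fold cancelled_ into the wait predicate and
return a cancelled status, but I'll defer to whoever owns the known
issue on whether that's actually what's going on here.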

So, there are still two mysteries:
- why did it get stuck in the first place?
- why are my "number of running queries" counters stuck at non-zero values?

Does anything above ring a bell for anyone?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
