FWIW, please see the discussion at https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may render cancellation in the HDFS scan node ineffective. We have seen users report similar backtraces in the past due to hung RPCs to the HDFS name node. Not saying this is necessarily the case here.
A couple of things which may be interesting to look at:
- is there any scan node in the profile which doesn't finish any assigned scan ranges?
- if you happen to have a core, it may help to inspect the stack traces of the scanner threads and the disk IO mgr threads to understand their states.

On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <[email protected]> wrote:
> Hey folks,
>
> I ran into some issues with a local core test run last night. This is on my
> own branch with some uncommitted work, but I haven't touched the backend,
> and these issues seem to be backend-focused.
>
> Initially, I noticed the problem because a test run that I started last
> night was still "stuck" this morning. I had three queries which had been
> running for 10 hours without making progress. The 'fragments' page for the
> query showed that one of the backends had not reported to the coordinator
> for many hours. In attempting to debug this, I managed to get the
> StateStore to declare that node dead, and those queries were eventually
> cancelled as a result.
>
> I resumed the node that I had been debugging with gdb, and it was declared
> live again. I didn't restart the process, though, which might have led to
> further issues.
>
> Next, I also noticed that /queries on my coordinator node showed "Number of
> running queries with fragments on this host" at 100+ on all three nodes,
> even though no queries were running anymore. These numbers are stable. I
> can continue to issue new queries and they complete normally. While
> queries are running, the count of fragments goes up appropriately, and when
> a query finishes, it goes back down. But the "base" numbers are stuck at
> {108, 391, 402} with nothing running.
>
> I also found that the node that had had the original problems has three
> stuck fragments, all waiting on HdfsScanNode::GetNext:
> https://gist.github.com/57a99bf4c00f1575810af924013c259d
>
> Looking in the logs, I see continuous spew that it's trying to cancel the
> fragment instances for this query, but apparently the cancellation is not
> working. I talked to Michael Ho about this, and it sounds like this is a
> known issue with cancellation.
>
> So, there are still two mysteries:
> - why did it get stuck in the first place?
> - why are my "number of running queries" counters stuck at non-zero values?
>
> Does anything above ring a bell for anyone?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Thanks,
Michael
