On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <pooja.nilange...@cloudera.com> wrote:

> Hi Todd,
>
> I believe you are right. There are a couple of other race conditions in the
> HdfsScanNode which have resulted in broken builds in the last couple of
> weeks (IMPALA-7418 and IMPALA-7335). While reviewing the change for
> IMPALA-7335, Tim and Dan suggested that we need to clean up the state
> invariants of the HdfsScanNode to find and fix other such races. I am going
> to create a JIRA to track it; I could add this issue to the list of races
> to be fixed. If this issue occurs more frequently, we could add a
> temporary fix for now. Let me know what you'd suggest.
>

Makes sense. I'm currently trying to put together a simple test and fix
that targets this race in particular, but definitely agree that the
interactions between the scanner threads and the scan node are hard to reason
about and could be improved more generally.

For now I'm trying to use a DebugAction to inject probabilistic failures
into the soft memory limit checks to see if I can reproduce it more easily.
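
A rough sketch of the shape I have in mind (the helper and the option wiring
below are placeholders I'm making up here, not the real DebugAction plumbing):

    #include <random>

    // Sketch only: pretend the soft memory limit was exceeded with the given
    // probability, so the early-exit path in the scanner thread fires often
    // enough to reproduce the race in a test. In a real patch the probability
    // would come from a debug_action-style query option rather than a constant.
    static bool InjectSoftLimitExceeded(double probability) {
      static thread_local std::mt19937 gen{std::random_device{}()};
      return std::bernoulli_distribution(probability)(gen);
    }

    // The scanner-thread check would then become, roughly:
    //   if (!first_thread &&
    //       (mem_tracker_->AnyLimitExceeded(MemLimit::SOFT) ||
    //        InjectSoftLimitExceeded(0.5))) {
    //     break;
    //   }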

-Todd


> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <t...@cloudera.com> wrote:
>
> > I spent another 45 minutes looking at the core and think I figured this
> > out. I found that for the stuck scanner, both of its scanner threads had
> > run and exited, but for some reason it was never marked as "done".
> >
> > I think this might be a regression due to IMPALA-7096
> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> > following code:
> >
> >     // Stop extra threads if we're over a soft limit in order to free up memory.
> >     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
> >       break;
> >     }
> >
> >     // Done with range and it completed successfully
> >     if (progress_.done()) {
> >       // All ranges are finished.  Indicate we are done.
> >       SetDone();
> >       break;
> >     }
> >
> >     if (scan_range == nullptr && num_unqueued_files == 0) {
> >       unique_lock<mutex> l(lock_);
> >       // All ranges have been queued and DiskIoMgr has no more new ranges for this
> >       // scan node to process. This means that every range is either done or being
> >       // processed by another thread.
> >       all_ranges_started_ = true;
> >       break;
> >     }
> >   }
> >
> > What if we have the following scenario:
> >
> > T1) grab scan range 1 and start processing
> >
> > T2) grab scan range 2 and start processing
> >
> > T1) finish scan range 1 and see that 'progress_' is not done()
> > T1) loop around, get no scan range (there are no more), so set
> > all_ranges_started_ and break
> > T1) thread exits
> >
> > T2) finish scan range 2
> > T2) happen to hit a soft memory limit error due to pressure from other exec
> > nodes, etc. Since we aren't the first thread, we break (even though the
> > first thread is no longer running).
> > T2) thread exits
> >
> > Note that no one got to the point of calling SetDone() because we break due
> > to the memory limit error _before_ checking progress_.done().
> >
> > Thus, the query will hang forever.
> >
> > It seems to me that the order of these conditions needs to be rejiggered:
> > the breaking due to memory limit should be the last thing in the loop so
> > that, if we were the thread to finish the last scan range, we properly call
> > SetDone() before exiting.
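> >
> > Concretely, I'm picturing a reordering of the snippet above along these
> > lines (just a sketch of the idea, not a tested patch):
> >
> >     // Done with range and it completed successfully
> >     if (progress_.done()) {
> >       // All ranges are finished.  Indicate we are done.
> >       SetDone();
> >       break;
> >     }
> >
> >     if (scan_range == nullptr && num_unqueued_files == 0) {
> >       unique_lock<mutex> l(lock_);
> >       all_ranges_started_ = true;
> >       break;
> >     }
> >
> >     // Check the soft limit last: if this thread just finished the final
> >     // range, SetDone() has already been called above before we bail out.
> >     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
> >       break;
> >     }
> >   }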
> >
> > Does that make sense? If so, I'll file a JIRA. Suggestions are also welcome
> > on how to write a test for this (do we have an easy facility to inject
> > probabilistic mem-limit failures in that code path?)
> >
> > -Todd
> >
> >
> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <k...@cloudera.com> wrote:
> >
> > > FWIW, please see the discussion at
> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
> > > render cancellation in the HDFS scan node ineffective. We have seen users
> > > report similar backtraces in the past due to hung RPCs to the HDFS name
> > > node. Not saying this is necessarily the case here.
> > >
> > > A couple of things which may be interesting to look at:
> > > - is there any scan node in the profile which doesn't finish any assigned
> > > scan ranges?
> > > - if you happen to have a core, it may help to inspect the stack traces of
> > > the scanner threads and the disk io mgr threads to understand their states.
> > >
> > > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <t...@cloudera.com> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I ran into some issues with a local core test run last night. This is on
> > > > my own branch with some uncommitted work, but I haven't touched the
> > > > backend and these issues seem to be backend-focused.
> > > >
> > > > Initially, I noticed the problem because a test run that I started last
> > > > night was still "stuck" this morning. I had three queries which had been
> > > > running for 10 hours and not making progress. The 'fragments' page for the
> > > > query showed that one of the backends had not reported to the coordinator
> > > > for many hours. In attempting to debug this, I managed to get the
> > > > StateStore to declare that node dead, and those queries eventually were
> > > > cancelled due to that.
> > > >
> > > > I resumed the node that I had been debugging with gdb, and it was
> > > > declared live again. I didn't restart the process, though, which might
> > > > have led to further issues:
> > > >
> > > > Next, I also noticed that /queries on my coordinator node showed "Number
> > > > of running queries with fragments on this host" at 100+ on all three
> > > > nodes, even though no queries were running anymore. These numbers are
> > > > stable. I can continue to issue new queries and they complete normally.
> > > > While a query is running, the count of fragments goes up appropriately,
> > > > and when the query finishes, it goes back down. But, the "base" numbers
> > > > are stuck at {108, 391, 402} with nothing running.
> > > >
> > > > I also found that the node that had had the original problems has three
> > > > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > > > Looking in the logs, I see continuous spew that it's trying to cancel the
> > > > fragment instances for this query, but apparently the cancellation is not
> > > > working. I talked to Michael Ho about this and it sounds like this is a
> > > > known issue with cancellation.
> > > >
> > > > So, there are still two mysteries:
> > > > - why did it get stuck in the first place?
> > > > - why are my "number of running queries" counters stuck at non-zero
> > > > values?
> > > >
> > > > Does anything above ring a bell for anyone?
> > > >
> > > > -Todd
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> > >
> > > --
> > > Thanks,
> > > Michael
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
