Hakim,

It looks like when we hit an OOM in one fragment, we fail the queries for
all running fragments that share the same network channel. At higher
concurrency, a single query could potentially cause hundreds of other
queries to fail.
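
To make the blast radius concrete, here is a sketch of the mechanism (made-up
names, not Drill's actual RPC code): one channel carries the pending acks for
fragments of many different queries, so closing it fails all of them at once.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical model of a single RPC connection between two drillbits.
    class DataChannel {
        // Pending acks keyed by fragment id. Fragments belonging to
        // *different* queries can all share this one channel.
        private final Map<String, CompletableFuture<Void>> pendingAcks =
                new ConcurrentHashMap<>();

        CompletableFuture<Void> send(String fragmentId) {
            CompletableFuture<Void> ack = new CompletableFuture<>();
            pendingAcks.put(fragmentId, ack);
            return ack;  // completed when the remote side acknowledges
        }

        // When the channel closes (e.g. after an OOM on the remote node),
        // every fragment still waiting for an ack fails, and with it the
        // query it belongs to, regardless of which query caused the OOM.
        void close() {
            pendingAcks.values().forEach(ack ->
                    ack.completeExceptionally(new IllegalStateException("channel closed")));
            pendingAcks.clear();
        }
    }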

Jinfeng,

You are right, it does not make sense to accept new requests when the
drillbit hits an OOM.
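
One way to formalize that (pure sketch, hypothetical names and API, not the
actual allocator) would be an admission check against the drillbit's direct
memory headroom before a new fragment is accepted:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of per-drillbit admission control for incoming fragments.
    class AdmissionControl {
        private final long directMemoryLimit;  // e.g. the -XX:MaxDirectMemorySize value
        private final AtomicLong reserved = new AtomicLong();

        AdmissionControl(long directMemoryLimit) {
            this.directMemoryLimit = directMemoryLimit;
        }

        // Called when the drillbit receives a new fragment to run.
        boolean admit(long estimatedFragmentMemory) {
            long newTotal = reserved.addAndGet(estimatedFragmentMemory);
            if (newTotal > directMemoryLimit) {
                reserved.addAndGet(-estimatedFragmentMemory);
                return false;  // reject up front instead of failing unrelated queries later
            }
            return true;
        }

        void release(long fragmentMemory) {
            reserved.addAndGet(-fragmentMemory);
        }
    }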

- Rahul

On Fri, May 13, 2016 at 5:07 PM, Jinfeng Ni <[email protected]> wrote:

> When a drillbit runs into OOM and has not yet recovered (released the
> allocated memory), I don't see how that drillbit could continue to serve
> the other concurrent queries, unless those queries need no direct memory
> at all. On the other hand, if the cause of a query failure is not OOM, it
> makes sense not to let that failure fail all the other running queries.
>
>
>
> On Fri, May 13, 2016 at 4:53 PM, rahul challapalli
> <[email protected]> wrote:
> > @jinfeng & @hakim
> >
> > For (1) I will raise a jira.
> >
> > For (2) I am arguing that we shouldn't fail the other concurrent queries
> > when one query hits an OOM, especially when the fragments belonging to
> > the other queries themselves succeeded (need to check on this). We should
> > handle the OOM case in a better way, one where we do not end up closing
> > the channel. Thoughts?
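> >
> > Something along these lines, as a pure sketch with made-up types (catching
> > OutOfMemoryError here is only to illustrate the idea): fail just the
> > fragment whose allocation failed and leave the shared channel open.
> >
> >     interface Fragment { void fail(Throwable cause); }
> >
> >     class FragmentAllocator {
> >         // Turn an allocation failure into a failure of the requesting
> >         // fragment only, rather than an error that escapes and closes
> >         // the channel that other queries are sharing.
> >         static java.nio.ByteBuffer allocateFor(Fragment fragment, int size) {
> >             try {
> >                 return java.nio.ByteBuffer.allocateDirect(size);
> >             } catch (OutOfMemoryError e) {
> >                 fragment.fail(e);  // only this fragment's query fails
> >                 return null;       // channel stays open for everyone else
> >             }
> >         }
> >     }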
> >
> > - Rahul
> >
> > On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <[email protected]> wrote:
> >
> >> 1. You are right, the root allocator should prevent an allocation that
> >> exceeds the total available memory, but I'm not sure all allocations in
> >> the RPC layer go through Drill's accountor. Also, Netty's internal
> >> fragmentation could cause this issue even though we are still below our
> >> memory limit.
> >> 2. Unfortunately, when a channel is closed we don't get back most of the
> >> acknowledgements for the messages that were sent through that channel,
> >> and we are forced to fail any query that's still waiting for an ack from
> >> that channel. The more queries run in parallel, the greater the chance
> >> that a large number of them will be affected by this.
> >>
> >> On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <[email protected]> wrote:
> >>
> >> > 1. This looks like a bug with the allocator, unless there is a reason
> >> > for not enforcing a limit (total direct memory available) on the memory
> >> > allocated to all the fragments.
> >> > 2. This looks like a bigger problem, as we are unnecessarily failing all
> >> > the other queries as a result of one fragment causing OOM. It would make
> >> > sense if the drillbit were unresponsive after a fragment hit an OOM, but
> >> > I was able to connect to that specific drillbit after the failures and
> >> > ran the same failing queries successfully.
> >> >
> >> > - Rahul
> >> >
> >> > On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <[email protected]> wrote:
> >> >
> >> > > 1. You are getting this error because the Drillbit is running out of
> >> > > direct memory. It's thrown by Netty when it couldn't allocate a new
> >> > > chunk of direct memory from the system. I know the allocator will
> >> > > enforce each query's limit, but I'm not sure we actually compute those
> >> > > limits so that they cannot exceed the total direct memory limit (see
> >> > > the rough numbers sketched below).
> >> > > 2. When we hit a channel closed exception, all fragments that were
> >> > > transmitting on that channel will most likely fail even though they
> >> > > didn't run out of memory. It's hard to tell where the memory went
> >> > > without more information about the queries you were trying to run.
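> >> > >
> >> > > As a rough illustration of (1), with made-up numbers (not what Drill
> >> > > actually computes):
> >> > >
> >> > >     class MemoryMath {
> >> > >         public static void main(String[] args) {
> >> > >             long maxDirect  = 8L * 1024 * 1024 * 1024;  // e.g. -XX:MaxDirectMemorySize=8g
> >> > >             long perQuery   = 2L * 1024 * 1024 * 1024;  // hypothetical per-query limit
> >> > >             int  concurrent = 10;                       // queries running at once
> >> > >
> >> > >             long worstCase = (long) concurrent * perQuery;  // 20 GB
> >> > >             // Every query stays under its own limit, yet together they
> >> > >             // can ask Netty for more direct memory than the JVM allows.
> >> > >             System.out.println(worstCase > maxDirect);  // true
> >> > >         }
> >> > >     }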
> >> > >
> >> > > On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <[email protected]> wrote:
> >> > >
> >> > > > Drillers,
> >> > > >
> >> > > > I was executing 20 queries using 10 concurrent clients on an 8-node
> >> > > > cluster. The first 10 queries succeed and the remaining 10 fail with
> >> > > > "ChannelClosedException". The logs suggest that all the fragments
> >> > > > running on one node hit "java.lang.OutOfMemoryError: Direct buffer
> >> > > > memory". Two questions here:
> >> > > >    1. Can someone explain why we are even seeing this error?
> >> > > >       Shouldn't the allocator detect this condition upfront?
> >> > > >    2. Why did all the fragments fail? Where did the memory go?
> >> > > >
> >> > > > - Rahul
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
>
