@jinfeng & @hakim

For (1) I will raise a jira.
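To make (1) concrete, here is a minimal sketch of the kind of check I would
expect the root allocator to perform (the class and field names are made up
for illustration; this is not Drill's actual allocator code):

    // Hypothetical sketch only -- not Drill's actual allocator code.
    // Idea for (1): the root allocator tracks total reserved direct memory and
    // rejects any reservation that would push the sum past the JVM's direct
    // memory ceiling, so child (query/fragment) allocators can never
    // collectively promise more than the system can actually hand out.
    import java.util.concurrent.atomic.AtomicLong;

    public class RootLimitSketch {
      private final long maxDirect;               // e.g. the -XX:MaxDirectMemorySize value
      private final AtomicLong reserved = new AtomicLong();

      public RootLimitSketch(long maxDirect) {
        this.maxDirect = maxDirect;
      }

      /** Returns false instead of letting the allocation reach Netty and blow up. */
      public boolean tryReserve(long bytes) {
        while (true) {
          long cur = reserved.get();
          if (cur + bytes > maxDirect) {
            return false;                         // fail only the requesting query
          }
          if (reserved.compareAndSet(cur, cur + bytes)) {
            return true;
          }
        }
      }

      public void release(long bytes) {
        reserved.addAndGet(-bytes);
      }
    }

Of course, as you point out below, Netty's internal fragmentation can still
push real usage above whatever we account for, so a check like this only
helps for allocations that actually go through the accountor.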

For (2), I am arguing that we shouldn't fail the other concurrent queries
when one query hits an OOM, especially when the fragments belonging to those
other queries themselves succeeded (I still need to verify this). We should
handle the OOM case more gracefully so that we do not end up closing the
channel.
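As a rough, hypothetical sketch of what I mean (this is not the actual Drill
RPC code; the interfaces below are stand-ins): catch the allocation failure
while preparing the outbound message, fail only the owning fragment, and
leave the shared channel open so the other queries keep their pending acks.

    // Hypothetical sketch only -- Fragment and RpcChannel are stand-ins,
    // not Drill's actual RPC classes.
    public class SendSketch {

      interface Fragment {
        void fail(Throwable cause);     // fails only this fragment's query
      }

      interface RpcChannel {
        void send(byte[] payload);      // shared channel; stays open on failure
      }

      static byte[] allocateSendBuffer(int size) {
        // stand-in for the direct-buffer allocation that currently throws
        // "java.lang.OutOfMemoryError: Direct buffer memory"
        return new byte[size];
      }

      public static void sendBatch(RpcChannel channel, Fragment fragment, int size) {
        final byte[] buf;
        try {
          buf = allocateSendBuffer(size);
        } catch (OutOfMemoryError e) {
          fragment.fail(e);             // only the offending query is failed ...
          return;                       // ... and the shared channel is never closed
        }
        channel.send(buf);
      }
    }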
Thoughts?
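
Also, since it's hard to tell where the memory went without more details
about the workload: the pattern was simply 10 concurrent clients submitting
20 queries in total. Below is a minimal sketch of one way to drive that
pattern (the JDBC URL and query text are placeholders, not the actual test
harness or queries):

    // Minimal reproduction sketch of the concurrency pattern only;
    // the JDBC URL and query text are placeholders, not the real workload.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ConcurrencySketch {
      public static void main(String[] args) throws Exception {
        final String url = "jdbc:drill:zk=<zk-host>:2181";        // placeholder
        ExecutorService pool = Executors.newFixedThreadPool(10);  // 10 concurrent clients
        for (int i = 0; i < 20; i++) {                            // 20 queries overall
          final int queryId = i;
          pool.execute(() -> {
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
              stmt.executeQuery("/* placeholder query " + queryId + " */ SELECT 1");
            } catch (Exception e) {
              System.err.println("query " + queryId + " failed: " + e);
            }
          });
        }
        pool.shutdown();
      }
    }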

- Rahul

On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <[email protected]>
wrote:

> 1. You are right, the root allocator should prevent an allocation that
> exceeds the total available memory, but I'm not sure all allocations in
> the RPC layer go through Drill's accountor. Also, Netty's internal
> fragmentation could cause this issue even while we are still below our
> memory limit.
> 2. Unfortunately, when a channel is closed we don't get back most of the
> acknowledgements for the messages that were sent through it, and we are
> forced to fail any query that is still waiting for an ack from that
> channel. The more queries run in parallel, the higher the chance that a
> large number of them will be affected.
>
> On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <
> [email protected]> wrote:
>
> > 1. This looks like a bug in the allocator, unless there is a reason for
> > not enforcing a limit (the total direct memory available) on the memory
> > allocated to all the fragments.
> > 2. This looks like a bigger problem, as we are unnecessarily failing all
> > the other queries because one fragment caused an OOM. It would make sense
> > if the drillbit were unresponsive after a fragment hit an OOM, but I was
> > able to connect to that specific drillbit after the failures and ran the
> > same failing queries successfully.
> >
> > - Rahul
> >
> > On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <
> > [email protected]>
> > wrote:
> >
> > > 1. You are getting this error because the Drillbit is running out of
> > > direct memory. It's thrown by Netty when it cannot allocate a new chunk
> > > of direct memory from the system. I know that, for each query, the
> > > allocator will enforce the query's limit, but I'm not sure we actually
> > > compute those limits properly so that they don't exceed the total
> > > direct memory limit.
> > > 2. When we hit a channel closed exception, all fragments that were
> > > transmitting on that channel will most likely fail even though they
> > > didn't run out of memory. It's hard to tell where the memory went
> > > without more information about the queries you were trying to run.
> > >
> > > On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <
> > > [email protected]> wrote:
> > >
> > > > Drillers,
> > > >
> > > > I was executing 20 queries using 10 concurrent clients on an 8-node
> > > > cluster. The first 10 queries succeeded and the remaining 10 failed
> > > > with "ChannelClosedException". The logs suggested that all the
> > > > fragments running on one node hit a "java.lang.OutOfMemoryError:
> > > > Direct buffer memory". Two questions here:
> > > >    1. Can someone explain why we are even seeing this error?
> > > > Shouldn't the allocator detect this condition upfront?
> > > >    2. Why did all the fragments fail? Where did the memory go?
> > > >
> > > > - Rahul
> > > >
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
>
