For (2), my understanding is that we only fail fragments that are still
running. If you can confirm that we are indeed failing finished fragments,
please file a Jira.
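
To make the distinction concrete, here is a minimal sketch of the behavior I
am describing. It is not Drill's actual RPC code: the class and method names
(PendingAcks, sent, acked, channelClosed) are hypothetical and exist only for
illustration. Each channel tracks the queries that still have unacknowledged
messages, and when the channel closes only those queries are failed. A query
whose messages were all acked before the close is left alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; these names are not part of Drill's code base.
final class PendingAcks {
  // queryId -> number of messages sent on this channel that are not acked yet
  private final Map<String, Integer> outstanding = new HashMap<>();

  // Record that a message for this query was sent on the channel.
  synchronized void sent(String queryId) {
    outstanding.merge(queryId, 1, Integer::sum);
  }

  // Record an ack; the entry is dropped once the last message is acked.
  synchronized void acked(String queryId) {
    outstanding.computeIfPresent(queryId, (id, n) -> n > 1 ? n - 1 : null);
  }

  // On channel close, return the queries still waiting for an ack.
  // Their acks can never arrive, so these are the ones that must fail.
  synchronized List<String> channelClosed() {
    List<String> failed = new ArrayList<>(outstanding.keySet());
    outstanding.clear();
    return failed;
  }
}
```

With this bookkeeping, a query that called sent() and then received its
acked() before the close would not appear in the list returned by
channelClosed().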

On Fri, May 13, 2016 at 4:53 PM, rahul challapalli <
[email protected]> wrote:

> @jinfeng & @hakim
>
> For (1) I will raise a jira.
>
> For (2) I am arguing that we shouldn't fail the other concurrent queries
> when one query hits an OOM, especially when the fragments related to other
> queries themselves succeeded (need to check on this). We should handle the
> OOM case in a better way where we do not end up closing the channel.
> Thoughts?
>
> - Rahul
>
> On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <
> [email protected]>
> wrote:
>
> > 1. You are right, the root allocator should prevent an allocation that
> > exceeds the total available memory, but I'm not sure if all allocations
> > in the rpc layer go through Drill's accountor. Also, Netty's internal
> > fragmentation could cause this issue even though we are still below our
> > memory limit.
> > 2. Unfortunately, when a channel is closed we don't get back most of the
> > acknowledgements for the messages that were sent through that channel,
> > and we are forced to fail any query that's still waiting for an ack from
> > that channel. The more queries are running in parallel, the higher the
> > chance that a large number of them will be affected by this.
> >
> > On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <
> > [email protected]> wrote:
> >
> > > 1. This looks like a bug with the allocator unless there is a reason for
> > > not enforcing a limit (total direct memory available) on the memory
> > > allocated to all the fragments.
> > > 2. This looks like a bigger problem, as we are unnecessarily failing all
> > > the other queries as a result of one fragment causing an OOM. It would
> > > make sense if the drillbit were unresponsive after a fragment hit an OOM.
> > > But I was able to connect to that specific drillbit after the failures
> > > and ran the same failing queries successfully.
> > >
> > > - Rahul
> > >
> > > On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <
> > > [email protected]>
> > > wrote:
> > >
> > > > 1. You are getting this error because the Drillbit is running out of
> > > > direct memory. It's thrown by Netty when it can't allocate a new chunk
> > > > of direct memory from the system. I know that for each query, the
> > > > allocator will enforce the query's limit. But I'm not sure we actually
> > > > compute those limits properly so that they don't exceed the total
> > > > direct memory limit.
> > > > 2. When we hit a channel closed exception, all fragments that were
> > > > transmitting on that channel will most likely fail even though they
> > > > didn't run out of memory. It's hard to tell where the memory went
> > > > without more information about the queries you were trying to run.
> > > >
> > > > On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <
> > > > [email protected]> wrote:
> > > >
> > > > > Drillers,
> > > > >
> > > > > I was executing 20 queries using 10 concurrent clients on an 8 node
> > > > > cluster. The first 10 queries succeeded and the remaining 10 queries
> > > > > failed with "ChannelClosedException". The logs suggested that all the
> > > > > fragments running on one node hit a "java.lang.OutOfMemoryError:
> > > > > Direct buffer memory". 2 questions here.
> > > > >    1. Can someone explain why we are even seeing this error? Shouldn't
> > > > > the allocator detect this condition upfront?
> > > > >    2. Why did all the fragments fail? Where did the memory go?
> > > > >
> > > > > - Rahul
> > > > >
>
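
On (1), the node-wide limit Rahul is asking about could be sketched as
follows. This is only an illustration under my own assumptions, not Drill's
allocator API: RootLimit, tryReserve, release and allocatedBytes are
hypothetical names. Every reservation is checked against one shared total, so
the sum over all fragments can never exceed the configured maximum. Note that
it only protects allocations that actually go through it; anything that
bypasses it, or memory lost to Netty's internal fragmentation, would still not
be accounted for.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not Drill's allocator API.
final class RootLimit {
  private final long maxBytes;                       // total direct memory budget
  private final AtomicLong allocated = new AtomicLong();

  RootLimit(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  // Reserve bytes for any fragment. Returns false, without reserving anything,
  // when the node-wide total would exceed the budget.
  boolean tryReserve(long bytes) {
    while (true) {
      long current = allocated.get();
      long next = current + bytes;
      if (next > maxBytes) {
        return false;
      }
      // Retry if another thread changed the total in the meantime.
      if (allocated.compareAndSet(current, next)) {
        return true;
      }
    }
  }

  // Give back bytes that were previously reserved.
  void release(long bytes) {
    allocated.addAndGet(-bytes);
  }

  long allocatedBytes() {
    return allocated.get();
  }
}
```

A caller would treat a false return from tryReserve() as a failure of that one
allocation, which is the "detect this condition upfront" behavior, rather than
letting the request reach the system and fail there.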



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
