For (2), my understanding is that we fail fragments that are still running. If you can confirm that we are indeed failing fragments that had already finished, then please file a Jira.
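
As a rough illustration of (2), here is a minimal sketch of the bookkeeping described in this thread: when a channel closes, only the queries that still have unacknowledged messages on that channel are failed. This is not Drill's actual RPC code; the class ChannelAckTracker and its methods are hypothetical names made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical tracker for illustration only; not part of Drill. */
class ChannelAckTracker {
    // channel id -> (query id -> number of sent messages not yet acknowledged)
    private final Map<String, Map<String, Integer>> pending = new HashMap<>();

    /** Record that a query sent one message over the given channel. */
    void sent(String queryId, String channel) {
        pending.computeIfAbsent(channel, c -> new HashMap<>())
               .merge(queryId, 1, Integer::sum);
    }

    /** Record one acknowledgement; the entry is dropped when nothing is outstanding. */
    void acked(String queryId, String channel) {
        Map<String, Integer> byQuery = pending.get(channel);
        if (byQuery == null) {
            return;
        }
        // Returning null from the remapping function removes the mapping.
        byQuery.computeIfPresent(queryId, (q, n) -> n > 1 ? n - 1 : null);
    }

    /** On channel close, return the queries that must fail: those still awaiting an ack. */
    Set<String> channelClosed(String channel) {
        Map<String, Integer> byQuery = pending.remove(channel);
        return byQuery == null ? new TreeSet<>() : new TreeSet<>(byQuery.keySet());
    }

    public static void main(String[] args) {
        ChannelAckTracker tracker = new ChannelAckTracker();
        tracker.sent("q1", "node-3");   // q1 is still waiting for its ack
        tracker.sent("q2", "node-3");
        tracker.acked("q2", "node-3");  // q2 has nothing outstanding on node-3
        tracker.sent("q3", "node-5");   // q3 uses a different channel
        System.out.println(tracker.channelClosed("node-3")); // prints [q1]
    }
}
```

In this sketch a query whose messages were all acknowledged before the close is left alone, which is the behavior I would expect for finished fragments.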

On Fri, May 13, 2016 at 4:53 PM, rahul challapalli <[email protected]> wrote:

> @jinfeng & @hakim
>
> For (1) I will raise a jira.
>
> For (2) I am arguing that we shouldn't fail the other concurrent queries
> when one query hits an OOM, especially when the fragments related to other
> queries themselves succeeded (need to check on this). We should handle the
> OOM case in a better way where we do not end up closing the channel.
> Thoughts?
>
> - Rahul
>
> On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <[email protected]> wrote:
>
> > 1. you are right, the root allocator should prevent an allocation that
> > exceeds the total available memory, but I'm not sure if all allocations
> > in the rpc layer go through Drill's accountor. Also Netty internal
> > fragmentation could cause this issue even though we are still below our
> > memory limit.
> > 2. unfortunately, when a channel is closed we don't get back most of the
> > acknowledgements for the messages that were sent through that channel,
> > and we are forced to fail any query that's still waiting for an ack from
> > that channel. The more queries are running in parallel, the more chances
> > a large number of them will be affected by this.
> >
> > On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <[email protected]> wrote:
> >
> > > 1. This looks like a bug with the allocator unless there is a reason
> > > for not enforcing a limit (total direct memory available) on the memory
> > > allocated to all the fragments
> > > 2. This looks like a bigger problem as we are unnecessarily failing all
> > > the other queries as a result of one fragment causing OOM. It makes
> > > sense if the drillbit was un-responsive after a fragment hit an OOM.
> > > But I was able to connect to that specific drillbit after the failures
> > > and ran the same failing queries successfully.
> > >
> > > - Rahul
> > >
> > > On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <[email protected]> wrote:
> > >
> > > > 1. you are getting this error because the Drillbit is running out of
> > > > direct memory. It's thrown by Netty when it couldn't allocate a new
> > > > chunk of direct memory from the system. I know for each query, the
> > > > allocator will enforce the query's limit. But I'm not sure we actually
> > > > properly compute those limits to not exceed the total direct memory
> > > > limit.
> > > > 2. when we hit a channel closed exception, all fragments that were
> > > > transmitting on that channel will most likely fail even though they
> > > > didn't run out of memory. It's hard to tell where the memory went
> > > > without more information about the queries you were trying to run
> > > >
> > > > On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <[email protected]> wrote:
> > > >
> > > > > Drillers,
> > > > >
> > > > > I was executing 20 queries using 10 concurrent clients on an 8 node
> > > > > cluster. First 10 queries succeed and the remaining 10 queries fail
> > > > > with "ChannelClosedException". The logs suggested that all the
> > > > > fragments running on one node hit an "java.lang.OutOfMemoryError:
> > > > > Direct buffer memory". 2 questions here.
> > > > > 1. Can someone explain why we are even seeing this error. Shouldn't
> > > > > the allocator detect this condition upfront?
> > > > > 2. Why did all the fragments fail. Where did the memory go?
> > > > >
> > > > > - Rahul

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
