Hakim, it looks like if we hit an OOM in one fragment, we fail the queries related to all running fragments that are using the same network channel. At higher concurrency, a single query could potentially cause hundreds of other queries to fail.
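[Editorial note: to make the fan-out concrete, here is a minimal sketch of the pattern being described, assuming outstanding acks for many queries are keyed to one shared Netty channel. ChannelAckTracker and its methods are hypothetical names, not Drill's actual RPC classes.]

    import io.netty.channel.Channel;

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch (not Drill's RPC layer): acks for many queries share one channel.
    public class ChannelAckTracker {

      private final Map<String, CompletableFuture<Void>> pendingAcks = new ConcurrentHashMap<>();

      public ChannelAckTracker(Channel sharedChannel) {
        // If the shared channel closes (e.g. after a direct-memory OOM on the sending side),
        // every query still waiting for an ack on it is failed at once.
        sharedChannel.closeFuture().addListener(future ->
            pendingAcks.forEach((queryId, ack) ->
                ack.completeExceptionally(
                    new IllegalStateException("channel closed before ack for query " + queryId))));
      }

      // Called when a batch is sent for a query; the returned future completes on ack.
      public CompletableFuture<Void> expectAck(String queryId) {
        return pendingAcks.computeIfAbsent(queryId, id -> new CompletableFuture<>());
      }

      // Called when the remote side acknowledges the batch.
      public void ackReceived(String queryId) {
        CompletableFuture<Void> ack = pendingAcks.remove(queryId);
        if (ack != null) {
          ack.complete(null);
        }
      }
    }

[In a setup like this, the number of queries failed by one close scales with whatever happens to be in flight on the shared channel, which matches the "hundreds of other queries" concern above.]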
Jinfeng, you are right, it does not make sense to accept new requests when the drillbit hits an OOM.

- Rahul

On Fri, May 13, 2016 at 5:07 PM, Jinfeng Ni <[email protected]> wrote:
> When a drillbit runs into OOM and has not yet recovered from it (i.e. released the allocated memory), I don't see how that drillbit could continue to serve the other concurrent queries, unless those queries need no direct memory at all. On the other hand, if the cause of the query failure is not OOM, it makes sense not to let that failure fail all the other running queries.
>
> On Fri, May 13, 2016 at 4:53 PM, rahul challapalli <[email protected]> wrote:
>> @jinfeng & @hakim
>>
>> For (1) I will raise a jira.
>>
>> For (2) I am arguing that we shouldn't fail the other concurrent queries when one query hits an OOM, especially when the fragments related to the other queries themselves succeeded (need to check on this). We should handle the OOM case in a better way where we do not end up closing the channel. Thoughts?
>>
>> - Rahul
>>
>> On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <[email protected]> wrote:
>>> 1. You are right, the root allocator should prevent an allocation that exceeds the total available memory, but I'm not sure all allocations in the rpc layer go through Drill's accountor. Also, Netty internal fragmentation could cause this issue even though we are still below our memory limit.
>>> 2. Unfortunately, when a channel is closed we don't get back most of the acknowledgements for the messages that were sent through that channel, and we are forced to fail any query that is still waiting for an ack from that channel. The more queries are running in parallel, the greater the chance that a large number of them will be affected by this.
>>>
>>> On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <[email protected]> wrote:
>>>> 1. This looks like a bug with the allocator, unless there is a reason for not enforcing a limit (the total direct memory available) on the memory allocated to all the fragments.
>>>> 2. This looks like a bigger problem, as we are unnecessarily failing all the other queries as a result of one fragment causing OOM. It would make sense if the drillbit were unresponsive after a fragment hit an OOM, but I was able to connect to that specific drillbit after the failures and ran the same failing queries successfully.
>>>>
>>>> - Rahul
>>>>
>>>> On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <[email protected]> wrote:
>>>>> 1. You are getting this error because the Drillbit is running out of direct memory. It's thrown by Netty when it couldn't allocate a new chunk of direct memory from the system. For each query, the allocator will enforce the query's limit, but I'm not sure we actually compute those limits so that they don't exceed the total direct memory limit.
>>>>> 2. When we hit a channel closed exception, all fragments that were transmitting on that channel will most likely fail even though they didn't run out of memory.
>>>>> It's hard to tell where the memory went without more information about the queries you were trying to run.
>>>>>
>>>>> On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <[email protected]> wrote:
>>>>>> Drillers,
>>>>>>
>>>>>> I was executing 20 queries using 10 concurrent clients on an 8-node cluster. The first 10 queries succeeded and the remaining 10 failed with "ChannelClosedException". The logs suggested that all the fragments running on one node hit a "java.lang.OutOfMemoryError: Direct buffer memory". Two questions here:
>>>>>> 1. Can someone explain why we are even seeing this error? Shouldn't the allocator detect this condition upfront?
>>>>>> 2. Why did all the fragments fail? Where did the memory go?
>>>>>>
>>>>>> - Rahul
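[Editorial note: for question 1, the gap the thread circles around is that per-query limits only help if their sum is also checked against the process-wide direct memory budget. Below is a minimal sketch of that kind of fail-fast root accounting, assuming a single configured root limit; RootDirectMemoryAccountant and its methods are invented names, not Drill's actual allocator or "accountor".]

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch (invented names, not Drill's allocator): a single
    // process-wide budget that every per-query direct-memory reservation must pass.
    public class RootDirectMemoryAccountant {

      private final long rootLimitBytes;        // e.g. whatever -XX:MaxDirectMemorySize allows
      private final AtomicLong reservedBytes = new AtomicLong();

      public RootDirectMemoryAccountant(long rootLimitBytes) {
        this.rootLimitBytes = rootLimitBytes;
      }

      // Reserve memory for one query fragment. Returning false lets the caller fail just
      // that query, instead of Netty discovering the shortfall with an OutOfMemoryError.
      public boolean tryReserve(long bytes) {
        while (true) {
          long current = reservedBytes.get();
          long next = current + bytes;
          if (next > rootLimitBytes) {
            return false;                      // would exceed the total direct memory budget
          }
          if (reservedBytes.compareAndSet(current, next)) {
            return true;
          }
        }
      }

      public void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
      }
    }

[With a check like this, the query whose reservation pushes the total over the limit can fail on its own, rather than Netty throwing "java.lang.OutOfMemoryError: Direct buffer memory" and taking the shared channel, and every query waiting on it, down with it.]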
