Hi,

Sorry it took so long; I lost the original picture and had to fall back on my MS Paint training. Here we go:
https://github.com/fmethot/imagine/blob/master/affinity_factor.png

On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

The mailing list does not seem to allow for images. Can you put the image elsewhere (Github or Dropbox), and reply with a link to it?

- Sudheesh

On Oct 19, 2016, at 5:37 PM, François Méthot <fmetho...@gmail.com> wrote:

We had the problem on the 220-node cluster. No problem on the 12-node cluster.

I agree that the data may not be distributed evenly. It would be a long and tedious process for me to produce a report.

Here is a drawing of the fragments overview before and after the change of the affinity factor, on a sample query run on the 220-node cluster. max_width_per_node=8 on both, but it turned out to be irrelevant to the issue.

Before: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [server121:31010].
After: the error is gone.

Before: low disk IO, high network IO on the bottom part of the graph.
After: high disk IO, low network IO on the bottom part of the graph.

On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:

Hi Francois,

It would be good to understand how increasing affinity_factor helped in your case, so we can better document it and also use that knowledge to improve things in a future release.

If you have two clusters, it is not clear whether you had the problem on the 12-node cluster, the 220-node cluster, or both. Is the dataset the same on both? Is max_width_per_node=8 in both clusters?

Increasing the affinity factor will lower remote reads by scheduling more fragments/doing more work on nodes which have data available locally. So there seems to be some kind of non-uniform data distribution for sure. It would be good if you can provide more details, i.e. how the data is distributed in the cluster and how the load on the nodes changed when the affinity factor was increased.

Thanks,
Padma

On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com> wrote:

We have a 12-node cluster and a 220-node cluster, but they do not talk to each other. So Padma's analysis does not apply, but thanks for your comments. Our goal had been to run Drill on the 220-node cluster after it proved worthy on the small cluster.

planner.width.max_per_node was eventually reduced to 2 when we were trying to figure this out; it would still fail. After we figured out the affinity_factor, we put it back to its original value and it would work fine.

Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger cluster.

Potential improvements:
- planner.affinity_factor should be better documented.
- When ZK disconnected, the running queries systematically failed. When we disabled the ForemanException thrown in the QueryManager.drillbitUnregistered method, most of our queries started to run successfully; we would sometimes get a "Drillbit Disconnected" error within the RPC work bus. It confirmed that we still had something going on with our network, but it also showed that the RPC bus between drillbits was more resilient to network hiccups.
I could not prove it, but I think that under certain conditions the ZK session gets recreated, which causes a QueryManager unregister call (failing the query) followed immediately by a register call, while the RPC bus remains connected.

We really appreciate your feedback and we hope to contribute to this great project in the future.
Thanks
Francois

On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:

Seems like you have 215 nodes, but the data for your query is there on only 12 nodes. Drill tries to distribute the scan fragments across the cluster more uniformly (trying to utilize all CPU resources). That is why you have a lot of remote reads going on, and increasing the affinity factor eliminates running scan fragments on the other (215-12) nodes.

You also mentioned planner.width.max_per_node is set to 8. So, with the increased affinity factor, you have 8 scan fragments doing a lot more work on these 12 nodes. Still, you got a 10X improvement. Seems like your network is the obvious bottleneck. Is it 10G or 1G?

Also, increasing the affinity factor helped in your case because there is no data on the other nodes. But if you have data non-uniformly distributed across more nodes, you might still have the problem.

Thanks,
Padma

On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

Thank you for posting your findings! Glad to see a 10X improvement.

By increasing the affinity factor, it looks like Drill's parallelizer is forced to assign fragments on nodes with data, i.e. with high favorability for data locality.

Regarding the random disconnection, I agree with your guess that the network bandwidth is being used up by remote reads, which causes lags in drillbit-to-ZooKeeper heartbeats (since these services use the same network)? Maybe others can comment here.

Thank you,
Sudheesh

On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We finally got rid of this error. We tried many, many things (like modifying Drill to ignore the error!); it ultimately came down to this change:

from the default
planner.affinity_factor=1.2
to
planner.affinity_factor=100

Basically, this encourages fragments to only care about locally stored files. We looked at the code that uses that property and figured that 100 would have a strong impact.

What led us to this property is the fact that 1/4 of our fragments would take a lot more time to complete their scan, up to 10x the time of the fastest nodes. On the slower nodes, Cloudera Manager would show very low Disk IOPS with high Network IO compared to our faster nodes. We had noticed that before but figured it would be some optimization to be done later, once more pressing issues were fixed, like ZK disconnection and OOM.
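In case it helps anyone trying the same thing: the change itself is a one-line option update and takes effect without a restart. Roughly (a sketch from memory, so double-check the option name on your version):

    ALTER SYSTEM SET `planner.affinity_factor` = 100;
    -- optional sanity check: the option should now show up as changed at the SYSTEM level
    SELECT * FROM sys.options WHERE name LIKE 'planner.affinity%';

ALTER SESSION works the same way if you only want to experiment on a single connection first.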
We were desperate and decided to fix anything that looked unusual.

After this change, queries ran up to 10x faster.

We no longer get random disconnections between nodes and ZooKeeper.

We are still wondering why exactly. The network should not be a bottleneck. Could high network traffic between a Drillbit and HDFS cause the Drillbit to time out with ZooKeeper?

On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com> wrote:

After the 30-second gap, all the Drill nodes receive the following:

2016-09-26 20:07:38,629 [Curator-ServiceCache-0] DEBUG Active drillbit set changed. Now includes 220 total bits. New active drillbits...
...faulty node is not on the list...
2016-09-26 20:07:38,897 [Curator-ServiceCache-0] DEBUG Active drillbit set changed. Now includes 221 total bits. New active drillbits...
...faulty node is back on the list...

So the faulty Drill node gets unregistered and then registered again right after.

Drill uses the low-level API for registering and unregistering, and the only place where unregistering occurs is when the drillbit is closed at shutdown.

That particular drillbit is still up and running after those logs, so it could not have triggered the unregistering process through a shutdown.

Would you have an idea what else could cause a Drillbit to be unregistered from the DiscoveryService and registered again right after?

We are using Zookeeper 3.4.5.

On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We have switched to 1.8 and we are still getting node disconnections.

We did many tests. We initially thought our standalone Parquet converter was generating Parquet files with problematic data (like 10K-character strings), but we were able to reproduce it with the employee data from the tutorial.

For example, we duplicated the Drill tutorial "Employee" data to reach 500 M records spread over 130 Parquet files. Each file is ~60 MB.

We ran this query over and over on 5 different sessions using a script:
select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does not exist%';

The query returns no rows and takes ~35 to 45 seconds to complete.

Leaving the script running on each node, we eventually hit the "nodes lost connectivity during query" error.
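A quick way to watch the drop-out from the coordinator's point of view is to poll the list of registered drillbits from another session while the script loops; a minimal sketch, assuming the sys.drillbits system table behaves the way I remember on this version:

    -- lists the drillbits currently registered with ZooKeeper;
    -- the faulty node should briefly disappear from this list when the error fires
    SELECT * FROM sys.drillbits;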
On the node that failed, we see these logs:

2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report: RUNNING
2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010

<--- 30-second gap for that fragment --->

2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 2 took longer than 500 ms. Actual duration was 23617ms.

2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment manager : ...uuid...

For the same query on a working node:

2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report: RUNNING
2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet

We are investigating what could cause that 30-second gap for that fragment.

Any ideas, let us know.

Thanks
Francois

On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

If I add a selection filter so that no rows are returned, the same problem occurs. I also simplified the query to include only a few integer columns.

That particular data repo is ~200+ billion records spread over ~50 000 parquet files.

We have other CSV data repos that are 100x smaller and do not trigger this issue.

+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
There is also a bizarre case where the node that is reported as lost is the node itself.
Yes, the stack trace is from the ticket. It did occur once or twice (in the many, many attempts) that it was the node itself.

+ Is there a spike in memory usage of the Drillbit that is the Foreman for the query (process memory, not just heap)?
We don't notice any unusual spike; each node gets busy in the same range when the query is running.

I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, and did not see any improvement.

We will update from 1.7 to 1.8 before going ahead with more investigation.

Thanks a lot.

On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

A simple query with only projections is not an "ideal" use case, since Drill is bound by how fast the client can consume records. There are 1000 scanners sending data to 1 client (vs far fewer scanners sending data in the 12-node case).

This might increase the load on the Drillbit that is the Foreman for the query. In the query profile, the scanners should be spending a lot more time "waiting" to send records to the client (via the root fragment).
+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There is also a bizarre case where the node that is reported as lost is the node itself.
+ Is there a spike in memory usage of the Drillbit that is the Foreman for the query (process memory, not just heap)?

Regarding the warnings ...

2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual duration was 16053ms.

RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this case, and it is resolved in the latest version of Drill. So, as Chun suggested, upgrade the cluster to the latest version of Drill.

2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual duration was 981ms.

I am surprised that responses are taking that long to handle.
+ Are both messages on the same Drillbit?

The other warnings can be ignored.
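To confirm what each drillbit is actually running after the upgrade, the version table can be queried; a small sketch, assuming sys.version is present on your release:

    -- shows the Drill version and the build commit of the connected drillbit
    SELECT * FROM sys.version;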
Thank you,
Sudheesh

[1] I just realized that atsqa4-133.qa.lab is in one of our test environments :)
[2] https://issues.apache.org/jira/browse/DRILL-4766

On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

+ Does the query involve any aggregations or filters? Or is this a select query with only projections?
Simple query with only projections.

+ Any suspicious timings in the query profile?
Nothing especially different from our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one (".." are placeholders): Message of mode .. of rpc type .. took longer than ..ms. Actual duration was ..ms.
Well, we do see this warning on the failing node (on my last test); I found this WARNING in our logs for the past month on pretty much every node I checked.
2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual duration was 981ms.

We really appreciate your help. I will dig into the source code to find out when and why this error happens.

Francois

P.S.:
We also see this:
2016-09-19 14:48:23,444 [drill-executor-9] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the work bus.

On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

More questions..

+ Can you share the query profile?
I will sum it up:
It is a select on 18 columns: 9 strings, 9 integers.
The scan is done on 13862 parquet files spread over 1000 fragments.
Fragments are spread across 215 nodes.

So ~5 leaf fragments (or scanners) per Drillbit seems fine.

+ Does the query involve any aggregations or filters? Or is this a select query with only projections?
+ Any suspicious timings in the query profile?
+ Any suspicious warning messages in the logs around the time of failure on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one (".." are placeholders): Message of mode .. of rpc type .. took longer than ..ms. Actual duration was ..ms.

Thank you,
Sudheesh

On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

+ How many zookeeper servers in the quorum?
The quorum has 3 servers; everything looks healthy.

+ What is the load on atsqa4-133.qa.lab when this happens? Any other applications running on that node? How many threads is the Drill process using?
The load on the failing node (8 cores) is 14 when Drill is running, which is nothing out of the ordinary according to our admin.
HBase is also running.
planner.width.max_per_node is set to 8.

+ When running the same query on 12 nodes, is the data size the same?
Yes.

+ Can you share the query profile?
I will sum it up:
It is a select on 18 columns: 9 strings, 9 integers.
The scan is done on 13862 parquet files spread over 1000 fragments.
Fragments are spread across 215 nodes.

We are in the process of increasing our ZooKeeper session timeout config to see if it helps.

Thanks

Francois

On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

Few questions:
+ How many zookeeper servers in the quorum?
+ What is the load on atsqa4-133.qa.lab when this happens? Any other applications running on that node? How many threads is the Drill process using?
+ When running the same query on 12 nodes, is the data size the same?
+ Can you share the query profile?

This may not be the right thing to do, but for now, if the cluster is heavily loaded, increase the ZK timeout.
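If I remember right, the session timeout is the drill.exec.zk.timeout boot option in drill-override.conf, and changing it requires a drillbit restart; treat the exact key name as an assumption and verify it first. The current value can be checked with something like this sketch:

    -- boot-time options, including the ZooKeeper client settings, are exposed through sys.boot
    SELECT * FROM sys.boot WHERE name LIKE '%zk%';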
Thank you,
Sudheesh

On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com> wrote:

We are running 1.7.
The logs were taken from the JIRA tickets.

We will try out 1.8 soon.

On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:

Looks like you are running 1.5. I believe there has been some work done in that area and the newer releases should behave better.

On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We are trying to find a solution/workaround to this issue:

2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325> ForemanException: One or more nodes lost connectivity during query

Has anyone experienced this issue?

It happens when running a query involving many parquet files on a cluster of 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.

It is not caused by garbage collection (checked on both the ZK node and the involved drillbit).

The negotiated max session timeout is 40 seconds.

The sequence seems to be:
- A Drill query begins, using an existing ZK session.
- The Drill ZK session times out
  - perhaps it was writing something that took too long
- Drill attempts to renew the session
- Drill believes that the write operation failed, so it attempts to re-create the ZK node, which triggers another exception.

We are open to any suggestion. We will report any findings.

Thanks
Francois