We had the problem on the 220-node cluster. No problem on the 12-node cluster.
I agree that the data may not be distributed evenly. It would be a long and tedious process for me to produce a report. Here is a drawing of the fragments overview before and after the change to the affinity factor, for a sample query run on the 220-node cluster. max_width_per_node=8 in both cases, but it turned out to be irrelevant to the issue. [image: Inline image 1]
Before: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [server121:31010].
After: the error is gone.
Before: low disk IO, high network IO on the bottom part of the graph.
After: high disk IO, low network IO on the bottom part of the graph.
On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <ppenumar...@maprtech.com > wrote: > Hi Francois, > > It would be good to understand how increasing affinity_factor helped in > your case > so we can better document it and also use that knowledge to improve things in > a future release. > > If you have two clusters, it is not clear whether you had the problem on > the 12 node cluster > or the 220 node cluster, or both. Is the dataset the same on both ? Is > max_width_per_node=8 in both clusters ? > > Increasing the affinity factor will lower remote reads by scheduling more > fragments/doing more work > on nodes which have data available locally. So, there seems to be some > kind of non-uniform > data distribution for sure. It would be good if you can provide more > details, i.e. how the data is > distributed in the cluster and how the load on the nodes changed when > the affinity factor was increased. > > Thanks, > Padma > > > > On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com> > wrote: > > > > We have a 12 nodes cluster and a 220 nodes cluster, but they do not talk > > to each other. So Padma's analysis does not apply, but thanks for your > > comments. Our goal had been to run Drill on the 220 nodes cluster after > it > > proved worthy of it on the small cluster. > > > > planner.width.max_per_node was eventually reduced to 2 when we were > trying > > to figure this out; it would still fail. After we figured out the > > affinity_factor, we put it back to its original value and it would work > > fine. > > > > > > > > Sudheesh: Indeed, the ZK/Drill services use the same network on our > bigger > > cluster. > > > > Potential improvements: > > - planner.affinity_factor should be better documented. > > - When ZK disconnected, the running queries systematically failed. When > we > > disabled the ForemanException thrown in the > > QueryManager.drillbitUnregistered method, most of our queries started to run > successfully; > > we would sometimes get a Drillbit Disconnected error within the RPC work > bus. > > It confirmed that we still had something going on on our network, but > it > > also showed that the RPC bus between drillbits was more resilient to > > network hiccups. I could not prove it, but I think that under certain > conditions, > > the ZK session gets recreated, which causes a Query Manager unregister > > (query fails) and a register call right after, while the RPC > > bus remains connected. > > > > > > We really appreciate your feedback and we hope to contribute to this > great > > project in the future. > > Thanks > > Francois > > > > > > > > > > > > > > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy < > ppenumar...@maprtech.com> > > wrote: > > > >> > >> Seems like you have 215 nodes, but the data for your query is there on > >> only 12 nodes.
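(A minimal sketch, for anyone who wants to try the same change: the two options discussed above are regular Drill options, so they can be set from any SQL client at either system or session scope; one of each is shown purely for illustration. The values are simply the ones reported in this thread, not recommendations, and the verification query assumes the sys.options system table available in 1.x releases.)

-- Prefer scheduling scan fragments on nodes that hold the parquet blocks locally.
-- 100 is the value reported in this thread; the default is 1.2.
ALTER SYSTEM SET `planner.affinity_factor` = 100;

-- Cap on fragments per node; 8 is the value used in this thread.
ALTER SESSION SET `planner.width.max_per_node` = 8;

-- Confirm what is currently in effect.
SELECT * FROM sys.options
WHERE name IN ('planner.affinity_factor', 'planner.width.max_per_node');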
> >> Drill tries to distribute the scan fragments across the cluster more > >> uniformly (trying to utilize all CPU resources). > >> That is why you have lot of remote reads going on and increasing > affinity > >> factor eliminates running scan > >> fragments on the other (215-12) nodes. > >> > >> you also mentioned planner.width.max_per_node is set to 8. > >> So, with increased affinity factor, you have 8 scan fragments doing a > lot > >> more work on these 12 nodes. > >> Still, you got 10X improvement. Seems like your network is the obvious > >> bottleneck. Is it a 10G or 1G ? > >> > >> Also, increasing affinity factor helped in your case because there is no > >> data on other nodes. > >> But, if you have data non uniformly distributed across more nodes, you > >> might still have the problem. > >> > >> Thanks, > >> Padma > >> > >>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> > >> wrote: > >>> > >>> Hi Francois, > >>> > >>> Thank you for posting your findings! Glad to see a 10X improvement. > >>> > >>> By increasing affinity factor, looks like Drill’s parallelizer is > forced > >> to assign fragments on nodes with data i.e. with high favorability for > data > >> locality. > >>> > >>> Regarding the random disconnection, I agree with your guess that the > >> network bandwidth is being used up by remote reads which causes lags in > >> drillbit to ZooKeeper heartbeats (since these services use the same > >> network)? Maybe others can comment here. > >>> > >>> Thank you, > >>> Sudheesh > >>> > >>>> On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> > >> wrote: > >>>> > >>>> Hi, > >>>> > >>>> We finally got rid of this error. We have tried many, many things > (like > >>>> modifying drill to ignore the error!), it ultimately came down to this > >>>> change: > >>>> > >>>> from default > >>>> planner.affinity_factor=1.2 > >>>> to > >>>> planner.affinity_factor=100 > >>>> > >>>> Basically this encourages fragment to only care about locally stored > >> files. > >>>> We looked at the code that used that property and figured that 100 > would > >>>> have strong impact. > >>>> > >>>> What led us to this property is the fact that 1/4 of our fragments > would > >>>> take a lot more time to complete their scan, up to 10x the time of > the > >>>> fastest nodes. On the slower nodes, Cloudera Manager would show very > >> low > >>>> Disk IOPS with high Network IO compare to our faster nodes. We had > >> noticed > >>>> that before but figured it would be some optimization to be done later > >> when > >>>> more pressing issue would be fixed, like Zk disconnection and OOM. We > >> were > >>>> desperate and decided to fix anything that would look unusual. > >>>> > >>>> After this change, query ran up to 10x faster. > >>>> > >>>> We no longer get random disconnection between node and Zookeeper. > >>>> > >>>> We are still wondering why exactly. Network should not be a > bottleneck. > >>>> Could high network traffic between a Drillbit and HDFS causes the > >> Drillbit > >>>> to timeout with zookeeper? > >>>> > >>>> > >>>> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com > > > >>>> wrote: > >>>> > >>>>> After the 30 seconds gap, all the Drill nodes receives the following: > >>>>> > >>>>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active > drillbit > >> set > >>>>> changed. Now includes 220 total bits. New Active drill bits.... > >>>>> ...faulty node is not on the list... 
> >>>>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active > drillbit > >> set > >>>>> changed. Now includes 221 total bits. New Active drill bits.... > >>>>> ...faulty node is back on the list... > >>>>> > >>>>> > >>>>> So the faulty Drill node get unregistered and registered right after. > >>>>> > >>>>> Drill is using the low level API for registering and unregistering, > and > >>>>> the only place with unregistering occurs is when the drillbit is > >> closed at > >>>>> shutdown. > >>>>> > >>>>> That particular drillbit is still up and running after those log, it > >> could > >>>>> not have trigger the unregistering process through a shutdown. > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> Would you have an idea what else could cause a Drillbit to be > >> unregistered > >>>>> from the DiscoveryService and registered again right after? > >>>>> > >>>>> > >>>>> > >>>>> We are using Zookeeper 3.4.5 > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot < > fmetho...@gmail.com > >>> > >>>>> wrote: > >>>>> > >>>>>> Hi, > >>>>>> > >>>>>> We have switched to 1.8 and we are still getting node disconnection. > >>>>>> > >>>>>> We did many tests, we thought initially our stand alone parquet > >> converter > >>>>>> was generating parquet files with problematic data (like 10K > >> characters > >>>>>> string), but we were able to reproduce it with employee data from > the > >>>>>> tutorial. > >>>>>> > >>>>>> For example, we duplicated the Drill Tutorial "Employee" data to > >> reach > >>>>>> 500 M records spread over 130 parquet files. > >>>>>> Each files is ~60 MB. > >>>>>> > >>>>>> > >>>>>> We ran this query over and over on 5 different sessions using a > >> script: > >>>>>> select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like > '%does > >>>>>> not exist%'; > >>>>>> > >>>>>> Query return no rows and would take ~35 to 45 seconds to return. > >>>>>> > >>>>>> Leaving the script running on each node, we eventually hit the > "nodes > >>>>>> lost connectivity during query" error. > >>>>>> > >>>>>> One the done that failed, > >>>>>> > >>>>>> We see those log: > >>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO > >>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to > >> report: > >>>>>> RUNNING > >>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG > >>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on > server064:31010 > >>>>>> > >>>>>> <--- 30 seconds gap for that fragment ---> > >>>>>> > >>>>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control. > >> ControlServer > >>>>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. > >> Actual > >>>>>> duration was 23617ms. > >>>>>> > >>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG > >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out > >> of 0, > >>>>>> remaining: 0 incoming schema BatchSchema [...., > >> selectionVector=TWO_BYTE] > >>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG > >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing > >> fragment > >>>>>> manager : ...uuid... 
> >>>>>> > >>>>>> > >>>>>> > >>>>>> For the same query on a working node: > >>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO > >>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to > report: > >>>>>> RUNNING > >>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG > >>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on > server125:31010 > >>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG > >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out > >> of 0, > >>>>>> remaining: 0 incoming schema BatchSchema [...., > >> selectionVector=TWO_BYTE] > >>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG > >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out > >> of 0, > >>>>>> remaining: 0 incoming schema BatchSchema [...., > >> selectionVector=TWO_BYTE] > >>>>>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG > >>>>>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row > >>>>>> groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> We are investigating what could get cause that 30 seconds gap for > that > >>>>>> fragment. > >>>>>> > >>>>>> Any idea let us know > >>>>>> > >>>>>> Thanks > >>>>>> Francois > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot < > fmetho...@gmail.com > >>> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi Sudheesh, > >>>>>>> > >>>>>>> If I add selection filter so that no row are returned, the same > >>>>>>> problem occur. I also simplified the query to include only few > >> integer > >>>>>>> columns. > >>>>>>> > >>>>>>> That particular data repo is ~200+ Billions records spread over ~50 > >> 000 > >>>>>>> parquet files. > >>>>>>> > >>>>>>> We have other CSV data repo that are 100x smaller that does not > >> trigger > >>>>>>> this issue. > >>>>>>> > >>>>>>> > >>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this > >> case? > >>>>>>> There is also a bizarre case where the node that is reported as > lost > >> is the > >>>>>>> node itself. > >>>>>>> Yes, the stack trace is from the ticket, It did occurred once or > >> twice > >>>>>>> (in the many many attempts) that it was the node itself. > >>>>>>> > >>>>>>> + Is there a spike in memory usage of the Drillbit this is the > >> Foreman > >>>>>>> for the query (process memory, not just heap)? > >>>>>>> We don't notice any unusual spike, each nodes gets busy in the same > >>>>>>> range when query is running. > >>>>>>> > >>>>>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not > see > >>>>>>> any improvement. > >>>>>>> > >>>>>>> > >>>>>>> We will update from 1.7 to 1.8 before going ahead with more > >>>>>>> investigation. > >>>>>>> > >>>>>>> Thanks a lot. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam < > >> skat...@maprtech.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Francois, > >>>>>>>> > >>>>>>>> A simple query with only projections is not an “ideal” use case, > >> since > >>>>>>>> Drill is bound by how fast the client can consume records. There > >> are 1000 > >>>>>>>> scanners sending data to 1 client (vs far fewer scanners sending > >> data in > >>>>>>>> the 12 node case). 
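(A hedged diagnostic related to the point above about the client being the bottleneck: with the hdfs.tmp.`PARQUET_EMPLOYEE` test set mentioned earlier in the thread, comparing a plain projection against a server-side aggregate separates scan time from the time spent streaming rows back to the single client; only the second query keeps the client out of the data path.)

-- All scanners stream their rows back to one client; elapsed time includes client consumption.
SELECT full_name FROM hdfs.tmp.`PARQUET_EMPLOYEE`;

-- Same parquet scan, but only a single count is returned, so elapsed time is
-- dominated by the scan fragments themselves.
SELECT COUNT(*) FROM hdfs.tmp.`PARQUET_EMPLOYEE`;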
> >>>>>>>> > >>>>>>>> This might increase the load on the Drillbit that is the Foreman > for > >>>>>>>> the query. In the query profile, the scanners should be spending a > >> lot more > >>>>>>>> time “waiting” to send records to the client (via root fragment). > >>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this > >> case? > >>>>>>>> There is also a bizarre case where the node that is reported as > >> lost is the > >>>>>>>> node itself. > >>>>>>>> + Is there a spike in memory usage of the Drillbit this is the > >> Foreman > >>>>>>>> for the query (process memory, not just heap)? > >>>>>>>> > >>>>>>>> Regarding the warnings ... > >>>>>>>> > >>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN > >>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST > of > >>>>>>>> rpc type > >>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms. > >>>>>>>> > >>>>>>>> > >>>>>>>> RPC type 6 is a cancellation request; DRILL-4766 [2] should help > in > >>>>>>>> this case, which is resolved in the latest version of Drill. So as > >> Chun > >>>>>>>> suggested, upgrade the cluster to the latest version of Drill. > >>>>>>>> > >>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN > >>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE > of > >>>>>>>> rpc type > >>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms. > >>>>>>>> > >>>>>>>> I am surprised that responses are taking that long to handle. > >>>>>>>> + Are both messages on the same Drillbit? > >>>>>>>> > >>>>>>>> The other warnings can be ignored. > >>>>>>>> > >>>>>>>> Thank you, > >>>>>>>> Sudheesh > >>>>>>>> > >>>>>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test > >>>>>>>> environments :) > >>>>>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766 < > >>>>>>>> https://issues.apache.org/jira/browse/DRILL-4766> > >>>>>>>> > >>>>>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot < > fmetho...@gmail.com> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi Sudheesh, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> + Does the query involve any aggregations or filters? Or is this > a > >>>>>>>> select > >>>>>>>>> query with only projections? > >>>>>>>>> Simple query with only projections > >>>>>>>>> > >>>>>>>>> + Any suspicious timings in the query profile? > >>>>>>>>> Nothing specially different than our working query on our small > >>>>>>>> cluster. > >>>>>>>>> > >>>>>>>>> + Any suspicious warning messages in the logs around the time of > >>>>>>>> failure on > >>>>>>>>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially > >> this > >>>>>>>> one > >>>>>>>>> (“..” are place holders): > >>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual > >>>>>>>> duration > >>>>>>>>> was ..ms. > >>>>>>>>> > >>>>>>>>> Well we do see this warning on the failing node (on my last > test), > >> I > >>>>>>>> found > >>>>>>>>> this WARNING in our log for the past month for pretty much every > >> node > >>>>>>>> I > >>>>>>>>> checked. > >>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN > >>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST > of > >>>>>>>> rpc type > >>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms. > >>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN > >>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE > of > >>>>>>>> rpc type > >>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms. > >>>>>>>>> > >>>>>>>>> We really appreciate your help. 
I will dig in the source code for > >>>>>>>> when and > >>>>>>>>> why this error happen. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Francois > >>>>>>>>> > >>>>>>>>> P.S.: > >>>>>>>>> We do see this also: > >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN > >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not > >> found > >>>>>>>> in the > >>>>>>>>> work bus. > >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN > >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not > >> found > >>>>>>>> in the > >>>>>>>>> work bus. > >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN > >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not > >> found > >>>>>>>> in the > >>>>>>>>> work bus. > >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN > >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not > >> found > >>>>>>>> in the > >>>>>>>>> work bus. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam < > >>>>>>>> skat...@maprtech.com> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Francois, > >>>>>>>>>> > >>>>>>>>>> More questions.. > >>>>>>>>>> > >>>>>>>>>>> + Can you share the query profile? > >>>>>>>>>>> I will sum it up: > >>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers. > >>>>>>>>>>> Scan is done on 13862 parquet files spread on 1000 fragments. > >>>>>>>>>>> Fragments are spread accross 215 nodes. > >>>>>>>>>> > >>>>>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine. > >>>>>>>>>> > >>>>>>>>>> + Does the query involve any aggregations or filters? Or is > this a > >>>>>>>> select > >>>>>>>>>> query with only projections? > >>>>>>>>>> + Any suspicious timings in the query profile? > >>>>>>>>>> + Any suspicious warning messages in the logs around the time of > >>>>>>>> failure > >>>>>>>>>> on any of the drillbits? Specially on atsqa4-133.qa.lab? > Specially > >>>>>>>> this one > >>>>>>>>>> (“..” are place holders): > >>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual > >>>>>>>>>> duration was ..ms. > >>>>>>>>>> > >>>>>>>>>> Thank you, > >>>>>>>>>> Sudheesh > >>>>>>>>>> > >>>>>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot < > >> fmetho...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hi Sudheesh, > >>>>>>>>>>> > >>>>>>>>>>> + How many zookeeper servers in the quorum? > >>>>>>>>>>> The quorum has 3 servers, everything looks healthy > >>>>>>>>>>> > >>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any > >> other > >>>>>>>>>>> applications running on that node? How many threads is the > Drill > >>>>>>>> process > >>>>>>>>>>> using? > >>>>>>>>>>> The load on the failing node(8 cores) is 14, when Drill is > >> running. > >>>>>>>> Which > >>>>>>>>>>> is nothing out of the ordinary according to our admin. > >>>>>>>>>>> HBase is also running. > >>>>>>>>>>> planner.width.max_per_node is set to 8 > >>>>>>>>>>> > >>>>>>>>>>> + When running the same query on 12 nodes, is the data size > same? > >>>>>>>>>>> Yes > >>>>>>>>>>> > >>>>>>>>>>> + Can you share the query profile? > >>>>>>>>>>> I will sum it up: > >>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers. > >>>>>>>>>>> Scan is done on 13862 parquet files spread on 1000 fragments. > >>>>>>>>>>> Fragments are spread accross 215 nodes. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> We are in process of increasing our Zookeeper session timeout > >>>>>>>> config to > >>>>>>>>>> see > >>>>>>>>>>> if it helps. 
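(A hedged aside on the timeout experiment mentioned just above: the ZooKeeper client settings live in drill-override.conf rather than in sys.options, and, assuming your Drill build exposes the sys.boot system table with a name column like sys.options has, you can confirm from SQL which values each drillbit actually booted with before and after the change.)

-- List the ZooKeeper-related boot options in effect on the drillbit you are connected to,
-- e.g. drill.exec.zk.timeout; the exact option set depends on the Drill version.
SELECT * FROM sys.boot WHERE name LIKE '%zk%';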
> >>>>>>>>>>> > >>>>>>>>>>> thanks > >>>>>>>>>>> > >>>>>>>>>>> Francois > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam < > >>>>>>>> skat...@maprtech.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Francois, > >>>>>>>>>>>> > >>>>>>>>>>>> Few questions: > >>>>>>>>>>>> + How many zookeeper servers in the quorum? > >>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any > >>>>>>>> other > >>>>>>>>>>>> applications running on that node? How many threads is the > Drill > >>>>>>>> process > >>>>>>>>>>>> using? > >>>>>>>>>>>> + When running the same query on 12 nodes, is the data size > >> same? > >>>>>>>>>>>> + Can you share the query profile? > >>>>>>>>>>>> > >>>>>>>>>>>> This may not be the right thing to do, but for now, If the > >> cluster > >>>>>>>> is > >>>>>>>>>>>> heavily loaded, increase the zk timeout. > >>>>>>>>>>>> > >>>>>>>>>>>> Thank you, > >>>>>>>>>>>> Sudheesh > >>>>>>>>>>>> > >>>>>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot < > >>>>>>>> fmetho...@gmail.com> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> We are running 1.7. > >>>>>>>>>>>>> The log were taken from the jira tickets. > >>>>>>>>>>>>> > >>>>>>>>>>>>> We will try out 1.8 soon. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang < > >> cch...@maprtech.com> > >>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Looks like you are running 1.5. I believe there are some > work > >>>>>>>> done in > >>>>>>>>>>>> that > >>>>>>>>>>>>>> area and the newer release should behave better. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot < > >>>>>>>>>> fmetho...@gmail.com> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We are trying to find a solution/workaround to issue: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR > >>>>>>>>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: > >>>>>>>>>> ForemanException: > >>>>>>>>>>>>>>> One more more nodes lost connectivity during query. > >> Identified > >>>>>>>> nodes > >>>>>>>>>>>>>>> were [atsqa4-133.qa.lab:31010]. > >>>>>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM > >> ERROR: > >>>>>>>>>>>>>>> ForemanException: One more more nodes lost connectivity > >> during > >>>>>>>> query. > >>>>>>>>>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010]. > >>>>>>>>>>>>>>> at org.apache.drill.exec.work.for > >>>>>>>> eman.Foreman$ForemanResult. > >>>>>>>>>>>>>>> close(Foreman.java:746) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work. > >> foreman.Foreman$StateSwitch. > >>>>>>>>>>>>>>> processEvent(Foreman.java:858) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work. > >> foreman.Foreman$StateSwitch. > >>>>>>>>>>>>>>> processEvent(Foreman.java:790) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work. > >> foreman.Foreman$StateSwitch. > >>>>>>>>>>>>>>> moveToState(Foreman.java:792) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work. 
> >> foreman.Foreman.moveToState( > >>>>>>>>>>>>>>> Foreman.java:909) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work. > >> foreman.Foreman.access$2700( > >>>>>>>>>>>>>>> Foreman.java:110) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> at org.apache.drill.exec.work.for > >>>>>>>> eman.Foreman$StateListener. > >>>>>>>>>>>>>>> moveToState(Foreman.java:1183) > >>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/ > >> jira/browse/DRILL-4325> > >>>>>>>>>>>>>>> ForemanException: > >>>>>>>>>>>>>>> One or more nodes lost connectivity during query > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Any one experienced this issue ? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It happens when running query involving many parquet files > >> on a > >>>>>>>>>> cluster > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs > >>>>>>>> fine. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It is not caused by garbage collection, (checked on both ZK > >>>>>>>> node and > >>>>>>>>>>>> the > >>>>>>>>>>>>>>> involved drill bit). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Negotiated max session timeout is 40 seconds. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The sequence seems: > >>>>>>>>>>>>>>> - Drill Query begins, using an existing ZK session. > >>>>>>>>>>>>>>> - Drill Zk session timeouts > >>>>>>>>>>>>>>> - perhaps it was writing something that took too long > >>>>>>>>>>>>>>> - Drill attempts to renew session > >>>>>>>>>>>>>>> - drill believes that the write operation failed, so it > >>>>>>>> attempts > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> re-create the zk node, which trigger another exception. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We are open to any suggestion. We will report any finding. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>> Francois > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>> > >> > >> > >
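(One more hedged sketch, tied to the unregister/re-register sequence described in this thread: assuming the sys.drillbits system table is available in your Drill release, polling it from any open connection while the test queries run makes the membership churn visible without digging through the Curator-ServiceCache logs; a drillbit whose ZooKeeper session lapses drops out of the result and reappears once it re-registers, similar to the 220/221 counts in the log lines quoted above.)

-- Current cluster membership as seen through ZooKeeper.
SELECT * FROM sys.drillbits;

-- Or track just the count over time; a dip below the expected total flags the window
-- in which a node lost its ZooKeeper session.
SELECT COUNT(*) AS active_drillbits FROM sys.drillbits;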