Hi Francois, Thank you for the picture and the information you provided. We will keep you updated and let you know when we make changes in a future release.
Thanks, Padma > On Oct 26, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote: > > Hi, > > Sorry it took so long, lost the origin picture, had to go to a ms paint > training > here we go: > > https://github.com/fmethot/imagine/blob/master/affinity_factor.png > > > > On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam <skat...@maprtech.com> > wrote: > >> The mailing list does not seem to allow for images. Can you put the image >> elsewhere (Github or Dropbox), and reply with a link to it? >> >> - Sudheesh >> >>> On Oct 19, 2016, at 5:37 PM, François Méthot <fmetho...@gmail.com> >> wrote: >>> >>> We had problem on the 220 nodes cluster. No problem on the 12 nodes >> cluster. >>> >>> I agree that the data may not be distributed evenly. It would be a long >> and tedious process for me to produce a report. >>> >>> Here is a drawing of the fragments overview before and after the >> changes of the affinity factory on a sample query ran on the 220 nodes >> cluster. max_width_per_node=8 on both, but it turned out to be irrelevant >> to the issue. >>> >>> >>> >>> >>> >>> >>> Before: SYSTEM ERROR: ForemanException: One more more nodes lost >> connectivity during query. Identified nodes were [server121:31010]. >>> >>> After: error is gone >>> >>> Before: low disk io, high network io on the bottom part of the graph >>> after : high disk io, low network io on the bottom part of the graph >>> >>> >>> >>> >>> >>> >>> >>> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy < >> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>> wrote: >>> Hi Francois, >>> >>> It would be good to understand how increasing affinity_factor helped in >> your case >>> so we can better document and also use that knowledge to improve things >> in future release. >>> >>> If you have two clusters, it is not clear whether you had the problem >> on 12 node cluster >>> or 220 node cluster or both. Is the dataset same on both ? Is >> max_width_per_node=8 in both clusters ? >>> >>> Increasing affinity factor will lower remote reads by scheduling more >> fragments/doing more work >>> on nodes which have data available locally. So, there seem to be some >> kind of non uniform >>> data distribution for sure. It would be good if you can provide more >> details i.e. how the data is >>> distributed in the cluster and how the load on the nodes changed when >> affinity factor was increased. >>> >>> Thanks, >>> Padma >>> >>> >>>> On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com >> <mailto:fmetho...@gmail.com>> wrote: >>>> >>>> We have a 12 nodes cluster and a 220 nodes cluster, but they do not >> talk >>>> to each other. So Padma's analysis do not apply but thanks for your >>>> comments. Our goal had been to run Drill on the 220 nodes cluster >> after it >>>> proved worthy of it on the small cluster. >>>> >>>> planner.width.max_per_node was eventually reduced to 2 when we were >> trying >>>> to figure this out, it would still fail. After we figured out the >>>> affinity_factor, we put it back to its original value and it would work >>>> fine. >>>> >>>> >>>> >>>> Sudheesh: Indeed, The Zk/drill services use the same network on our >> bigger >>>> cluster. >>>> >>>> potential improvements: >>>> - planner.affinity_factor should be better documented. >>>> - When ZK disconnected, the running queries systematically failed. >> When we >>>> disabled the ForemanException thrown in the QueryManager. 
>>>> drillbitUnregistered method, most of our query started to run >> successfully, >>>> we would sometime get Drillbit Disconnected error within the rpc work >> bus. >>>> It did confirm that we still had something on our network going on, >> but it >>>> also showed that the RPC bus between drillbits was more resilient to >>>> network hiccup. I could not prove it, but I think under certain >> condition, >>>> the ZK session gets recreated, which cause a Query Manager unregistered >>>> (query fail) and register call right after, but the RPC >>>> bus would remains connected. >>>> >>>> >>>> We really appreciate your feedback and we hope to contribute to this >> great >>>> project in the future. >>>> Thanks >>>> Francois >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy < >> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>> >>>> wrote: >>>> >>>>> >>>>> Seems like you have 215 nodes, but the data for your query is there on >>>>> only 12 nodes. >>>>> Drill tries to distribute the scan fragments across the cluster more >>>>> uniformly (trying to utilize all CPU resources). >>>>> That is why you have lot of remote reads going on and increasing >> affinity >>>>> factor eliminates running scan >>>>> fragments on the other (215-12) nodes. >>>>> >>>>> you also mentioned planner.width.max_per_node is set to 8. >>>>> So, with increased affinity factor, you have 8 scan fragments doing >> a lot >>>>> more work on these 12 nodes. >>>>> Still, you got 10X improvement. Seems like your network is the obvious >>>>> bottleneck. Is it a 10G or 1G ? >>>>> >>>>> Also, increasing affinity factor helped in your case because there is >> no >>>>> data on other nodes. >>>>> But, if you have data non uniformly distributed across more nodes, you >>>>> might still have the problem. >>>>> >>>>> Thanks, >>>>> Padma >>>>> >>>>>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com >> <mailto:skat...@maprtech.com>> >>>>> wrote: >>>>>> >>>>>> Hi Francois, >>>>>> >>>>>> Thank you for posting your findings! Glad to see a 10X improvement. >>>>>> >>>>>> By increasing affinity factor, looks like Drill’s parallelizer is >> forced >>>>> to assign fragments on nodes with data i.e. with high favorability >> for data >>>>> locality. >>>>>> >>>>>> Regarding the random disconnection, I agree with your guess that the >>>>> network bandwidth is being used up by remote reads which causes lags >> in >>>>> drillbit to ZooKeeper heartbeats (since these services use the same >>>>> network)? Maybe others can comment here. >>>>>> >>>>>> Thank you, >>>>>> Sudheesh >>>>>> >>>>>>> On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com >> <mailto:fmetho...@gmail.com>> >>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> We finally got rid of this error. We have tried many, many things >> (like >>>>>>> modifying drill to ignore the error!), it ultimately came down to >> this >>>>>>> change: >>>>>>> >>>>>>> from default >>>>>>> planner.affinity_factor=1.2 >>>>>>> to >>>>>>> planner.affinity_factor=100 >>>>>>> >>>>>>> Basically this encourages fragment to only care about locally stored >>>>> files. >>>>>>> We looked at the code that used that property and figured that 100 >> would >>>>>>> have strong impact. >>>>>>> >>>>>>> What led us to this property is the fact that 1/4 of our fragments >> would >>>>>>> take a lot more time to complete their scan, up to 10x the time of >> the >>>>>>> fastest nodes. 
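For reference, the change described above can be applied and verified from any Drill client. A minimal sketch, using the option name quoted in this thread and standard Drill option syntax (ALTER SYSTEM makes it cluster-wide; ALTER SESSION would scope it to a single connection; the decimal literal is used because the option is a floating-point value):

    -- weight data locality much more heavily when assigning scan fragments
    ALTER SYSTEM SET `planner.affinity_factor` = 100.0;

    -- confirm the value currently in effect
    SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';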
On the slower nodes, Cloudera Manager would show >> very >>>>> low >>>>>>> Disk IOPS with high Network IO compare to our faster nodes. We had >>>>> noticed >>>>>>> that before but figured it would be some optimization to be done >> later >>>>> when >>>>>>> more pressing issue would be fixed, like Zk disconnection and OOM. >> We >>>>> were >>>>>>> desperate and decided to fix anything that would look unusual. >>>>>>> >>>>>>> After this change, query ran up to 10x faster. >>>>>>> >>>>>>> We no longer get random disconnection between node and Zookeeper. >>>>>>> >>>>>>> We are still wondering why exactly. Network should not be a >> bottleneck. >>>>>>> Could high network traffic between a Drillbit and HDFS causes the >>>>> Drillbit >>>>>>> to timeout with zookeeper? >>>>>>> >>>>>>> >>>>>>> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot < >> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>> wrote: >>>>>>> >>>>>>>> After the 30 seconds gap, all the Drill nodes receives the >> following: >>>>>>>> >>>>>>>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active >> drillbit >>>>> set >>>>>>>> changed. Now includes 220 total bits. New Active drill bits.... >>>>>>>> ...faulty node is not on the list... >>>>>>>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active >> drillbit >>>>> set >>>>>>>> changed. Now includes 221 total bits. New Active drill bits.... >>>>>>>> ...faulty node is back on the list... >>>>>>>> >>>>>>>> >>>>>>>> So the faulty Drill node get unregistered and registered right >> after. >>>>>>>> >>>>>>>> Drill is using the low level API for registering and >> unregistering, and >>>>>>>> the only place with unregistering occurs is when the drillbit is >>>>> closed at >>>>>>>> shutdown. >>>>>>>> >>>>>>>> That particular drillbit is still up and running after those log, >> it >>>>> could >>>>>>>> not have trigger the unregistering process through a shutdown. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Would you have an idea what else could cause a Drillbit to be >>>>> unregistered >>>>>>>> from the DiscoveryService and registered again right after? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> We are using Zookeeper 3.4.5 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot < >> fmetho...@gmail.com <mailto:fmetho...@gmail.com> >>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We have switched to 1.8 and we are still getting node >> disconnection. >>>>>>>>> >>>>>>>>> We did many tests, we thought initially our stand alone parquet >>>>> converter >>>>>>>>> was generating parquet files with problematic data (like 10K >>>>> characters >>>>>>>>> string), but we were able to reproduce it with employee data from >> the >>>>>>>>> tutorial. >>>>>>>>> >>>>>>>>> For example, we duplicated the Drill Tutorial "Employee" data to >>>>> reach >>>>>>>>> 500 M records spread over 130 parquet files. >>>>>>>>> Each files is ~60 MB. >>>>>>>>> >>>>>>>>> >>>>>>>>> We ran this query over and over on 5 different sessions using a >>>>> script: >>>>>>>>> select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like >> '%does >>>>>>>>> not exist%'; >>>>>>>>> >>>>>>>>> Query return no rows and would take ~35 to 45 seconds to return. >>>>>>>>> >>>>>>>>> Leaving the script running on each node, we eventually hit the >> "nodes >>>>>>>>> lost connectivity during query" error. 
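While a repro like the one above is looping, one cheap way to see the drop-outs from the SQL side is to poll the sys.drillbits system table from a separate connection; a drillbit that loses its ZooKeeper registration disappears from the list and, as the Curator log excerpts above show (220 bits, then 221), usually reappears moments later. A minimal sketch:

    -- drillbits currently registered in ZooKeeper, as seen by the node you are connected to
    SELECT * FROM sys.drillbits;

    -- or just watch the count flap
    SELECT COUNT(*) AS active_bits FROM sys.drillbits;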
>>>>>>>>> >>>>>>>>> One the done that failed, >>>>>>>>> >>>>>>>>> We see those log: >>>>>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO >>>>>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to >>>>> report: >>>>>>>>> RUNNING >>>>>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG >>>>>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on >> server064:31010 >>>>>>>>> >>>>>>>>> <--- 30 seconds gap for that fragment ---> >>>>>>>>> >>>>>>>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control. >>>>> ControlServer >>>>>>>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. >>>>> Actual >>>>>>>>> duration was 23617ms. >>>>>>>>> >>>>>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG >>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied >> out >>>>> of 0, >>>>>>>>> remaining: 0 incoming schema BatchSchema [...., >>>>> selectionVector=TWO_BYTE] >>>>>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing >>>>> fragment >>>>>>>>> manager : ...uuid... >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> For the same query on a working node: >>>>>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO >>>>>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to >> report: >>>>>>>>> RUNNING >>>>>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG >>>>>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on >> server125:31010 >>>>>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG >>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied >> out >>>>> of 0, >>>>>>>>> remaining: 0 incoming schema BatchSchema [...., >>>>> selectionVector=TWO_BYTE] >>>>>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG >>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied >> out >>>>> of 0, >>>>>>>>> remaining: 0 incoming schema BatchSchema [...., >>>>> selectionVector=TWO_BYTE] >>>>>>>>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG >>>>>>>>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row >>>>>>>>> groups(0) in file `/data/drill/tmp/PARQUET_ >> EMPLOYEE/0_0_14.parquet >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> We are investigating what could get cause that 30 seconds gap for >> that >>>>>>>>> fragment. >>>>>>>>> >>>>>>>>> Any idea let us know >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Francois >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot < >> fmetho...@gmail.com <mailto:fmetho...@gmail.com> >>>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Sudheesh, >>>>>>>>>> >>>>>>>>>> If I add selection filter so that no row are returned, the same >>>>>>>>>> problem occur. I also simplified the query to include only few >>>>> integer >>>>>>>>>> columns. >>>>>>>>>> >>>>>>>>>> That particular data repo is ~200+ Billions records spread over >> ~50 >>>>> 000 >>>>>>>>>> parquet files. >>>>>>>>>> >>>>>>>>>> We have other CSV data repo that are 100x smaller that does not >>>>> trigger >>>>>>>>>> this issue. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this >>>>> case? >>>>>>>>>> There is also a bizarre case where the node that is reported as >> lost >>>>> is the >>>>>>>>>> node itself. 
>>>>>>>>>> Yes, the stack trace is from the ticket, It did occurred once or >>>>> twice >>>>>>>>>> (in the many many attempts) that it was the node itself. >>>>>>>>>> >>>>>>>>>> + Is there a spike in memory usage of the Drillbit this is the >>>>> Foreman >>>>>>>>>> for the query (process memory, not just heap)? >>>>>>>>>> We don't notice any unusual spike, each nodes gets busy in the >> same >>>>>>>>>> range when query is running. >>>>>>>>>> >>>>>>>>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did >> not see >>>>>>>>>> any improvement. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> We will update from 1.7 to 1.8 before going ahead with more >>>>>>>>>> investigation. >>>>>>>>>> >>>>>>>>>> Thanks a lot. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam < >>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Francois, >>>>>>>>>>> >>>>>>>>>>> A simple query with only projections is not an “ideal” use case, >>>>> since >>>>>>>>>>> Drill is bound by how fast the client can consume records. There >>>>> are 1000 >>>>>>>>>>> scanners sending data to 1 client (vs far fewer scanners sending >>>>> data in >>>>>>>>>>> the 12 node case). >>>>>>>>>>> >>>>>>>>>>> This might increase the load on the Drillbit that is the >> Foreman for >>>>>>>>>>> the query. In the query profile, the scanners should be >> spending a >>>>> lot more >>>>>>>>>>> time “waiting” to send records to the client (via root >> fragment). >>>>>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in >> this >>>>> case? >>>>>>>>>>> There is also a bizarre case where the node that is reported as >>>>> lost is the >>>>>>>>>>> node itself. >>>>>>>>>>> + Is there a spike in memory usage of the Drillbit this is the >>>>> Foreman >>>>>>>>>>> for the query (process memory, not just heap)? >>>>>>>>>>> >>>>>>>>>>> Regarding the warnings ... >>>>>>>>>>> >>>>>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode >> REQUEST of >>>>>>>>>>> rpc type >>>>>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> RPC type 6 is a cancellation request; DRILL-4766 [2] should >> help in >>>>>>>>>>> this case, which is resolved in the latest version of Drill. So >> as >>>>> Chun >>>>>>>>>>> suggested, upgrade the cluster to the latest version of Drill. >>>>>>>>>>> >>>>>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode >> RESPONSE of >>>>>>>>>>> rpc type >>>>>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms. >>>>>>>>>>> >>>>>>>>>>> I am surprised that responses are taking that long to handle. >>>>>>>>>>> + Are both messages on the same Drillbit? >>>>>>>>>>> >>>>>>>>>>> The other warnings can be ignored. 
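Since the DRILL-4766 fix only helps once the new build is actually running, it is worth confirming the version reported by the drillbits themselves after an upgrade; a minimal sketch using Drill's sys.version table (it reflects only the drillbit serving your connection, so each node, or the web UI, should be checked to confirm the whole cluster moved):

    -- version and commit information of the drillbit handling this session
    SELECT * FROM sys.version;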
>>>>>>>>>>> >>>>>>>>>>> Thank you, >>>>>>>>>>> Sudheesh >>>>>>>>>>> >>>>>>>>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test >>>>>>>>>>> environments :) >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766 < >> https://issues.apache.org/jira/browse/DRILL-4766> < >>>>>>>>>>> https://issues.apache.org/jira/browse/DRILL-4766 < >> https://issues.apache.org/jira/browse/DRILL-4766>> >>>>>>>>>>> >>>>>>>>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot < >> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Sudheesh, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> + Does the query involve any aggregations or filters? Or is >> this a >>>>>>>>>>> select >>>>>>>>>>>> query with only projections? >>>>>>>>>>>> Simple query with only projections >>>>>>>>>>>> >>>>>>>>>>>> + Any suspicious timings in the query profile? >>>>>>>>>>>> Nothing specially different than our working query on our small >>>>>>>>>>> cluster. >>>>>>>>>>>> >>>>>>>>>>>> + Any suspicious warning messages in the logs around the time >> of >>>>>>>>>>> failure on >>>>>>>>>>>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially >>>>> this >>>>>>>>>>> one >>>>>>>>>>>> (“..” are place holders): >>>>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. >> Actual >>>>>>>>>>> duration >>>>>>>>>>>> was ..ms. >>>>>>>>>>>> >>>>>>>>>>>> Well we do see this warning on the failing node (on my last >> test), >>>>> I >>>>>>>>>>> found >>>>>>>>>>>> this WARNING in our log for the past month for pretty much >> every >>>>> node >>>>>>>>>>> I >>>>>>>>>>>> checked. >>>>>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode >> REQUEST of >>>>>>>>>>> rpc type >>>>>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms. >>>>>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode >> RESPONSE of >>>>>>>>>>> rpc type >>>>>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms. >>>>>>>>>>>> >>>>>>>>>>>> We really appreciate your help. I will dig in the source code >> for >>>>>>>>>>> when and >>>>>>>>>>>> why this error happen. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Francois >>>>>>>>>>>> >>>>>>>>>>>> P.S.: >>>>>>>>>>>> We do see this also: >>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not >>>>> found >>>>>>>>>>> in the >>>>>>>>>>>> work bus. >>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not >>>>> found >>>>>>>>>>> in the >>>>>>>>>>>> work bus. >>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not >>>>> found >>>>>>>>>>> in the >>>>>>>>>>>> work bus. >>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN >>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not >>>>> found >>>>>>>>>>> in the >>>>>>>>>>>> work bus. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam < >>>>>>>>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Francois, >>>>>>>>>>>>> >>>>>>>>>>>>> More questions.. >>>>>>>>>>>>> >>>>>>>>>>>>>> + Can you share the query profile? >>>>>>>>>>>>>> I will sum it up: >>>>>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers. 
>>>>>>>>>>>>>> Scan is done on 13862 parquet files spread on 1000 >> fragments. >>>>>>>>>>>>>> Fragments are spread accross 215 nodes. >>>>>>>>>>>>> >>>>>>>>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine. >>>>>>>>>>>>> >>>>>>>>>>>>> + Does the query involve any aggregations or filters? Or is >> this a >>>>>>>>>>> select >>>>>>>>>>>>> query with only projections? >>>>>>>>>>>>> + Any suspicious timings in the query profile? >>>>>>>>>>>>> + Any suspicious warning messages in the logs around the time >> of >>>>>>>>>>> failure >>>>>>>>>>>>> on any of the drillbits? Specially on atsqa4-133.qa.lab? >> Specially >>>>>>>>>>> this one >>>>>>>>>>>>> (“..” are place holders): >>>>>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. >> Actual >>>>>>>>>>>>> duration was ..ms. >>>>>>>>>>>>> >>>>>>>>>>>>> Thank you, >>>>>>>>>>>>> Sudheesh >>>>>>>>>>>>> >>>>>>>>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot < >>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Sudheesh, >>>>>>>>>>>>>> >>>>>>>>>>>>>> + How many zookeeper servers in the quorum? >>>>>>>>>>>>>> The quorum has 3 servers, everything looks healthy >>>>>>>>>>>>>> >>>>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? >> Any >>>>> other >>>>>>>>>>>>>> applications running on that node? How many threads is the >> Drill >>>>>>>>>>> process >>>>>>>>>>>>>> using? >>>>>>>>>>>>>> The load on the failing node(8 cores) is 14, when Drill is >>>>> running. >>>>>>>>>>> Which >>>>>>>>>>>>>> is nothing out of the ordinary according to our admin. >>>>>>>>>>>>>> HBase is also running. >>>>>>>>>>>>>> planner.width.max_per_node is set to 8 >>>>>>>>>>>>>> >>>>>>>>>>>>>> + When running the same query on 12 nodes, is the data size >> same? >>>>>>>>>>>>>> Yes >>>>>>>>>>>>>> >>>>>>>>>>>>>> + Can you share the query profile? >>>>>>>>>>>>>> I will sum it up: >>>>>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers. >>>>>>>>>>>>>> Scan is done on 13862 parquet files spread on 1000 >> fragments. >>>>>>>>>>>>>> Fragments are spread accross 215 nodes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> We are in process of increasing our Zookeeper session timeout >>>>>>>>>>> config to >>>>>>>>>>>>> see >>>>>>>>>>>>>> if it helps. >>>>>>>>>>>>>> >>>>>>>>>>>>>> thanks >>>>>>>>>>>>>> >>>>>>>>>>>>>> Francois >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam < >>>>>>>>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Francois, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Few questions: >>>>>>>>>>>>>>> + How many zookeeper servers in the quorum? >>>>>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? >> Any >>>>>>>>>>> other >>>>>>>>>>>>>>> applications running on that node? How many threads is the >> Drill >>>>>>>>>>> process >>>>>>>>>>>>>>> using? >>>>>>>>>>>>>>> + When running the same query on 12 nodes, is the data size >>>>> same? >>>>>>>>>>>>>>> + Can you share the query profile? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This may not be the right thing to do, but for now, If the >>>>> cluster >>>>>>>>>>> is >>>>>>>>>>>>>>> heavily loaded, increase the zk timeout. 
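A note on raising the ZooKeeper timeout: it is a boot-time setting rather than a session option, so it has to be changed in drill-override.conf (the relevant key should be the one under drill.exec.zk, which I believe is drill.exec.zk.timeout) and the drillbits restarted. The values currently in effect can at least be inspected from SQL; a minimal sketch, assuming the standard sys.boot table:

    -- boot options the drillbits started with; look for the ZooKeeper connect string and timeout
    SELECT * FROM sys.boot WHERE name LIKE '%zk%';

Keep in mind the negotiated session timeout is also capped by the ZooKeeper servers' own maxSessionTimeout (the 40 seconds mentioned further down in the quoted thread), so the ZooKeeper side may need adjusting as well.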
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you, >>>>>>>>>>>>>>> Sudheesh >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot < >>>>>>>>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We are running 1.7. >>>>>>>>>>>>>>>> The log were taken from the jira tickets. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We will try out 1.8 soon. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang < >>>>> cch...@maprtech.com <mailto:cch...@maprtech.com>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Looks like you are running 1.5. I believe there are some >> work >>>>>>>>>>> done in >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>> area and the newer release should behave better. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot < >>>>>>>>>>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We are trying to find a solution/workaround to issue: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR >>>>>>>>>>>>>>>>>> o.a.drill.exec.work.foreman.Fo < >> http://o.a.drill.exec.work.foreman.fo/>reman - SYSTEM ERROR: >>>>>>>>>>>>> ForemanException: >>>>>>>>>>>>>>>>>> One more more nodes lost connectivity during query. >>>>> Identified >>>>>>>>>>> nodes >>>>>>>>>>>>>>>>>> were [atsqa4-133.qa.lab:31010]. >>>>>>>>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM >>>>> ERROR: >>>>>>>>>>>>>>>>>> ForemanException: One more more nodes lost connectivity >>>>> during >>>>>>>>>>> query. >>>>>>>>>>>>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010]. >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work.for >>>>>>>>>>> eman.Foreman$ForemanResult. >>>>>>>>>>>>>>>>>> close(Foreman.java:746) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work. >>>>> foreman.Foreman$StateSwitch. >>>>>>>>>>>>>>>>>> processEvent(Foreman.java:858) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work. >>>>> foreman.Foreman$StateSwitch. >>>>>>>>>>>>>>>>>> processEvent(Foreman.java:790) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work. >>>>> foreman.Foreman$StateSwitch. >>>>>>>>>>>>>>>>>> moveToState(Foreman.java:792) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work. >>>>> foreman.Foreman.moveToState( >>>>>>>>>>>>>>>>>> Foreman.java:909) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work. >>>>> foreman.Foreman.access$2700( >>>>>>>>>>>>>>>>>> Foreman.java:110) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> at org.apache.drill.exec.work.for >>>>>>>>>>> eman.Foreman$StateListener. 
>>>>>>>>>>>>>>>>>> moveToState(Foreman.java:1183) >>>>>>>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/ < >> https://issues.apache.org/> >>>>> jira/browse/DRILL-4325> >>>>>>>>>>>>>>>>>> ForemanException: >>>>>>>>>>>>>>>>>> One or more nodes lost connectivity during query >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Any one experienced this issue ? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> It happens when running query involving many parquet >> files >>>>> on a >>>>>>>>>>>>> cluster >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes >> runs >>>>>>>>>>> fine. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> It is not caused by garbage collection, (checked on both >> ZK >>>>>>>>>>> node and >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> involved drill bit). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Negotiated max session timeout is 40 seconds. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The sequence seems: >>>>>>>>>>>>>>>>>> - Drill Query begins, using an existing ZK session. >>>>>>>>>>>>>>>>>> - Drill Zk session timeouts >>>>>>>>>>>>>>>>>> - perhaps it was writing something that took too long >>>>>>>>>>>>>>>>>> - Drill attempts to renew session >>>>>>>>>>>>>>>>>> - drill believes that the write operation failed, so it >>>>>>>>>>> attempts >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> re-create the zk node, which trigger another exception. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We are open to any suggestion. We will report any >> finding. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>> Francois >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>> >>> >> >>