The mailing list does not seem to allow for images. Can you put the image 
elsewhere (Github or Dropbox), and reply with a link to it?

- Sudheesh

> On Oct 19, 2016, at 5:37 PM, François Méthot <fmetho...@gmail.com> wrote:
> 
> We had the problem on the 220-node cluster. No problem on the 12-node cluster.
> 
> I agree that the data may not be distributed evenly. It would be a long and 
> tedious process for me to produce a report. 
> 
> Here is a drawing of the fragments overview before and after the change to the
> affinity factor, on a sample query run on the 220-node cluster.
> max_width_per_node=8 in both cases, but it turned out to be irrelevant to the issue.
> 
> Before: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity 
> during query.  Identified nodes were [server121:31010].
> 
> After: error is gone
> 
> Before: low disk IO, high network IO on the bottom part of the graph
> After: high disk IO, low network IO on the bottom part of the graph
> 
> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:
> Hi Francois,
> 
> It would be good to understand how increasing affinity_factor helped in your
> case, so we can document it better and also use that knowledge to improve
> things in a future release.
> 
> Since you have two clusters, it is not clear whether you had the problem on the
> 12-node cluster, the 220-node cluster, or both. Is the dataset the same on
> both? Is max_width_per_node=8 in both clusters?
> 
> Increasing the affinity factor will lower remote reads by scheduling more
> fragments (doing more work) on nodes which have the data available locally. So
> there does seem to be some kind of non-uniform data distribution for sure. It
> would be good if you could provide more details, i.e. how the data is
> distributed in the cluster and how the load on the nodes changed when the
> affinity factor was increased.
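> 
> For example, something like this (assuming these option names show up the same
> way in sys.options on your Drill version) would confirm the values currently in
> effect on the cluster:
> 
>   SELECT * FROM sys.options
>   WHERE name IN ('planner.affinity_factor', 'planner.width.max_per_node');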
> 
> Thanks,
> Padma
> 
> 
> > On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com> wrote:
> >
> > We have a 12-node cluster and a 220-node cluster, but they do not talk to
> > each other, so Padma's analysis does not apply; thanks for your comments
> > though. Our goal had been to run Drill on the 220-node cluster after it
> > proved itself on the small cluster.
> >
> > planner.width.max_per_node was eventually reduced to 2 while we were trying
> > to figure this out, and queries would still fail. After we figured out
> > affinity_factor, we put it back to its original value and everything worked
> > fine.
> >
> >
> >
> > Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger
> > cluster.
> >
> > Potential improvements:
> > - planner.affinity_factor should be better documented.
> > - When ZK disconnected, the running queries systematically failed. When we
> > disabled the ForemanException thrown in the QueryManager.drillbitUnregistered
> > method, most of our queries started to run successfully; we would sometimes
> > still get a Drillbit Disconnected error within the RPC work bus. This
> > confirmed that we still had something going on with our network, but it also
> > showed that the RPC bus between drillbits was more resilient to network
> > hiccups. I could not prove it, but I think that under certain conditions the
> > ZK session gets recreated, which causes a QueryManager unregister call (query
> > fails) followed by a register call right after, while the RPC bus remains
> > connected.
> >
> >
> > We really appreciate your feedback and we hope to contribute to this great
> > project in the future.
> > Thanks
> > Francois
> >
> > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:
> >
> >>
> >> Seems like you have 215 nodes, but the data for your query is on only 12
> >> nodes. Drill tries to distribute the scan fragments across the cluster more
> >> uniformly (trying to utilize all CPU resources). That is why you have a lot
> >> of remote reads going on, and increasing the affinity factor eliminates
> >> running scan fragments on the other (215-12) nodes.
> >>
> >> You also mentioned planner.width.max_per_node is set to 8. So, with the
> >> increased affinity factor, you have 8 scan fragments per node doing a lot
> >> more work on these 12 nodes. Still, you got a 10X improvement. Seems like
> >> your network is the obvious bottleneck. Is it 10G or 1G?
> >>
> >> Also, increasing the affinity factor helped in your case because there is no
> >> data on the other nodes. But if you have data non-uniformly distributed
> >> across more nodes, you might still have the problem.
> >>
> >> Thanks,
> >> Padma
> >>
> >>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>
> >>> Hi Francois,
> >>>
> >>> Thank you for posting your findings! Glad to see a 10X improvement.
> >>>
> >>> By increasing the affinity factor, it looks like Drill’s parallelizer is
> >>> forced to assign fragments to nodes that hold the data, i.e. with a strong
> >>> preference for data locality.
> >>>
> >>> Regarding the random disconnections, I agree with your guess that the
> >>> network bandwidth is being used up by remote reads, which causes lags in
> >>> drillbit-to-ZooKeeper heartbeats (since these services use the same
> >>> network). Maybe others can comment here.
> >>>
> >>> Thank you,
> >>> Sudheesh
> >>>
> >>>> On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We finally got rid of this error. We tried many, many things (like
> >>>> modifying Drill to ignore the error!), but it ultimately came down to this
> >>>> change:
> >>>>
> >>>> from default
> >>>> planner.affinity_factor=1.2
> >>>> to
> >>>> planner.affinity_factor=100
> >>>>
> >>>> Basically this encourages fragments to only care about locally stored
> >>>> files. We looked at the code that uses that property and figured that 100
> >>>> would have a strong impact.
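> >>>> 
> >>>> In case it is useful to anyone else: we applied it as a regular system
> >>>> option, roughly like this (ALTER SESSION works the same way if you want to
> >>>> test it on a single connection first):
> >>>> 
> >>>>   ALTER SYSTEM SET `planner.affinity_factor` = 100;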
> >>>>
> >>>> What led us to this property is the fact that 1/4 of our fragments would
> >>>> take a lot more time to complete their scan, up to 10x the time of the
> >>>> fastest nodes. On the slower nodes, Cloudera Manager would show very low
> >>>> disk IOPS with high network IO compared to our faster nodes. We had noticed
> >>>> that before but figured it was an optimization to be done later, once more
> >>>> pressing issues were fixed, like the ZK disconnections and OOMs. We were
> >>>> desperate and decided to fix anything that looked unusual.
> >>>>
> >>>> After this change, queries ran up to 10x faster.
> >>>>
> >>>> We no longer get random disconnections between nodes and ZooKeeper.
> >>>>
> >>>> We are still wondering why exactly. The network should not be a bottleneck.
> >>>> Could high network traffic between a Drillbit and HDFS cause the Drillbit
> >>>> to time out with ZooKeeper?
> >>>>
> >>>>
> >>>> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>
> >>>>> After the 30-second gap, all the Drill nodes receive the following:
> >>>>>
> >>>>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
> >>>>> changed. Now includes 220 total bits. New Active drill bits....
> >>>>> ...faulty node is not on the list...
> >>>>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
> >>>>> changed. Now includes 221 total bits. New Active drill bits....
> >>>>> ...faulty node is back on the list...
> >>>>>
> >>>>>
> >>>>> So the faulty Drill node gets unregistered and then registered again right after.
> >>>>>
> >>>>> Drill is using the low-level API for registering and unregistering, and
> >>>>> the only place where unregistering occurs is when the drillbit is closed
> >>>>> at shutdown.
> >>>>>
> >>>>> That particular drillbit was still up and running after those log entries,
> >>>>> so it could not have triggered the unregistering process through a
> >>>>> shutdown.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Would you have an idea what else could cause a Drillbit to be
> >> unregistered
> >>>>> from the DiscoveryService and registered again right after?
> >>>>>
> >>>>>
> >>>>>
> >>>>> We are using Zookeeper 3.4.5
> >>>>>
> >>>>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We have switched to 1.8 and we are still getting node disconnections.
> >>>>>>
> >>>>>> We did many tests; we initially thought our standalone Parquet converter
> >>>>>> was generating Parquet files with problematic data (like 10K-character
> >>>>>> strings), but we were able to reproduce it with the employee data from
> >>>>>> the tutorial.
> >>>>>>
> >>>>>> For example, we duplicated the Drill tutorial "Employee" data to reach
> >>>>>> 500M records spread over 130 Parquet files.
> >>>>>> Each file is ~60 MB.
> >>>>>>
> >>>>>>
> >>>>>> We ran this query over and over in 5 different sessions using a script:
> >>>>>> select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
> >>>>>> not exist%';
> >>>>>>
> >>>>>> The query returns no rows and would take ~35 to 45 seconds to complete.
> >>>>>>
> >>>>>> Leaving the script running on each node, we eventually hit the "nodes
> >>>>>> lost connectivity during query" error.
> >>>>>>
> >>>>>> On the node that failed, we see these logs:
> >>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
> >>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to
> >> report:
> >>>>>> RUNNING
> >>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> >>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
> >>>>>>
> >>>>>> <--- 30 seconds gap for that fragment --->
> >>>>>>
> >>>>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control.
> >> ControlServer
> >>>>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms.
> >> Actual
> >>>>>> duration was 23617ms.
> >>>>>>
> >>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out
> >> of 0,
> >>>>>> remaining: 0 incoming schema BatchSchema [....,
> >> selectionVector=TWO_BYTE]
> >>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing
> >> fragment
> >>>>>> manager : ...uuid...
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> For the same query on a working node:
> >>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
> >>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
> >>>>>> RUNNING
> >>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
> >>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
> >>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out
> >> of 0,
> >>>>>> remaining: 0 incoming schema BatchSchema [....,
> >> selectionVector=TWO_BYTE]
> >>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out
> >> of 0,
> >>>>>> remaining: 0 incoming schema BatchSchema [....,
> >> selectionVector=TWO_BYTE]
> >>>>>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
> >>>>>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row
> >>>>>> groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> We are investigating what could cause that 30-second gap for that
> >>>>>> fragment.
> >>>>>>
> >>>>>> If you have any ideas, let us know.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Francois
> >>>>>>
> >>>>>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Sudheesh,
> >>>>>>>
> >>>>>>> If I add a selection filter so that no rows are returned, the same
> >>>>>>> problem occurs. I also simplified the query to include only a few
> >>>>>>> integer columns.
> >>>>>>>
> >>>>>>> That particular data repo is ~200+ billion records spread over ~50,000
> >>>>>>> Parquet files.
> >>>>>>>
> >>>>>>> We have other CSV data repos that are 100x smaller that do not trigger
> >>>>>>> this issue.
> >>>>>>>
> >>>>>>>
> >>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
> >>>>>>> There is also a bizarre case where the node that is reported as lost is
> >>>>>>> the node itself.
> >>>>>>> Yes, the stack trace is from the ticket. It did occur once or twice (in
> >>>>>>> the many, many attempts) that it was the node itself.
> >>>>>>>
> >>>>>>> + Is there a spike in memory usage of the Drillbit that is the Foreman
> >>>>>>> for the query (process memory, not just heap)?
> >>>>>>> We don't notice any unusual spike; each node gets busy in the same range
> >>>>>>> when the query is running.
> >>>>>>>
> >>>>>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap and did not see
> >>>>>>> any improvement.
> >>>>>>>
> >>>>>>>
> >>>>>>> We will update from 1.7 to 1.8 before going ahead with more
> >>>>>>> investigation.
> >>>>>>>
> >>>>>>> Thanks a lot.
> >>>>>>>
> >>>>>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Francois,
> >>>>>>>>
> >>>>>>>> A simple query with only projections is not an “ideal” use case,
> >> since
> >>>>>>>> Drill is bound by how fast the client can consume records. There
> >> are 1000
> >>>>>>>> scanners sending data to 1 client (vs far fewer scanners sending
> >> data in
> >>>>>>>> the 12 node case).
> >>>>>>>>
> >>>>>>>> This might increase the load on the Drillbit that is the Foreman for
> >>>>>>>> the query. In the query profile, the scanners should be spending a
> >> lot more
> >>>>>>>> time “waiting” to send records to the client (via root fragment).
> >>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this
> >> case?
> >>>>>>>> There is also a bizarre case where the node that is reported as
> >> lost is the
> >>>>>>>> node itself.
> >>>>>>>> + Is there a spike in memory usage of the Drillbit that is the Foreman
> >>>>>>>> for the query (process memory, not just heap)?
> >>>>>>>>
> >>>>>>>> Regarding the warnings ...
> >>>>>>>>
> >>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>>>> rpc type
> >>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> RPC type 6 is a cancellation request; DRILL-4766 [2], which is resolved
> >>>>>>>> in the latest version of Drill, should help in this case. So, as Chun
> >>>>>>>> suggested, upgrade the cluster to the latest version of Drill.
> >>>>>>>>
> >>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>>>> rpc type
> >>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>>>>
> >>>>>>>> I am surprised that responses are taking that long to handle.
> >>>>>>>> + Are both messages on the same Drillbit?
> >>>>>>>>
> >>>>>>>> The other warnings can be ignored.
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> Sudheesh
> >>>>>>>>
> >>>>>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test
> >>>>>>>> environments :)
> >>>>>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766
> >>>>>>>>
> >>>>>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Sudheesh,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>>>> select
> >>>>>>>>> query with only projections?
> >>>>>>>>> Simple query with only projections
> >>>>>>>>>
> >>>>>>>>> + Any suspicious timings in the query profile?
> >>>>>>>>> Nothing especially different from our working query on our small
> >>>>>>>>> cluster.
> >>>>>>>>>
> >>>>>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>>>> failure on
> >>>>>>>>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially
> >> this
> >>>>>>>> one
> >>>>>>>>> (“..” are place holders):
> >>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms.  Actual
> >>>>>>>> duration
> >>>>>>>>> was ..ms.
> >>>>>>>>>
> >>>>>>>>> Well, we do see this warning on the failing node (on my last test); I
> >>>>>>>>> found this WARNING in our logs for the past month for pretty much
> >>>>>>>>> every node I checked.
> >>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>>>> rpc type
> >>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>>>> rpc type
> >>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>>>>>
> >>>>>>>>> We really appreciate your help. I will dig into the source code to see
> >>>>>>>>> when and why this error happens.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Francois
> >>>>>>>>>
> >>>>>>>>> P.S.:
> >>>>>>>>> We do see this also:
> >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not
> >> found
> >>>>>>>> in the
> >>>>>>>>> work bus.
> >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not
> >> found
> >>>>>>>> in the
> >>>>>>>>> work bus.
> >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not
> >> found
> >>>>>>>> in the
> >>>>>>>>> work bus.
> >>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> >>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not
> >> found
> >>>>>>>> in the
> >>>>>>>>> work bus.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Francois,
> >>>>>>>>>>
> >>>>>>>>>> More questions..
> >>>>>>>>>>
> >>>>>>>>>>> + Can you share the query profile?
> >>>>>>>>>>> I will sum it up:
> >>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
> >>>>>>>>>>> Scan is done on 13862 Parquet files spread over 1000 fragments.
> >>>>>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>>>>
> >>>>>>>>>> So ~5 leaf fragments (or scanners) per Drillbit (1000 fragments over 215 nodes) seems fine.
> >>>>>>>>>>
> >>>>>>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>>>> select
> >>>>>>>>>> query with only projections?
> >>>>>>>>>> + Any suspicious timings in the query profile?
> >>>>>>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>>>> failure
> >>>>>>>>>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially
> >>>>>>>> this one
> >>>>>>>>>> (“..” are place holders):
> >>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms.  Actual
> >>>>>>>>>> duration was ..ms.
> >>>>>>>>>>
> >>>>>>>>>> Thank you,
> >>>>>>>>>> Sudheesh
> >>>>>>>>>>
> >>>>>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Sudheesh,
> >>>>>>>>>>>
> >>>>>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>>>>> The quorum has 3 servers, everything looks healthy
> >>>>>>>>>>>
> >>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >> other
> >>>>>>>>>>> applications running on that node? How many threads is the Drill
> >>>>>>>> process
> >>>>>>>>>>> using?
> >>>>>>>>>>> The load on the failing node (8 cores) is 14 when Drill is running,
> >>>>>>>>>>> which is nothing out of the ordinary according to our admin.
> >>>>>>>>>>> HBase is also running.
> >>>>>>>>>>> planner.width.max_per_node is set to 8
> >>>>>>>>>>>
> >>>>>>>>>>> + When running the same query on 12 nodes, is the data size same?
> >>>>>>>>>>> Yes
> >>>>>>>>>>>
> >>>>>>>>>>> + Can you share the query profile?
> >>>>>>>>>>> I will sum it up:
> >>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
> >>>>>>>>>>> Scan is done on 13862 Parquet files spread over 1000 fragments.
> >>>>>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> We are in the process of increasing our ZooKeeper session timeout
> >>>>>>>>>>> config to see if it helps.
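> >>>>>>>>>>>
> >>>>>>>>>>> For reference, we believe the knob involved is the drill.exec.zk.timeout
> >>>>>>>>>>> boot option in drill-override.conf (treat the exact name as our
> >>>>>>>>>>> assumption). The ZK-related values a drillbit actually booted with can
> >>>>>>>>>>> be checked from SQL, for example:
> >>>>>>>>>>>
> >>>>>>>>>>>   SELECT * FROM sys.boot WHERE name LIKE '%zk%';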
> >>>>>>>>>>>
> >>>>>>>>>>> thanks
> >>>>>>>>>>>
> >>>>>>>>>>> Francois
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Francois,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Few questions:
> >>>>>>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >>>>>>>> other
> >>>>>>>>>>>> applications running on that node? How many threads is the Drill
> >>>>>>>> process
> >>>>>>>>>>>> using?
> >>>>>>>>>>>> + When running the same query on 12 nodes, is the data size
> >> same?
> >>>>>>>>>>>> + Can you share the query profile?
> >>>>>>>>>>>>
> >>>>>>>>>>>> This may not be the right thing to do, but for now, if the cluster
> >>>>>>>>>>>> is heavily loaded, increase the ZK timeout.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you,
> >>>>>>>>>>>> Sudheesh
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We are running 1.7.
> >>>>>>>>>>>>> The logs were taken from the JIRA ticket.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We will try out 1.8 soon.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Looks like you are running 1.5. I believe there has been some work
> >>>>>>>>>>>>>> done in that area, and the newer releases should behave better.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We are trying to find a solution/workaround to this issue:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>>>>>>>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> >>>>>>>>>>>>>>> One more more nodes lost connectivity during query. Identified nodes
> >>>>>>>>>>>>>>> were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>>>>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during query.
> >>>>>>>>>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>    at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> >>>>>>>>>>>>>>> ForemanException:
> >>>>>>>>>>>>>>> One or more nodes lost connectivity during query
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Has anyone experienced this issue?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It happens when running a query involving many Parquet files on a
> >>>>>>>>>>>>>>> cluster of 200 nodes. The same query on a smaller cluster of 12
> >>>>>>>>>>>>>>> nodes runs fine.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It is not caused by garbage collection (checked on both the ZK node
> >>>>>>>>>>>>>>> and the involved drillbit).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Negotiated max session timeout is 40 seconds.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The sequence seems to be:
> >>>>>>>>>>>>>>> - Drill query begins, using an existing ZK session.
> >>>>>>>>>>>>>>> - The Drill ZK session times out
> >>>>>>>>>>>>>>>   - perhaps it was writing something that took too long
> >>>>>>>>>>>>>>> - Drill attempts to renew the session
> >>>>>>>>>>>>>>>   - Drill believes that the write operation failed, so it attempts
> >>>>>>>>>>>>>>>     to re-create the ZK node, which triggers another exception.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We are open to any suggestion. We will report any finding.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>> Francois
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> 
> 
