Hi,

Sorry it took so long; I lost the original picture and had to fall back on my MS Paint training. Here we go:
https://github.com/fmethot/imagine/blob/master/affinity_factor.png

On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

The mailing list does not seem to allow for images. Can you put the image elsewhere (Github or Dropbox), and reply with a link to it?

- Sudheesh

On Oct 19, 2016, at 5:37 PM, François Méthot <fmetho...@gmail.com> wrote:

We had the problem on the 220-node cluster. No problem on the 12-node cluster.

I agree that the data may not be distributed evenly. It would be a long and tedious process for me to produce a report.

Here is a drawing of the fragments overview before and after the change of the affinity factor, on a sample query run on the 220-node cluster. max_width_per_node=8 on both, but it turned out to be irrelevant to the issue.

Before: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [server121:31010].
After: the error is gone.

Before: low disk IO, high network IO on the bottom part of the graph.
After: high disk IO, low network IO on the bottom part of the graph.

On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:

Hi Francois,

It would be good to understand how increasing affinity_factor helped in your case, so we can better document it and also use that knowledge to improve things in a future release.

If you have two clusters, it is not clear whether you had the problem on the 12-node cluster, the 220-node cluster, or both. Is the dataset the same on both? Is max_width_per_node=8 in both clusters?

Increasing the affinity factor will lower remote reads by scheduling more fragments/doing more work on nodes which have data available locally. So there seems to be some kind of non-uniform data distribution for sure. It would be good if you can provide more details, i.e. how the data is distributed in the cluster and how the load on the nodes changed when the affinity factor was increased.

Thanks,
Padma

On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com> wrote:

We have a 12-node cluster and a 220-node cluster, but they do not talk to each other. So Padma's analysis does not apply, but thanks for your comments. Our goal had been to run Drill on the 220-node cluster after it proved worthy on the small cluster.

planner.width.max_per_node was eventually reduced to 2 when we were trying to figure this out; it would still fail. After we figured out the affinity_factor, we put it back to its original value and it would work fine.

Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger cluster.

Potential improvements:
- planner.affinity_factor should be better documented.
- When ZK disconnected, the running queries systematically failed. When we disabled the ForemanException thrown in the QueryManager.drillbitUnregistered method, most of our queries started to run successfully; we would sometimes get a "Drillbit Disconnected" error within the RPC work bus. It confirmed that we still had something going on with our network, but it also showed that the RPC bus between drillbits was more resilient to network hiccups.
I could not prove it, but I think that under certain conditions the ZK session gets recreated, which causes a QueryManager unregister call (failing the query) followed immediately by a register call, while the RPC bus remains connected.

We really appreciate your feedback and we hope to contribute to this great project in the future.
Thanks
Francois

On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <ppenumar...@maprtech.com> wrote:

Seems like you have 215 nodes, but the data for your query is there on only 12 nodes. Drill tries to distribute the scan fragments across the cluster more uniformly (trying to utilize all CPU resources). That is why you have a lot of remote reads going on, and increasing the affinity factor eliminates running scan fragments on the other (215-12) nodes.

You also mentioned planner.width.max_per_node is set to 8. So, with the increased affinity factor, you have 8 scan fragments doing a lot more work on these 12 nodes. Still, you got a 10X improvement. Seems like your network is the obvious bottleneck. Is it 10G or 1G?

Also, increasing the affinity factor helped in your case because there is no data on the other nodes. But if you have data non-uniformly distributed across more nodes, you might still have the problem.

Thanks,
Padma

On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

Thank you for posting your findings! Glad to see a 10X improvement.

By increasing the affinity factor, it looks like Drill's parallelizer is forced to assign fragments on nodes with data, i.e. with high favorability for data locality.

Regarding the random disconnection, I agree with your guess that the network bandwidth is being used up by remote reads, which causes lags in drillbit-to-ZooKeeper heartbeats (since these services use the same network)? Maybe others can comment here.

Thank you,
Sudheesh

On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We finally got rid of this error. We tried many, many things (like modifying Drill to ignore the error!); it ultimately came down to this change:

from the default
planner.affinity_factor=1.2
to
planner.affinity_factor=100

Basically, this encourages fragments to only care about locally stored files. We looked at the code that uses that property and figured that 100 would have a strong impact.

What led us to this property is the fact that 1/4 of our fragments would take a lot more time to complete their scan, up to 10x the time of the fastest nodes. On the slower nodes, Cloudera Manager would show very low Disk IOPS with high Network IO compared to our faster nodes. We had noticed that before but figured it would be some optimization to be done later, once more pressing issues were fixed, like ZK disconnection and OOM.
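In case it helps anyone trying the same thing: the change itself is a one-line option update and takes effect without a restart. Roughly (a sketch from memory, so double-check the option name on your version):

    ALTER SYSTEM SET `planner.affinity_factor` = 100;
    -- optional sanity check: the option should now show up as changed at the SYSTEM level
    SELECT * FROM sys.options WHERE name LIKE 'planner.affinity%';

ALTER SESSION works the same way if you only want to experiment on a single connection first.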
We were desperate and decided to fix anything that looked unusual.

After this change, queries ran up to 10x faster.

We no longer get random disconnections between nodes and ZooKeeper.

We are still wondering why exactly. The network should not be a bottleneck. Could high network traffic between a Drillbit and HDFS cause the Drillbit to time out with ZooKeeper?

On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com> wrote:

After the 30-second gap, all the Drill nodes receive the following:

2016-09-26 20:07:38,629 [Curator-ServiceCache-0] DEBUG Active drillbit set changed. Now includes 220 total bits. New active drillbits...
...faulty node is not on the list...
2016-09-26 20:07:38,897 [Curator-ServiceCache-0] DEBUG Active drillbit set changed. Now includes 221 total bits. New active drillbits...
...faulty node is back on the list...

So the faulty Drill node gets unregistered and then registered again right after.

Drill uses the low-level API for registering and unregistering, and the only place where unregistering occurs is when the drillbit is closed at shutdown.

That particular drillbit is still up and running after those logs, so it could not have triggered the unregistering process through a shutdown.

Would you have an idea what else could cause a Drillbit to be unregistered from the DiscoveryService and registered again right after?

We are using Zookeeper 3.4.5.

On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We have switched to 1.8 and we are still getting node disconnections.

We did many tests. We initially thought our standalone Parquet converter was generating Parquet files with problematic data (like 10K-character strings), but we were able to reproduce it with the employee data from the tutorial.

For example, we duplicated the Drill tutorial "Employee" data to reach 500 M records spread over 130 Parquet files. Each file is ~60 MB.

We ran this query over and over on 5 different sessions using a script:
select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does not exist%';

The query returns no rows and takes ~35 to 45 seconds to complete.

Leaving the script running on each node, we eventually hit the "nodes lost connectivity during query" error.
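A quick way to watch the drop-out from the coordinator's point of view is to poll the list of registered drillbits from another session while the script loops; a minimal sketch, assuming the sys.drillbits system table behaves the way I remember on this version:

    -- lists the drillbits currently registered with ZooKeeper;
    -- the faulty node should briefly disappear from this list when the error fires
    SELECT * FROM sys.drillbits;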
On the node that failed, we see these logs:

2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report: RUNNING
2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010

<--- 30-second gap for that fragment --->

2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 2 took longer than 500 ms. Actual duration was 23617ms.

2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment manager : ...uuid...

For the same query on a working node:

2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report: RUNNING
2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0, remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet

We are investigating what could cause that 30-second gap for that fragment.

Any ideas, let us know.

Thanks
Francois

On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

If I add a selection filter so that no rows are returned, the same problem occurs. I also simplified the query to include only a few integer columns.

That particular data repo is ~200+ billion records spread over ~50 000 parquet files.

We have other CSV data repos that are 100x smaller and do not trigger this issue.

+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
There is also a bizarre case where the node that is reported as lost is the node itself.
Yes, the stack trace is from the ticket. It did occur once or twice (in the many, many attempts) that it was the node itself.

+ Is there a spike in memory usage of the Drillbit that is the Foreman for the query (process memory, not just heap)?
We don't notice any unusual spike; each node gets busy in the same range when the query is running.

I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, and did not see any improvement.

We will update from 1.7 to 1.8 before going ahead with more investigation.

Thanks a lot.

On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

A simple query with only projections is not an "ideal" use case, since Drill is bound by how fast the client can consume records. There are 1000 scanners sending data to 1 client (vs far fewer scanners sending data in the 12-node case).

This might increase the load on the Drillbit that is the Foreman for the query. In the query profile, the scanners should be spending a lot more time "waiting" to send records to the client (via the root fragment).
+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There is also a bizarre case where the node that is reported as lost is the node itself.
+ Is there a spike in memory usage of the Drillbit that is the Foreman for the query (process memory, not just heap)?

Regarding the warnings ...

2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual duration was 16053ms.

RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this case, and it is resolved in the latest version of Drill. So, as Chun suggested, upgrade the cluster to the latest version of Drill.

2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual duration was 981ms.

I am surprised that responses are taking that long to handle.
+ Are both messages on the same Drillbit?

The other warnings can be ignored.
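To confirm what each drillbit is actually running after the upgrade, the version table can be queried; a small sketch, assuming sys.version is present on your release:

    -- shows the Drill version and the build commit of the connected drillbit
    SELECT * FROM sys.version;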
Thank you,
Sudheesh

[1] I just realized that atsqa4-133.qa.lab is in one of our test environments :)
[2] https://issues.apache.org/jira/browse/DRILL-4766

On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

+ Does the query involve any aggregations or filters? Or is this a select query with only projections?
Simple query with only projections.

+ Any suspicious timings in the query profile?
Nothing especially different from our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one (".." are placeholders): Message of mode .. of rpc type .. took longer than ..ms. Actual duration was ..ms.
Well, we do see this warning on the failing node (on my last test); I found this WARNING in our logs for the past month on pretty much every node I checked.
2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual duration was 981ms.

We really appreciate your help. I will dig into the source code to find out when and why this error happens.

Francois

P.S.:
We also see this:
2016-09-19 14:48:23,444 [drill-executor-9] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the work bus.

On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

More questions..

+ Can you share the query profile?
I will sum it up:
It is a select on 18 columns: 9 strings, 9 integers.
The scan is done on 13862 parquet files spread over 1000 fragments.
Fragments are spread across 215 nodes.

So ~5 leaf fragments (or scanners) per Drillbit seems fine.

+ Does the query involve any aggregations or filters? Or is this a select query with only projections?
+ Any suspicious timings in the query profile?
+ Any suspicious warning messages in the logs around the time of failure on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one (".." are placeholders): Message of mode .. of rpc type .. took longer than ..ms. Actual duration was ..ms.

Thank you,
Sudheesh

On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi Sudheesh,

+ How many zookeeper servers in the quorum?
The quorum has 3 servers; everything looks healthy.

+ What is the load on atsqa4-133.qa.lab when this happens? Any other applications running on that node? How many threads is the Drill process using?
The load on the failing node (8 cores) is 14 when Drill is running, which is nothing out of the ordinary according to our admin.
HBase is also running.
planner.width.max_per_node is set to 8.

+ When running the same query on 12 nodes, is the data size the same?
Yes.

+ Can you share the query profile?
I will sum it up:
It is a select on 18 columns: 9 strings, 9 integers.
The scan is done on 13862 parquet files spread over 1000 fragments.
Fragments are spread across 215 nodes.

We are in the process of increasing our ZooKeeper session timeout config to see if it helps.

Thanks

Francois

On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

Hi Francois,

Few questions:
+ How many zookeeper servers in the quorum?
+ What is the load on atsqa4-133.qa.lab when this happens? Any other applications running on that node? How many threads is the Drill process using?
+ When running the same query on 12 nodes, is the data size the same?
+ Can you share the query profile?

This may not be the right thing to do, but for now, if the cluster is heavily loaded, increase the ZK timeout.
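If I remember right, the session timeout is the drill.exec.zk.timeout boot option in drill-override.conf, and changing it requires a drillbit restart; treat the exact key name as an assumption and verify it first. The current value can be checked with something like this sketch:

    -- boot-time options, including the ZooKeeper client settings, are exposed through sys.boot
    SELECT * FROM sys.boot WHERE name LIKE '%zk%';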
Thank you,
Sudheesh

On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com> wrote:

We are running 1.7.
The logs were taken from the JIRA tickets.

We will try out 1.8 soon.

On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:

Looks like you are running 1.5. I believe there has been some work done in that area and the newer releases should behave better.

On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:

Hi,

We are trying to find a solution/workaround to this issue:

2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325> ForemanException: One or more nodes lost connectivity during query

Has anyone experienced this issue?

It happens when running a query involving many parquet files on a cluster of 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.

It is not caused by garbage collection (checked on both the ZK node and the involved drillbit).

The negotiated max session timeout is 40 seconds.

The sequence seems to be:
- A Drill query begins, using an existing ZK session.
- The Drill ZK session times out
  - perhaps it was writing something that took too long
- Drill attempts to renew the session
- Drill believes that the write operation failed, so it attempts to re-create the ZK node, which triggers another exception.

We are open to any suggestion. We will report any findings.

Thanks
Francois