Re: ZK lost connectivity issue on large cluster

Sudheesh Katkam Fri, 14 Oct 2016 11:18:29 -0700

Hi Francois,

Thank you for posting your findings! Glad to see a 10X improvement.


By increasing affinity factor, looks like Drill’s parallelizer is forced to 
assign fragments on nodes with data i.e. with high favorability for data 
locality.

Regarding the random disconnection, I agree with your guess that the network 
bandwidth is being used up by remote reads which causes lags in drillbit to 
ZooKeeper heartbeats (since these services use the same network)? Maybe others 
can comment here.

Thank you,
Sudheesh

> On Oct 12, 2016, at 6:06 PM, François Méthot <[email protected]> wrote:
> 
> Hi,
> 
> We finally got rid of this error. We have tried many, many things  (like
> modifying drill to ignore the error!), it ultimately came down to this
> change:
> 
> from default
> planner.affinity_factor=1.2
>   to
> planner.affinity_factor=100
> 
> Basically this encourages fragment to only care about locally stored files.
> We looked at the code that used that property and figured that 100 would
> have strong impact.
> 
> What led us to this property is the fact that 1/4 of our fragments would
> take a lot more time to complete their scan, up to  10x the time of the
> fastest nodes.  On the slower nodes, Cloudera Manager would show very low
> Disk IOPS with high Network IO compare to our faster nodes. We had noticed
> that before but figured it would be some optimization to be done later when
> more pressing issue would be fixed, like Zk disconnection and OOM. We were
> desperate and decided to fix anything that  would look unusual.
> 
> After this change, query ran up to 10x faster.
> 
> We no longer get random disconnection between node and Zookeeper.
> 
> We are still wondering why exactly. Network should not be a bottleneck.
> Could high network traffic between a Drillbit and HDFS causes the Drillbit
> to timeout with zookeeper?
> 
> 
> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <[email protected]>
> wrote:
> 
>> After the 30 seconds gap, all the Drill nodes receives the following:
>> 
>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
>> changed. Now includes 220 total bits. New Active drill bits....
>> ...faulty node is not on the list...
>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
>> changed. Now includes 221 total bits. New Active drill bits....
>> ...faulty node is back on the list...
>> 
>> 
>> So the faulty Drill node get unregistered and registered right after.
>> 
>> Drill is using the low level API for registering and unregistering, and
>> the only place with unregistering occurs is when the drillbit is closed at
>> shutdown.
>> 
>> That particular drillbit is still up and running after those log, it could
>> not have trigger the unregistering process through a shutdown.
>> 
>> 
>> 
>> 
>> Would you have an idea what else could cause a Drillbit to be unregistered
>> from the DiscoveryService and registered again right after?
>> 
>> 
>> 
>> We are using Zookeeper 3.4.5
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <[email protected]>
>> wrote:
>> 
>>> Hi,
>>> 
>>> We have switched to 1.8 and we are still getting node disconnection.
>>> 
>>> We did many tests, we thought initially our stand alone parquet converter
>>> was generating parquet files with problematic data (like 10K characters
>>> string), but we were able to reproduce it with employee data from the
>>> tutorial.
>>> 
>>> For example,  we duplicated the Drill Tutorial "Employee" data to reach
>>> 500 M records spread over 130 parquet files.
>>> Each files is ~60 MB.
>>> 
>>> 
>>> We ran this query over and over on 5 different sessions using a script:
>>>   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
>>> not exist%';
>>> 
>>>   Query return no rows and would take ~35 to 45 seconds to return.
>>> 
>>> Leaving the script running on each node, we eventually hit the "nodes
>>> lost connectivity during query" error.
>>> 
>>> One the done that failed,
>>> 
>>>   We see those log:
>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
>>> RUNNING
>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>>> 
>>> <--- 30 seconds gap for that fragment --->
>>> 
>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
>>> o.a.d.exec.rpc.control.ControlServer
>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
>>> duration was 23617ms.
>>> 
>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
>>> manager : ...uuid...
>>> 
>>> 
>>> 
>>> For the same query on a working node:
>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
>>> RUNNING
>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
>>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
>>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row
>>> groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet
>>> 
>>> 
>>> 
>>> 
>>> We are investigating what could get cause that 30 seconds gap for that
>>> fragment.
>>> 
>>> Any idea let us know
>>> 
>>> Thanks
>>> Francois
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <[email protected]>
>>> wrote:
>>> 
>>>> Hi Sudheesh,
>>>> 
>>>>  If I add selection filter so that no row are returned, the same
>>>> problem occur. I also simplified the query to include only few integer
>>>> columns.
>>>> 
>>>> That particular data repo is ~200+ Billions records spread over ~50 000
>>>> parquet files.
>>>> 
>>>> We have other CSV data repo that are 100x smaller that does not trigger
>>>> this issue.
>>>> 
>>>> 
>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>>>> There is also a bizarre case where the node that is reported as lost is the
>>>> node itself.
>>>> Yes, the stack trace is from the ticket, It did occurred once or twice
>>>> (in the many many attempts) that it was the node itself.
>>>> 
>>>> + Is there a spike in memory usage of the Drillbit this is the Foreman
>>>> for the query (process memory, not just heap)?
>>>> We don't notice any unusual spike, each nodes gets busy in the same
>>>> range when query is running.
>>>> 
>>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see
>>>> any improvement.
>>>> 
>>>> 
>>>> We will update from 1.7 to 1.8 before going ahead with more
>>>> investigation.
>>>> 
>>>> Thanks a lot.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Francois,
>>>>> 
>>>>> A simple query with only projections is not an “ideal” use case, since
>>>>> Drill is bound by how fast the client can consume records. There are 1000
>>>>> scanners sending data to 1 client (vs far fewer scanners sending data in
>>>>> the 12 node case).
>>>>> 
>>>>> This might increase the load on the Drillbit that is the Foreman for
>>>>> the query. In the query profile, the scanners should be spending a lot 
>>>>> more
>>>>> time “waiting” to send records to the client (via root fragment).
>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>>>>> There is also a bizarre case where the node that is reported as lost is 
>>>>> the
>>>>> node itself.
>>>>> + Is there a spike in memory usage of the Drillbit this is the Foreman
>>>>> for the query (process memory, not just heap)?
>>>>> 
>>>>> Regarding the warnings ...
>>>>> 
>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
>>>>> rpc type
>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>>>>> 
>>>>> 
>>>>> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in
>>>>> this case, which is resolved in the latest version of Drill. So as Chun
>>>>> suggested, upgrade the cluster to the latest version of Drill.
>>>>> 
>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
>>>>> rpc type
>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
>>>>> 
>>>>> I am surprised that responses are taking that long to handle.
>>>>> + Are both messages on the same Drillbit?
>>>>> 
>>>>> The other warnings can be ignored.
>>>>> 
>>>>> Thank you,
>>>>> Sudheesh
>>>>> 
>>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test
>>>>> environments :)
>>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766 <
>>>>> https://issues.apache.org/jira/browse/DRILL-4766>
>>>>> 
>>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot <[email protected]>
>>>>> wrote:
>>>>>> 
>>>>>> Hi Sudheesh,
>>>>>> 
>>>>>> 
>>>>>> + Does the query involve any aggregations or filters? Or is this a
>>>>> select
>>>>>> query with only projections?
>>>>>> Simple query with only projections
>>>>>> 
>>>>>> + Any suspicious timings in the query profile?
>>>>>> Nothing specially different than our working query on our small
>>>>> cluster.
>>>>>> 
>>>>>> + Any suspicious warning messages in the logs around the time of
>>>>> failure on
>>>>>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this
>>>>> one
>>>>>> (“..” are place holders):
>>>>>> Message of mode .. of rpc type .. took longer than ..ms.  Actual
>>>>> duration
>>>>>> was ..ms.
>>>>>> 
>>>>>> Well we do see this warning on the failing node (on my last test), I
>>>>> found
>>>>>> this WARNING in our log for the past month for pretty much every node
>>>>> I
>>>>>> checked.
>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
>>>>> rpc type
>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
>>>>> rpc type
>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
>>>>>> 
>>>>>> We really appreciate your help. I will dig in the source code for
>>>>> when and
>>>>>> why this error happen.
>>>>>> 
>>>>>> 
>>>>>> Francois
>>>>>> 
>>>>>> P.S.:
>>>>>> We do see this also:
>>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found
>>>>> in the
>>>>>> work bus.
>>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found
>>>>> in the
>>>>>> work bus.
>>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found
>>>>> in the
>>>>>> work bus.
>>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found
>>>>> in the
>>>>>> work bus.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <
>>>>> [email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Francois,
>>>>>>> 
>>>>>>> More questions..
>>>>>>> 
>>>>>>>> + Can you share the query profile?
>>>>>>>> I will sum it up:
>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>>>>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>>>>>>> Fragments are spread accross 215 nodes.
>>>>>>> 
>>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>>>>>>> 
>>>>>>> + Does the query involve any aggregations or filters? Or is this a
>>>>> select
>>>>>>> query with only projections?
>>>>>>> + Any suspicious timings in the query profile?
>>>>>>> + Any suspicious warning messages in the logs around the time of
>>>>> failure
>>>>>>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially
>>>>> this one
>>>>>>> (“..” are place holders):
>>>>>>> Message of mode .. of rpc type .. took longer than ..ms.  Actual
>>>>>>> duration was ..ms.
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Sudheesh
>>>>>>> 
>>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <[email protected]>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Sudheesh,
>>>>>>>> 
>>>>>>>> + How many zookeeper servers in the quorum?
>>>>>>>> The quorum has 3 servers, everything looks healthy
>>>>>>>> 
>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>>>>>>> applications running on that node? How many threads is the Drill
>>>>> process
>>>>>>>> using?
>>>>>>>> The load on the failing node(8 cores) is 14, when Drill is running.
>>>>> Which
>>>>>>>> is nothing out of the ordinary according to our admin.
>>>>>>>> HBase is also running.
>>>>>>>> planner.width.max_per_node is set to 8
>>>>>>>> 
>>>>>>>> + When running the same query on 12 nodes, is the data size same?
>>>>>>>> Yes
>>>>>>>> 
>>>>>>>> + Can you share the query profile?
>>>>>>>> I will sum it up:
>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>>>>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>>>>>>> Fragments are spread accross 215 nodes.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> We are in process of increasing our Zookeeper session timeout
>>>>> config to
>>>>>>> see
>>>>>>>> if it helps.
>>>>>>>> 
>>>>>>>> thanks
>>>>>>>> 
>>>>>>>> Francois
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <
>>>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Francois,
>>>>>>>>> 
>>>>>>>>> Few questions:
>>>>>>>>> + How many zookeeper servers in the quorum?
>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
>>>>> other
>>>>>>>>> applications running on that node? How many threads is the Drill
>>>>> process
>>>>>>>>> using?
>>>>>>>>> + When running the same query on 12 nodes, is the data size same?
>>>>>>>>> + Can you share the query profile?
>>>>>>>>> 
>>>>>>>>> This may not be the right thing to do, but for now, If the cluster
>>>>> is
>>>>>>>>> heavily loaded, increase the zk timeout.
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> Sudheesh
>>>>>>>>> 
>>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <
>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> We are running 1.7.
>>>>>>>>>> The log were taken from the jira tickets.
>>>>>>>>>> 
>>>>>>>>>> We will try out 1.8 soon.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <[email protected]>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Looks like you are running 1.5. I believe there are some work
>>>>> done in
>>>>>>>>> that
>>>>>>>>>>> area and the newer release should behave better.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <
>>>>>>> [email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> We are trying to find a solution/workaround to issue:
>>>>>>>>>>>> 
>>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>>>>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
>>>>>>> ForemanException:
>>>>>>>>>>>> One more more nodes lost connectivity during query.  Identified
>>>>> nodes
>>>>>>>>>>>> were [atsqa4-133.qa.lab:31010].
>>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during
>>>>> query.
>>>>>>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>>>>>>>>     at org.apache.drill.exec.work.for
>>>>> eman.Foreman$ForemanResult.
>>>>>>>>>>>> close(Foreman.java:746)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>>>>>> processEvent(Foreman.java:858)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>>>>>> processEvent(Foreman.java:790)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>>>>>> moveToState(Foreman.java:792)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.foreman.Foreman.moveToState(
>>>>>>>>>>>> Foreman.java:909)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.foreman.Foreman.access$2700(
>>>>>>>>>>>> Foreman.java:110)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>     at org.apache.drill.exec.work.for
>>>>> eman.Foreman$StateListener.
>>>>>>>>>>>> moveToState(Foreman.java:1183)
>>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325>
>>>>>>>>>>>> ForemanException:
>>>>>>>>>>>> One or more nodes lost connectivity during query
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Any one experienced this issue ?
>>>>>>>>>>>> 
>>>>>>>>>>>> It happens when running query involving many parquet files on a
>>>>>>> cluster
>>>>>>>>>>> of
>>>>>>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs
>>>>> fine.
>>>>>>>>>>>> 
>>>>>>>>>>>> It is not caused by garbage collection, (checked on both ZK
>>>>> node and
>>>>>>>>> the
>>>>>>>>>>>> involved drill bit).
>>>>>>>>>>>> 
>>>>>>>>>>>> Negotiated max session timeout is 40 seconds.
>>>>>>>>>>>> 
>>>>>>>>>>>> The sequence seems:
>>>>>>>>>>>> - Drill Query begins, using an existing ZK session.
>>>>>>>>>>>> - Drill Zk session timeouts
>>>>>>>>>>>>   - perhaps it was writing something that took too long
>>>>>>>>>>>> - Drill attempts to renew session
>>>>>>>>>>>>    - drill believes that the write operation failed, so it
>>>>> attempts
>>>>>>>>>>> to
>>>>>>>>>>>> re-create the zk node, which trigger another exception.
>>>>>>>>>>>> 
>>>>>>>>>>>> We are open to any suggestion. We will report any finding.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Francois
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>

Re: ZK lost connectivity issue on large cluster

Reply via email to