Re: ZK lost connectivity issue on large cluster

Sudheesh Katkam Mon, 19 Sep 2016 10:40:23 -0700

One more interesting thing and another guess to resolve the problem,

> P.S.:
> We do see this also:
> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the
> work bus.


+ Are there 220 nodes in your cluster?

If that is the case and these fragments are scanners, they seem be be assigned 
in a round robin fashion (and not based on data locality?). If so, please 
upgrade to the latest version where DRILL-4446 [1] is resolved.

Thank you,
Sudheesh

[1] https://issues.apache.org/jira/browse/DRILL-4446 
<https://issues.apache.org/jira/browse/DRILL-4446>

> On Sep 19, 2016, at 10:19 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> 
> Hi Francois,
> 
> A simple query with only projections is not an “ideal” use case, since Drill 
> is bound by how fast the client can consume records. There are 1000 scanners 
> sending data to 1 client (vs far fewer scanners sending data in the 12 node 
> case).
> 
> This might increase the load on the Drillbit that is the Foreman for the 
> query. In the query profile, the scanners should be spending a lot more time 
> “waiting” to send records to the client (via root fragment).
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There 
> is also a bizarre case where the node that is reported as lost is the node 
> itself.
> + Is there a spike in memory usage of the Drillbit this is the Foreman for 
> the query (process memory, not just heap)?
> 
> Regarding the warnings ...
> 
>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
>> 6 took longer than 500 ms. Actual Duration was 16053ms.
> 
> 
> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this 
> case, which is resolved in the latest version of Drill. So as Chun suggested, 
> upgrade the cluster to the latest version of Drill.
> 
>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
>> 1 took longer than 500 ms. Actual Duration was 981ms.
> 
> I am surprised that responses are taking that long to handle.
> + Are both messages on the same Drillbit?
> 
> The other warnings can be ignored.
> 
> Thank you,
> Sudheesh
> 
> [1] I just realized that atsqa4-133.qa.lab is in one of our test environments 
> :)
> [2] https://issues.apache.org/jira/browse/DRILL-4766 
> <https://issues.apache.org/jira/browse/DRILL-4766>
> 
>> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com 
>> <mailto:fmetho...@gmail.com>> wrote:
>> 
>> Hi Sudheesh,
>> 
>> 
>> + Does the query involve any aggregations or filters? Or is this a select
>> query with only projections?
>> Simple query with only projections
>> 
>> + Any suspicious timings in the query profile?
>> Nothing specially different than our working query on our small cluster.
>> 
>> + Any suspicious warning messages in the logs around the time of failure on
>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
>> (“..” are place holders):
>>  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
>> was ..ms.
>> 
>> Well we do see this warning on the failing node (on my last test), I found
>> this WARNING in our log for the past month for pretty much every node I
>> checked.
>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
>> 1 took longer than 500 ms. Actual Duration was 981ms.
>> 
>> We really appreciate your help. I will dig in the source code for when and
>> why this error happen.
>> 
>> 
>> Francois
>> 
>> P.S.:
>> We do see this also:
>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the
>> work bus.
>> 
>> 
>> 
>> 
>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com 
>> <mailto:skat...@maprtech.com>>
>> wrote:
>> 
>>> Hi Francois,
>>> 
>>> More questions..
>>> 
>>>> + Can you share the query profile?
>>>>  I will sum it up:
>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>>> Fragments are spread accross 215 nodes.
>>> 
>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>>> 
>>> + Does the query involve any aggregations or filters? Or is this a select
>>> query with only projections?
>>> + Any suspicious timings in the query profile?
>>> + Any suspicious warning messages in the logs around the time of failure
>>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
>>> (“..” are place holders):
>>>  Message of mode .. of rpc type .. took longer than ..ms.  Actual
>>> duration was ..ms.
>>> 
>>> Thank you,
>>> Sudheesh
>>> 
>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com 
>>>> <mailto:fmetho...@gmail.com>>
>>> wrote:
>>>> 
>>>> Hi Sudheesh,
>>>> 
>>>> + How many zookeeper servers in the quorum?
>>>> The quorum has 3 servers, everything looks healthy
>>>> 
>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>>> applications running on that node? How many threads is the Drill process
>>>> using?
>>>> The load on the failing node(8 cores) is 14, when Drill is running. Which
>>>> is nothing out of the ordinary according to our admin.
>>>> HBase is also running.
>>>> planner.width.max_per_node is set to 8
>>>> 
>>>> + When running the same query on 12 nodes, is the data size same?
>>>> Yes
>>>> 
>>>> + Can you share the query profile?
>>>>  I will sum it up:
>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>>> Fragments are spread accross 215 nodes.
>>>> 
>>>> 
>>>> We are in process of increasing our Zookeeper session timeout config to
>>> see
>>>> if it helps.
>>>> 
>>>> thanks
>>>> 
>>>> Francois
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com 
>>>> <mailto:skat...@maprtech.com>>
>>>> wrote:
>>>> 
>>>>> Hi Francois,
>>>>> 
>>>>> Few questions:
>>>>> + How many zookeeper servers in the quorum?
>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>>>> applications running on that node? How many threads is the Drill process
>>>>> using?
>>>>> + When running the same query on 12 nodes, is the data size same?
>>>>> + Can you share the query profile?
>>>>> 
>>>>> This may not be the right thing to do, but for now, If the cluster is
>>>>> heavily loaded, increase the zk timeout.
>>>>> 
>>>>> Thank you,
>>>>> Sudheesh
>>>>> 
>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com 
>>>>>> <mailto:fmetho...@gmail.com>>
>>>>> wrote:
>>>>>> 
>>>>>> We are running 1.7.
>>>>>> The log were taken from the jira tickets.
>>>>>> 
>>>>>> We will try out 1.8 soon.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com 
>>>>>> <mailto:cch...@maprtech.com>>
>>> wrote:
>>>>>> 
>>>>>>> Looks like you are running 1.5. I believe there are some work done in
>>>>> that
>>>>>>> area and the newer release should behave better.
>>>>>>> 
>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <
>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> We are trying to find a solution/workaround to issue:
>>>>>>>> 
>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
>>> ForemanException:
>>>>>>>> One more more nodes lost connectivity during query.  Identified nodes
>>>>>>>> were [atsqa4-133.qa.lab:31010].
>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>>>>>>> ForemanException: One more more nodes lost connectivity during query.
>>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
>>>>>>>> close(Foreman.java:746)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>> processEvent(Foreman.java:858)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>> processEvent(Foreman.java:790)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>>> moveToState(Foreman.java:792)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman.moveToState(
>>>>>>>> Foreman.java:909)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman.access$2700(
>>>>>>>> Foreman.java:110)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateListener.
>>>>>>>> moveToState(Foreman.java:1183)
>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325 
>>>>>>>> <https://issues.apache.org/jira/browse/DRILL-4325>>
>>>>>>>> ForemanException:
>>>>>>>> One or more nodes lost connectivity during query
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Any one experienced this issue ?
>>>>>>>> 
>>>>>>>> It happens when running query involving many parquet files on a
>>> cluster
>>>>>>> of
>>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
>>>>>>>> 
>>>>>>>> It is not caused by garbage collection, (checked on both ZK node and
>>>>> the
>>>>>>>> involved drill bit).
>>>>>>>> 
>>>>>>>> Negotiated max session timeout is 40 seconds.
>>>>>>>> 
>>>>>>>> The sequence seems:
>>>>>>>> - Drill Query begins, using an existing ZK session.
>>>>>>>> - Drill Zk session timeouts
>>>>>>>>    - perhaps it was writing something that took too long
>>>>>>>> - Drill attempts to renew session
>>>>>>>>     - drill believes that the write operation failed, so it attempts
>>>>>>> to
>>>>>>>> re-create the zk node, which trigger another exception.
>>>>>>>> 
>>>>>>>> We are open to any suggestion. We will report any finding.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Francois
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>

Re: ZK lost connectivity issue on large cluster

Reply via email to