Hi Francois,

More questions..

> + Can you share the query profile?
> I will sum it up:
> It is a select on 18 columns: 9 string, 9 integers.
> Scan is done on 13862 parquet files spread on 1000 fragments.
> Fragments are spread across 215 nodes.

So ~5 leaf fragments (or scanners) per Drillbit seems fine.

+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?
+ Any suspicious timings in the query profile?
+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one
(".." are placeholders):

Message of mode .. of rpc type .. took longer than ..ms. Actual duration
was ..ms.
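For reference, one quick way to hunt for that warning is to grep the
drillbit logs on every node around the failure time. A minimal sketch,
assuming a hypothetical host list file and logs under /var/log/drill
(tarball installs keep them under $DRILL_HOME/log instead):

  # hypothetical paths and host list -- adjust to your deployment
  for h in $(cat drill-hosts.txt); do
    ssh "$h" 'grep -H "took longer than" /var/log/drill/drillbit.log*'
  done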
Thank you,
Sudheesh

> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:
>
> Hi Sudheesh,
>
> + How many zookeeper servers in the quorum?
> The quorum has 3 servers, everything looks healthy.
>
> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> applications running on that node? How many threads is the Drill process
> using?
> The load on the failing node (8 cores) is 14 when Drill is running, which
> is nothing out of the ordinary according to our admin.
> HBase is also running.
> planner.width.max_per_node is set to 8
>
> + When running the same query on 12 nodes, is the data size same?
> Yes
>
> + Can you share the query profile?
> I will sum it up:
> It is a select on 18 columns: 9 string, 9 integers.
> Scan is done on 13862 parquet files spread on 1000 fragments.
> Fragments are spread across 215 nodes.
>
> We are in process of increasing our Zookeeper session timeout config to
> see if it helps.
>
> thanks
>
> Francois
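On the session-timeout change François mentions above: on the Drill side
this is the drill.exec.zk.timeout setting (in milliseconds) in
drill-override.conf on each drillbit. A minimal sketch with illustrative
values -- the 30000 is an assumption, not a recommendation, and the
effective timeout is still capped by the ZooKeeper servers (see the note
at the end of this thread):

  # drill-override.conf on every drillbit -- illustrative values
  drill.exec: {
    zk: {
      connect: "zk1:2181,zk2:2181,zk3:2181"  # your quorum
      timeout: 30000  # requested session timeout in ms (stock default is 5000)
    }
  }

The drillbits need a restart for this to take effect.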
> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com>
> wrote:
>
>> Hi Francois,
>>
>> Few questions:
>> + How many zookeeper servers in the quorum?
>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>> applications running on that node? How many threads is the Drill process
>> using?
>> + When running the same query on 12 nodes, is the data size same?
>> + Can you share the query profile?
>>
>> This may not be the right thing to do, but for now, if the cluster is
>> heavily loaded, increase the zk timeout.
>>
>> Thank you,
>> Sudheesh
>>
>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com>
>>> wrote:
>>>
>>> We are running 1.7.
>>> The logs were taken from the jira tickets.
>>>
>>> We will try out 1.8 soon.
>>>
>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
>>>
>>>> Looks like you are running 1.5. I believe there was some work done in
>>>> that area and the newer releases should behave better.
>>>>
>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are trying to find a solution/workaround to this issue:
>>>>>
>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
>>>>> One more more nodes lost connectivity during query. Identified nodes
>>>>> were [atsqa4-133.qa.lab:31010].
>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>>>> ForemanException: One more more nodes lost connectivity during query.
>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>
>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
>>>>> ForemanException: One or more nodes lost connectivity during query
>>>>>
>>>>> Has anyone experienced this issue?
>>>>>
>>>>> It happens when running a query involving many parquet files on a
>>>>> cluster of 200 nodes. The same query on a smaller cluster of 12 nodes
>>>>> runs fine.
>>>>>
>>>>> It is not caused by garbage collection (checked on both the ZK node
>>>>> and the involved drillbit).
>>>>>
>>>>> Negotiated max session timeout is 40 seconds.
>>>>>
>>>>> The sequence seems to be:
>>>>> - Drill query begins, using an existing ZK session.
>>>>> - The Drill ZK session times out
>>>>>   - perhaps it was writing something that took too long
>>>>> - Drill attempts to renew the session
>>>>>   - Drill believes that the write operation failed, so it attempts
>>>>>     to re-create the zk node, which triggers another exception.
>>>>>
>>>>> We are open to any suggestion. We will report any finding.
>>>>>
>>>>> Thanks
>>>>> Francois
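A closing note on the "negotiated max session timeout is 40 seconds" point
above: ZooKeeper clamps whatever timeout a client requests to the server-side
window [minSessionTimeout, maxSessionTimeout], and maxSessionTimeout defaults
to 20 * tickTime. With the common tickTime of 2000 ms that is exactly the
40 s ceiling observed, so raising the Drill-side timeout beyond 40 s also
requires raising the ceiling in zoo.cfg on every ZooKeeper server. A sketch
with illustrative values:

  # zoo.cfg on each ZooKeeper server -- illustrative values
  tickTime=2000
  # maxSessionTimeout defaults to 20 * tickTime = 40000 ms,
  # i.e. the 40 s ceiling reported above
  maxSessionTimeout=120000

The ZooKeeper servers need a (rolling) restart before clients can negotiate
the larger timeout.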