Re: ZK lost connectivity issue on large cluster

Chun Chang Wed, 14 Sep 2016 11:52:29 -0700

Looks like you are running 1.5. I believe some work has been done in that
area, and the newer releases should behave better.


On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <[email protected]>
wrote:

> Hi,
>
>   We are trying to find a solution or workaround for this issue:
>
> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> One more more nodes lost connectivity during query.  Identified nodes
> were [atsqa4-133.qa.lab:31010].
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> ForemanException: One more more nodes lost connectivity during query.
> Identified nodes were [atsqa4-133.qa.lab:31010].
>         at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
> close(Foreman.java:746)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> processEvent(Foreman.java:858)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> processEvent(Foreman.java:790)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> moveToState(Foreman.java:792)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman.moveToState(
> Foreman.java:909)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman.access$2700(
> Foreman.java:110)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.work.foreman.Foreman$StateListener.
> moveToState(Foreman.java:1183)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>
>
> DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325>
> ForemanException:
> One or more nodes lost connectivity during query
>
>
>
> Has anyone experienced this issue?
>
> It happens when running a query involving many parquet files on a cluster
> of 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.
>
> It is not caused by garbage collection (we checked on both the ZK node and
> the drillbit involved).
>
> Negotiated max session timeout is 40 seconds.
>
> The sequence seems to be:
> - The Drill query begins, using an existing ZK session.
> - The Drill ZK session times out
>       - perhaps it was writing something that took too long
> - Drill attempts to renew the session
>        - Drill believes that the write operation failed, so it attempts to
> re-create the ZK node, which triggers another exception.
>
>  We are open to any suggestions. We will report any findings.
>
> Thanks
> Francois
>
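
For reference, the 40-second figure quoted above matches ZooKeeper's default
session timeout ceiling: the server clamps whatever the client requests into
the range [minSessionTimeout, maxSessionTimeout], which default to 2 x tickTime
and 20 x tickTime. With the commonly used tickTime of 2000 ms the ceiling is
40000 ms. Below is a minimal sketch of that arithmetic. The function name and
the requested values in the example calls are made up for illustration, and
the tickTime on this particular cluster is an assumption that should be
checked against its zoo.cfg.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Return the session timeout a ZooKeeper server would grant.

    Hypothetical helper for illustration only. It mirrors the documented
    rule: the requested value is clamped between minSessionTimeout
    (default 2 * tickTime) and maxSessionTimeout (default 20 * tickTime).
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))


# Illustrative requests (not taken from the cluster in this thread):
print(negotiated_session_timeout_ms(60000))                     # capped at the 20 * tickTime ceiling
print(negotiated_session_timeout_ms(60000, tick_time_ms=3000))  # a larger tickTime raises the ceiling
```

If that assumption holds, asking for a longer timeout on the client side alone
would have no effect beyond 40 seconds; the ceiling would have to be raised on
the ZooKeeper servers (via tickTime or maxSessionTimeout) as well.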
