We are running 1.7. The logs were taken from the JIRA tickets. We will try out 1.8 soon.
On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
> Looks like you are running 1.5. I believe there was some work done in that
> area and the newer release should behave better.
>
> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> >
> > Hi,
> >
> > We are trying to find a solution/workaround to this issue:
> >
> > 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> > o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> > One more more nodes lost connectivity during query. Identified nodes
> > were [atsqa4-133.qa.lab:31010].
> > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> > ForemanException: One more more nodes lost connectivity during query.
> > Identified nodes were [atsqa4-133.qa.lab:31010].
> >     at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >
> > DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> > ForemanException: One or more nodes lost connectivity during query
> >
> > Has anyone experienced this issue?
> >
> > It happens when running a query involving many Parquet files on a cluster
> > of 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.
> >
> > It is not caused by garbage collection (checked on both the ZK node and
> > the involved drillbit).
> >
> > The negotiated max session timeout is 40 seconds.
> >
> > The sequence seems to be:
> > - The Drill query begins, using an existing ZK session.
> > - The Drill ZK session times out
> >   - perhaps it was writing something that took too long.
> > - Drill attempts to renew the session.
> > - Drill believes that the write operation failed, so it attempts to
> >   re-create the ZK node, which triggers another exception.
> >
> > We are open to any suggestion. We will report any finding.
> >
> > Thanks
> > Francois
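The failure mode described in the sequence above — a write that actually reached ZooKeeper but whose acknowledgement was lost when the session expired, followed by a blind retry that fails with "node already exists" — can be sketched in plain Java. This is only an illustration against a hypothetical in-memory node store (not Drill's or Curator's actual code); the fix it demonstrates is making the retry idempotent by treating "already exists" as success:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a ZooKeeper namespace, used only to
// illustrate the ambiguous-write problem: the client's first create() may
// have succeeded on the server even though the client saw a timeout.
public class IdempotentCreate {
    static class NodeExistsException extends Exception {}

    private final ConcurrentMap<String, byte[]> znodes = new ConcurrentHashMap<>();

    // Strict create: fails if the node is already there (like ZK's create()).
    void create(String path, byte[] data) throws NodeExistsException {
        if (znodes.putIfAbsent(path, data) != null) {
            throw new NodeExistsException();
        }
    }

    // Idempotent retry: "already exists" is treated as success, since the
    // existing node may be the ghost of our own first, un-acked attempt.
    boolean createIfAbsent(String path, byte[] data) {
        try {
            create(path, data);
            return true;   // we created it on this attempt
        } catch (NodeExistsException e) {
            return false;  // already present (possibly from the lost first write)
        }
    }

    public static void main(String[] args) throws Exception {
        IdempotentCreate store = new IdempotentCreate();
        // First attempt actually succeeded on the "server", but suppose the
        // ack was lost when the session expired:
        store.create("/drill/drillbits/node1", new byte[0]);
        // A naive retry via create() would now throw NodeExistsException;
        // the idempotent retry does not:
        boolean created = store.createIfAbsent("/drill/drillbits/node1", new byte[0]);
        System.out.println(created ? "created" : "already-present");
    }
}
```

A real client would additionally need to distinguish its own stale node from one owned by another live session (e.g. by comparing the stored data), which this sketch leaves out.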
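Since the negotiated session timeout is part of the problem, one workaround worth trying is raising Drill's ZooKeeper session timeout in drill-override.conf. A minimal sketch (hostnames and values are placeholders; the requested timeout must also fall within the ZK server's own minSessionTimeout/maxSessionTimeout range, or the server will clamp it):

```hocon
drill.exec: {
  cluster-id: "drillbits1",
  zk: {
    connect: "zk1:2181,zk2:2181,zk3:2181",
    # Session timeout in milliseconds; default is 5000. A larger value gives
    # slow writes more time before the session is declared dead.
    timeout: 30000,
    retry: {
      count: 7200,
      delay: 500
    }
  }
}
```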