Looks like you are running 1.5. I believe some work has been done in that area, and a newer release should behave better.
On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> Hi,
>
> We are trying to find a solution/workaround to issue:
>
> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
>     at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>     at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>
> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> ForemanException: One or more nodes lost connectivity during query
>
> Any one experienced this issue ?
>
> It happens when running query involving many parquet files on a cluster of
> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
>
> It is not caused by garbage collection, (checked on both ZK node and the
> involved drill bit).
>
> Negotiated max session timeout is 40 seconds.
>
> The sequence seems:
> - Drill Query begins, using an existing ZK session.
> - Drill Zk session timeouts
>     - perhaps it was writing something that took too long
> - Drill attempts to renew session
>     - drill believes that the write operation failed, so it attempts to
>       re-create the zk node, which trigger another exception.
>
> We are open to any suggestion. We will report any finding.
>
> Thanks
> Francois
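
A note on the 40-second figure quoted above: a ZooKeeper server clamps the session timeout a client asks for into the range [minSessionTimeout, maxSessionTimeout], which default to 2 x tickTime and 20 x tickTime. With the commonly used tickTime of 2000 ms the ceiling is 40 seconds, which would match the negotiated value reported. The sketch below only illustrates that arithmetic. The function name and the assumption that the cluster runs with default bounds are mine, not taken from the report.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Illustrative only: clamp a requested ZooKeeper session timeout.

    Assumes the server-side defaults (min = 2 * tickTime,
    max = 20 * tickTime) unless explicit bounds are passed in.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    # The server never grants less than min_ms or more than max_ms.
    return max(min_ms, min(requested_ms, max_ms))


if __name__ == "__main__":
    # A client asking for 60 s against default bounds is capped at the maximum.
    print(negotiated_session_timeout_ms(60_000))
    # Raising tickTime (or maxSessionTimeout) on the server lifts the cap.
    print(negotiated_session_timeout_ms(60_000, tick_time_ms=4000))
```

If the ceiling really is the default, asking for a longer timeout on the client side alone would not change the negotiated value; the server-side bound would have to be raised as well.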