We are running 1.7. The logs were taken from the JIRA tickets. We will try out 1.8 soon.
On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
> Looks like you are running 1.5. I believe there was some work done in that
> area and the newer release should behave better.
>
> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> >
> > Hi,
> >
> > We are trying to find a solution/workaround to this issue:
> >
> > 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> > o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> > One more more nodes lost connectivity during query. Identified nodes
> > were [atsqa4-133.qa.lab:31010].
> > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> > ForemanException: One more more nodes lost connectivity during query.
> > Identified nodes were [atsqa4-133.qa.lab:31010].
> >     at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >     at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >
> > DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> > ForemanException: One or more nodes lost connectivity during query
> >
> > Has anyone experienced this issue?
> >
> > It happens when running a query involving many Parquet files on a cluster
> > of 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.
> >
> > It is not caused by garbage collection (checked on both the ZK node and
> > the involved drillbit).
> >
> > The negotiated max session timeout is 40 seconds.
> >
> > The sequence seems to be:
> > - The Drill query begins, using an existing ZK session.
> > - The Drill ZK session times out
> >   - perhaps it was writing something that took too long.
> > - Drill attempts to renew the session.
> > - Drill believes that the write operation failed, so it attempts to
> >   re-create the ZK node, which triggers another exception.
> >
> > We are open to any suggestion. We will report any finding.
> >
> > Thanks
> > Francois
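The failure mode described in the sequence above — a write that actually reached ZooKeeper but whose acknowledgement was lost when the session expired, followed by a blind retry that fails with "node already exists" — can be sketched in plain Java. This is only an illustration against a hypothetical in-memory node store (not Drill's or Curator's actual code); the fix it demonstrates is making the retry idempotent by treating "already exists" as success:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a ZooKeeper namespace, used only to
// illustrate the ambiguous-write problem: the client's first create() may
// have succeeded on the server even though the client saw a timeout.
public class IdempotentCreate {
    static class NodeExistsException extends Exception {}

    private final ConcurrentMap<String, byte[]> znodes = new ConcurrentHashMap<>();

    // Strict create: fails if the node is already there (like ZK's create()).
    void create(String path, byte[] data) throws NodeExistsException {
        if (znodes.putIfAbsent(path, data) != null) {
            throw new NodeExistsException();
        }
    }

    // Idempotent retry: "already exists" is treated as success, since the
    // existing node may be the ghost of our own first, un-acked attempt.
    boolean createIfAbsent(String path, byte[] data) {
        try {
            create(path, data);
            return true;   // we created it on this attempt
        } catch (NodeExistsException e) {
            return false;  // already present (possibly from the lost first write)
        }
    }

    public static void main(String[] args) throws Exception {
        IdempotentCreate store = new IdempotentCreate();
        // First attempt actually succeeded on the "server", but suppose the
        // ack was lost when the session expired:
        store.create("/drill/drillbits/node1", new byte[0]);
        // A naive retry via create() would now throw NodeExistsException;
        // the idempotent retry does not:
        boolean created = store.createIfAbsent("/drill/drillbits/node1", new byte[0]);
        System.out.println(created ? "created" : "already-present");
    }
}
```

A real client would additionally need to distinguish its own stale node from one owned by another live session (e.g. by comparing the stored data), which this sketch leaves out.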
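Since the negotiated session timeout is part of the problem, one workaround worth trying is raising Drill's ZooKeeper session timeout in drill-override.conf. A minimal sketch (hostnames and values are placeholders; the requested timeout must also fall within the ZK server's own minSessionTimeout/maxSessionTimeout range, or the server will clamp it):

```hocon
drill.exec: {
  cluster-id: "drillbits1",
  zk: {
    connect: "zk1:2181,zk2:2181,zk3:2181",
    # Session timeout in milliseconds; default is 5000. A larger value gives
    # slow writes more time before the session is declared dead.
    timeout: 30000,
    retry: {
      count: 7200,
      delay: 500
    }
  }
}
```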