Hi Francois,

More questions..

> + Can you share the query profile?
> I will sum it up:
> It is a select on 18 columns: 9 string, 9 integers.
> Scan is done on 13862 parquet files spread on 1000 fragments.
> Fragments are spread across 215 nodes.

So ~5 leaf fragments (or scanners) per Drillbit seems fine.

+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?
+ Any suspicious timings in the query profile?
+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one
(".." are placeholders):

Message of mode .. of rpc type .. took longer than ..ms. Actual duration
was ..ms.
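For reference, one quick way to hunt for that warning is to grep the
drillbit logs on every node around the failure time. A minimal sketch,
assuming a hypothetical host list file and logs under /var/log/drill
(tarball installs keep them under $DRILL_HOME/log instead):

  # hypothetical paths and host list -- adjust to your deployment
  for h in $(cat drill-hosts.txt); do
    ssh "$h" 'grep -H "took longer than" /var/log/drill/drillbit.log*'
  done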
Thank you,
Sudheesh

> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:
>
> Hi Sudheesh,
>
> + How many zookeeper servers in the quorum?
> The quorum has 3 servers, everything looks healthy.
>
> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> applications running on that node? How many threads is the Drill process
> using?
> The load on the failing node (8 cores) is 14 when Drill is running, which
> is nothing out of the ordinary according to our admin.
> HBase is also running.
> planner.width.max_per_node is set to 8
>
> + When running the same query on 12 nodes, is the data size same?
> Yes
>
> + Can you share the query profile?
> I will sum it up:
> It is a select on 18 columns: 9 string, 9 integers.
> Scan is done on 13862 parquet files spread on 1000 fragments.
> Fragments are spread across 215 nodes.
>
> We are in process of increasing our Zookeeper session timeout config to
> see if it helps.
>
> thanks
>
> Francois
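On the session-timeout change François mentions above: on the Drill side
this is the drill.exec.zk.timeout setting (in milliseconds) in
drill-override.conf on each drillbit. A minimal sketch with illustrative
values -- the 30000 is an assumption, not a recommendation, and the
effective timeout is still capped by the ZooKeeper servers (see the note
at the end of this thread):

  # drill-override.conf on every drillbit -- illustrative values
  drill.exec: {
    zk: {
      connect: "zk1:2181,zk2:2181,zk3:2181"  # your quorum
      timeout: 30000  # requested session timeout in ms (stock default is 5000)
    }
  }

The drillbits need a restart for this to take effect.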
> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com>
> wrote:
>
>> Hi Francois,
>>
>> Few questions:
>> + How many zookeeper servers in the quorum?
>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>> applications running on that node? How many threads is the Drill process
>> using?
>> + When running the same query on 12 nodes, is the data size same?
>> + Can you share the query profile?
>>
>> This may not be the right thing to do, but for now, if the cluster is
>> heavily loaded, increase the zk timeout.
>>
>> Thank you,
>> Sudheesh
>>
>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com>
>>> wrote:
>>>
>>> We are running 1.7.
>>> The logs were taken from the jira tickets.
>>>
>>> We will try out 1.8 soon.
>>>
>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
>>>
>>>> Looks like you are running 1.5. I believe there was some work done in
>>>> that area and the newer releases should behave better.
>>>>
>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are trying to find a solution/workaround to this issue:
>>>>>
>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
>>>>> One more more nodes lost connectivity during query. Identified nodes
>>>>> were [atsqa4-133.qa.lab:31010].
>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>>>> ForemanException: One more more nodes lost connectivity during query.
>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>
>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
>>>>> ForemanException: One or more nodes lost connectivity during query
>>>>>
>>>>> Has anyone experienced this issue?
>>>>>
>>>>> It happens when running a query involving many parquet files on a
>>>>> cluster of 200 nodes. The same query on a smaller cluster of 12 nodes
>>>>> runs fine.
>>>>>
>>>>> It is not caused by garbage collection (checked on both the ZK node
>>>>> and the involved drillbit).
>>>>>
>>>>> Negotiated max session timeout is 40 seconds.
>>>>>
>>>>> The sequence seems to be:
>>>>> - Drill query begins, using an existing ZK session.
>>>>> - The Drill ZK session times out
>>>>>   - perhaps it was writing something that took too long
>>>>> - Drill attempts to renew the session
>>>>>   - Drill believes that the write operation failed, so it attempts
>>>>>     to re-create the zk node, which triggers another exception.
>>>>>
>>>>> We are open to any suggestion. We will report any finding.
>>>>>
>>>>> Thanks
>>>>> Francois
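A closing note on the "negotiated max session timeout is 40 seconds" point
above: ZooKeeper clamps whatever timeout a client requests to the server-side
window [minSessionTimeout, maxSessionTimeout], and maxSessionTimeout defaults
to 20 * tickTime. With the common tickTime of 2000 ms that is exactly the
40 s ceiling observed, so raising the Drill-side timeout beyond 40 s also
requires raising the ceiling in zoo.cfg on every ZooKeeper server. A sketch
with illustrative values:

  # zoo.cfg on each ZooKeeper server -- illustrative values
  tickTime=2000
  # maxSessionTimeout defaults to 20 * tickTime = 40000 ms,
  # i.e. the 40 s ceiling reported above
  maxSessionTimeout=120000

The ZooKeeper servers need a (rolling) restart before clients can negotiate
the larger timeout.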