One more interesting thing and another guess to resolve the problem, > P.S.: > We do see this also: > 2016-09-19 14:48:23,444 [drill-executor-9] WARN > o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the > work bus. > 2016-09-19 14:48:23,444 [drill-executor-11] WARN > o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the > work bus. > 2016-09-19 14:48:23,444 [drill-executor-12] WARN > o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the > work bus. > 2016-09-19 14:48:23,444 [drill-executor-10] WARN > o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the > work bus.
+ Are there 220 nodes in your cluster? If that is the case and these fragments are scanners, they seem be be assigned in a round robin fashion (and not based on data locality?). If so, please upgrade to the latest version where DRILL-4446 [1] is resolved. Thank you, Sudheesh [1] https://issues.apache.org/jira/browse/DRILL-4446 <https://issues.apache.org/jira/browse/DRILL-4446> > On Sep 19, 2016, at 10:19 AM, Sudheesh Katkam <skat...@maprtech.com> wrote: > > Hi Francois, > > A simple query with only projections is not an “ideal” use case, since Drill > is bound by how fast the client can consume records. There are 1000 scanners > sending data to 1 client (vs far fewer scanners sending data in the 12 node > case). > > This might increase the load on the Drillbit that is the Foreman for the > query. In the query profile, the scanners should be spending a lot more time > “waiting” to send records to the client (via root fragment). > + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There > is also a bizarre case where the node that is reported as lost is the node > itself. > + Is there a spike in memory usage of the Drillbit this is the Foreman for > the query (process memory, not just heap)? > > Regarding the warnings ... > >> 2016-09-19 13:31:56,866 [BitServer-7] WARN >> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type >> 6 took longer than 500 ms. Actual Duration was 16053ms. > > > RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this > case, which is resolved in the latest version of Drill. So as Chun suggested, > upgrade the cluster to the latest version of Drill. > >> 2016-09-19 14:15:33,357 [BitServer-4] WARN >> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type >> 1 took longer than 500 ms. Actual Duration was 981ms. > > I am surprised that responses are taking that long to handle. > + Are both messages on the same Drillbit? > > The other warnings can be ignored. > > Thank you, > Sudheesh > > [1] I just realized that atsqa4-133.qa.lab is in one of our test environments > :) > [2] https://issues.apache.org/jira/browse/DRILL-4766 > <https://issues.apache.org/jira/browse/DRILL-4766> > >> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com >> <mailto:fmetho...@gmail.com>> wrote: >> >> Hi Sudheesh, >> >> >> + Does the query involve any aggregations or filters? Or is this a select >> query with only projections? >> Simple query with only projections >> >> + Any suspicious timings in the query profile? >> Nothing specially different than our working query on our small cluster. >> >> + Any suspicious warning messages in the logs around the time of failure on >> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one >> (“..” are place holders): >> Message of mode .. of rpc type .. took longer than ..ms. Actual duration >> was ..ms. >> >> Well we do see this warning on the failing node (on my last test), I found >> this WARNING in our log for the past month for pretty much every node I >> checked. >> 2016-09-19 13:31:56,866 [BitServer-7] WARN >> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type >> 6 took longer than 500 ms. Actual Duration was 16053ms. >> 2016-09-19 14:15:33,357 [BitServer-4] WARN >> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type >> 1 took longer than 500 ms. Actual Duration was 981ms. >> >> We really appreciate your help. I will dig in the source code for when and >> why this error happen. >> >> >> Francois >> >> P.S.: >> We do see this also: >> 2016-09-19 14:48:23,444 [drill-executor-9] WARN >> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the >> work bus. >> 2016-09-19 14:48:23,444 [drill-executor-11] WARN >> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the >> work bus. >> 2016-09-19 14:48:23,444 [drill-executor-12] WARN >> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the >> work bus. >> 2016-09-19 14:48:23,444 [drill-executor-10] WARN >> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the >> work bus. >> >> >> >> >> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com >> <mailto:skat...@maprtech.com>> >> wrote: >> >>> Hi Francois, >>> >>> More questions.. >>> >>>> + Can you share the query profile? >>>> I will sum it up: >>>> It is a select on 18 columns: 9 string, 9 integers. >>>> Scan is done on 13862 parquet files spread on 1000 fragments. >>>> Fragments are spread accross 215 nodes. >>> >>> So ~5 leaf fragments (or scanners) per Drillbit seems fine. >>> >>> + Does the query involve any aggregations or filters? Or is this a select >>> query with only projections? >>> + Any suspicious timings in the query profile? >>> + Any suspicious warning messages in the logs around the time of failure >>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one >>> (“..” are place holders): >>> Message of mode .. of rpc type .. took longer than ..ms. Actual >>> duration was ..ms. >>> >>> Thank you, >>> Sudheesh >>> >>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com >>>> <mailto:fmetho...@gmail.com>> >>> wrote: >>>> >>>> Hi Sudheesh, >>>> >>>> + How many zookeeper servers in the quorum? >>>> The quorum has 3 servers, everything looks healthy >>>> >>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other >>>> applications running on that node? How many threads is the Drill process >>>> using? >>>> The load on the failing node(8 cores) is 14, when Drill is running. Which >>>> is nothing out of the ordinary according to our admin. >>>> HBase is also running. >>>> planner.width.max_per_node is set to 8 >>>> >>>> + When running the same query on 12 nodes, is the data size same? >>>> Yes >>>> >>>> + Can you share the query profile? >>>> I will sum it up: >>>> It is a select on 18 columns: 9 string, 9 integers. >>>> Scan is done on 13862 parquet files spread on 1000 fragments. >>>> Fragments are spread accross 215 nodes. >>>> >>>> >>>> We are in process of increasing our Zookeeper session timeout config to >>> see >>>> if it helps. >>>> >>>> thanks >>>> >>>> Francois >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com >>>> <mailto:skat...@maprtech.com>> >>>> wrote: >>>> >>>>> Hi Francois, >>>>> >>>>> Few questions: >>>>> + How many zookeeper servers in the quorum? >>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other >>>>> applications running on that node? How many threads is the Drill process >>>>> using? >>>>> + When running the same query on 12 nodes, is the data size same? >>>>> + Can you share the query profile? >>>>> >>>>> This may not be the right thing to do, but for now, If the cluster is >>>>> heavily loaded, increase the zk timeout. >>>>> >>>>> Thank you, >>>>> Sudheesh >>>>> >>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com >>>>>> <mailto:fmetho...@gmail.com>> >>>>> wrote: >>>>>> >>>>>> We are running 1.7. >>>>>> The log were taken from the jira tickets. >>>>>> >>>>>> We will try out 1.8 soon. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com >>>>>> <mailto:cch...@maprtech.com>> >>> wrote: >>>>>> >>>>>>> Looks like you are running 1.5. I believe there are some work done in >>>>> that >>>>>>> area and the newer release should behave better. >>>>>>> >>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot < >>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We are trying to find a solution/workaround to issue: >>>>>>>> >>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR >>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: >>> ForemanException: >>>>>>>> One more more nodes lost connectivity during query. Identified nodes >>>>>>>> were [atsqa4-133.qa.lab:31010]. >>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: >>>>>>>> ForemanException: One more more nodes lost connectivity during query. >>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010]. >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$ForemanResult. >>>>>>>> close(Foreman.java:746) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch. >>>>>>>> processEvent(Foreman.java:858) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch. >>>>>>>> processEvent(Foreman.java:790) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch. >>>>>>>> moveToState(Foreman.java:792) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.moveToState( >>>>>>>> Foreman.java:909) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.access$2700( >>>>>>>> Foreman.java:110) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateListener. >>>>>>>> moveToState(Foreman.java:1183) >>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] >>>>>>>> >>>>>>>> >>>>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325 >>>>>>>> <https://issues.apache.org/jira/browse/DRILL-4325>> >>>>>>>> ForemanException: >>>>>>>> One or more nodes lost connectivity during query >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Any one experienced this issue ? >>>>>>>> >>>>>>>> It happens when running query involving many parquet files on a >>> cluster >>>>>>> of >>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine. >>>>>>>> >>>>>>>> It is not caused by garbage collection, (checked on both ZK node and >>>>> the >>>>>>>> involved drill bit). >>>>>>>> >>>>>>>> Negotiated max session timeout is 40 seconds. >>>>>>>> >>>>>>>> The sequence seems: >>>>>>>> - Drill Query begins, using an existing ZK session. >>>>>>>> - Drill Zk session timeouts >>>>>>>> - perhaps it was writing something that took too long >>>>>>>> - Drill attempts to renew session >>>>>>>> - drill believes that the write operation failed, so it attempts >>>>>>> to >>>>>>>> re-create the zk node, which trigger another exception. >>>>>>>> >>>>>>>> We are open to any suggestion. We will report any finding. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Francois >>>>>>>> >>>>>>> >>>>> >>>>> >>> >>> >