Hi Francois,

Thank you for the picture and the info you provided.
We will keep you updated and let you know when we make changes in a future
release.

Thanks,
Padma


> On Oct 26, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:
> 
> Hi,
> 
> Sorry it took so long; I lost the original picture and had to go to an MS
> Paint training. Here we go:
> 
> https://github.com/fmethot/imagine/blob/master/affinity_factor.png
> 
> 
> 
> On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam <skat...@maprtech.com>
> wrote:
> 
>> The mailing list does not seem to allow for images. Can you put the image
>> elsewhere (Github or Dropbox), and reply with a link to it?
>> 
>> - Sudheesh
>> 
>>> On Oct 19, 2016, at 5:37 PM, François Méthot <fmetho...@gmail.com>
>> wrote:
>>> 
>>> We had the problem on the 220-node cluster. No problem on the 12-node
>>> cluster.
>>> 
>>> I agree that the data may not be distributed evenly. It would be a long
>> and tedious process for me to produce a report.
>>> 
>>> Here is a drawing of the fragments overview before and after the change of
>>> the affinity factor, on a sample query run on the 220-node cluster.
>>> max_width_per_node=8 on both, but it turned out to be irrelevant to the
>>> issue.
>>> 
>>> Before: SYSTEM ERROR: ForemanException: One more more nodes lost
>> connectivity during query.  Identified nodes were [server121:31010].
>>> 
>>> After: error is gone
>>> 
>>> Before: low disk IO, high network IO on the bottom part of the graph
>>> After: high disk IO, low network IO on the bottom part of the graph
>>> 
>>> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <
>> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>> wrote:
>>> Hi Francois,
>>> 
>>> It would be good to understand how increasing affinity_factor helped in
>>> your case, so we can document it better and also use that knowledge to
>>> improve things in a future release.
>>> 
>>> Since you have two clusters, it is not clear whether you had the problem
>>> on the 12-node cluster, the 220-node cluster, or both. Is the dataset the
>>> same on both? Is max_width_per_node=8 in both clusters?
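>>>
>>> For reference, the current values on each cluster can be checked with a
>>> query along these lines, so the two clusters can be compared directly:
>>>
>>> SELECT * FROM sys.options
>>> WHERE name IN ('planner.width.max_per_node', 'planner.affinity_factor');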
>>> 
>>> Increasing the affinity factor will lower remote reads by scheduling more
>>> fragments (doing more work) on nodes which have the data available locally.
>>> So, there seems to be some kind of non-uniform data distribution for sure.
>>> It would be good if you could provide more details, i.e. how the data is
>>> distributed in the cluster and how the load on the nodes changed when the
>>> affinity factor was increased.
>>> 
>>> Thanks,
>>> Padma
>>> 
>>> 
>>>> On Oct 14, 2016, at 6:45 PM, François Méthot <fmetho...@gmail.com
>> <mailto:fmetho...@gmail.com>> wrote:
>>>> 
>>>> We have a 12-node cluster and a 220-node cluster, but they do not talk to
>>>> each other, so Padma's analysis does not apply, but thanks for your
>>>> comments. Our goal had been to run Drill on the 220-node cluster after it
>>>> proved itself on the small cluster.
>>>> 
>>>> planner.width.max_per_node was eventually reduced to 2 when we were trying
>>>> to figure this out, but queries would still fail. After we figured out
>>>> affinity_factor, we put it back to its original value and everything would
>>>> work fine.
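>>>>
>>>> For reference, a sketch of how such a change can be applied from sqlline
>>>> (session scope shown; ALTER SYSTEM would apply it cluster-wide):
>>>>
>>>> ALTER SESSION SET `planner.width.max_per_node` = 2;
>>>> -- later restored to the original value:
>>>> ALTER SESSION SET `planner.width.max_per_node` = 8;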
>>>> 
>>>> 
>>>> 
>>>> Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger
>>>> cluster.
>>>> 
>>>> Potential improvements:
>>>> - planner.affinity_factor should be better documented.
>>>> - When ZK disconnected, the running queries systematically failed. When we
>>>>   disabled the ForemanException thrown in the
>>>>   QueryManager.drillbitUnregistered method, most of our queries started to
>>>>   run successfully; we would sometimes get a Drillbit Disconnected error
>>>>   within the RPC work bus. It confirmed that we still had something going
>>>>   on with our network, but it also showed that the RPC bus between
>>>>   drillbits is more resilient to network hiccups. I could not prove it,
>>>>   but I think that under certain conditions the ZK session gets recreated,
>>>>   which causes a QueryManager unregister (query fails) and a register call
>>>>   right after, while the RPC bus remains connected.
>>>> 
>>>> 
>>>> We really appreciate your feedback and we hope to contribute to this
>> great
>>>> project in the future.
>>>> Thanks
>>>> Francois
>>>> 
>>>> On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
>> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>>
>>>> wrote:
>>>> 
>>>>> 
>>>>> Seems like you have 215 nodes, but the data for your query is on only 12
>>>>> nodes. Drill tries to distribute the scan fragments across the cluster
>>>>> more uniformly (trying to utilize all CPU resources). That is why you have
>>>>> a lot of remote reads going on, and increasing the affinity factor
>>>>> eliminates running scan fragments on the other (215-12) nodes.
>>>>> 
>>>>> You also mentioned planner.width.max_per_node is set to 8. So, with the
>>>>> increased affinity factor, you have 8 scan fragments doing a lot more work
>>>>> on these 12 nodes. Still, you got a 10X improvement. Seems like your
>>>>> network is the obvious bottleneck. Is it 10G or 1G?
>>>>> 
>>>>> Also, increasing the affinity factor helped in your case because there is
>>>>> no data on the other nodes. But if you have data distributed non-uniformly
>>>>> across more nodes, you might still have the problem.
>>>>> 
>>>>> Thanks,
>>>>> Padma
>>>>> 
>>>>>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com
>> <mailto:skat...@maprtech.com>>
>>>>> wrote:
>>>>>> 
>>>>>> Hi Francois,
>>>>>> 
>>>>>> Thank you for posting your findings! Glad to see a 10X improvement.
>>>>>> 
>>>>>> By increasing the affinity factor, it looks like Drill's parallelizer is
>>>>>> forced to assign fragments to nodes with data, i.e. with high
>>>>>> favorability for data locality.
>>>>>> 
>>>>>> Regarding the random disconnection, I agree with your guess that the
>>>>>> network bandwidth is being used up by remote reads, which causes lags in
>>>>>> drillbit-to-ZooKeeper heartbeats (since these services use the same
>>>>>> network). Maybe others can comment here.
>>>>>> 
>>>>>> Thank you,
>>>>>> Sudheesh
>>>>>> 
>>>>>>> On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com
>> <mailto:fmetho...@gmail.com>>
>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> We finally got rid of this error. We had tried many, many things (like
>>>>>>> modifying Drill to ignore the error!), but it ultimately came down to
>>>>>>> this change:
>>>>>>> 
>>>>>>> from default
>>>>>>> planner.affinity_factor=1.2
>>>>>>> to
>>>>>>> planner.affinity_factor=100
>>>>>>> 
>>>>>>> Basically this encourages fragments to only care about locally stored
>>>>>>> files. We looked at the code that uses that property and figured that
>>>>>>> 100 would have a strong impact.
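>>>>>>>
>>>>>>> For reference, a sketch of the statements involved; 100 is simply the
>>>>>>> value we settled on after reading the code, not a carefully tuned number:
>>>>>>>
>>>>>>> ALTER SYSTEM SET `planner.affinity_factor` = 100;
>>>>>>> -- reverting to the default:
>>>>>>> ALTER SYSTEM SET `planner.affinity_factor` = 1.2;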
>>>>>>> 
>>>>>>> What led us to this property is the fact that 1/4 of our fragments would
>>>>>>> take a lot more time to complete their scan, up to 10x the time of the
>>>>>>> fastest nodes. On the slower nodes, Cloudera Manager would show very low
>>>>>>> disk IOPS with high network IO compared to our faster nodes. We had
>>>>>>> noticed that before, but figured it would be an optimization to be done
>>>>>>> later, once more pressing issues were fixed, like ZK disconnections and
>>>>>>> OOM. We were desperate and decided to fix anything that looked unusual.
>>>>>>> 
>>>>>>> After this change, queries ran up to 10x faster.
>>>>>>> 
>>>>>>> We no longer get random disconnections between nodes and ZooKeeper.
>>>>>>> 
>>>>>>> We are still wondering why exactly. The network should not be a
>>>>>>> bottleneck. Could high network traffic between a Drillbit and HDFS cause
>>>>>>> the Drillbit to time out with ZooKeeper?
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <
>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> After the 30-second gap, all the Drill nodes receive the following:
>>>>>>>> 
>>>>>>>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active
>> drillbit
>>>>> set
>>>>>>>> changed. Now includes 220 total bits. New Active drill bits....
>>>>>>>> ...faulty node is not on the list...
>>>>>>>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active
>> drillbit
>>>>> set
>>>>>>>> changed. Now includes 221 total bits. New Active drill bits....
>>>>>>>> ...faulty node is back on the list...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> So the faulty Drill node gets unregistered and registered right after.
>>>>>>>> 
>>>>>>>> Drill is using the low-level API for registering and unregistering, and
>>>>>>>> the only place where unregistering occurs is when the drillbit is closed
>>>>>>>> at shutdown.
>>>>>>>> 
>>>>>>>> That particular drillbit is still up and running after those logs, so it
>>>>>>>> could not have triggered the unregistering process through a shutdown.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Would you have an idea what else could cause a Drillbit to be
>>>>> unregistered
>>>>>>>> from the DiscoveryService and registered again right after?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> We are using Zookeeper 3.4.5
>>>>>>>> 
>>>>>>>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <
>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>
>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> We have switched to 1.8 and we are still getting node disconnections.
>>>>>>>>> 
>>>>>>>>> We did many tests. We initially thought our standalone Parquet
>>>>>>>>> converter was generating Parquet files with problematic data (like
>>>>>>>>> 10K-character strings), but we were able to reproduce it with the
>>>>>>>>> employee data from the tutorial.
>>>>>>>>> 
>>>>>>>>> For example, we duplicated the Drill tutorial "Employee" data to reach
>>>>>>>>> 500 M records spread over 130 Parquet files. Each file is ~60 MB.
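>>>>>>>>>
>>>>>>>>> A quick sanity check on the duplicated data is a plain row count, e.g.:
>>>>>>>>>
>>>>>>>>> SELECT COUNT(*) FROM hdfs.tmp.`PARQUET_EMPLOYEE`;
>>>>>>>>>
>>>>>>>>> which should come back at roughly 500 M rows.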
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> We ran this query over and over on 5 different sessions using a script:
>>>>>>>>>
>>>>>>>>> select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like
>>>>>>>>> '%does not exist%';
>>>>>>>>> 
>>>>>>>>> The query returns no rows and would take ~35 to 45 seconds to complete.
>>>>>>>>> 
>>>>>>>>> Leaving the script running on each node, we eventually hit the
>> "nodes
>>>>>>>>> lost connectivity during query" error.
>>>>>>>>> 
>>>>>>>>> On the node that failed, we see these logs:
>>>>>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
>>>>>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to
>>>>> report:
>>>>>>>>> RUNNING
>>>>>>>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
>>>>>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on
>> server064:31010
>>>>>>>>> 
>>>>>>>>> <--- 30 seconds gap for that fragment --->
>>>>>>>>> 
>>>>>>>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control.
>>>>> ControlServer
>>>>>>>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms.
>>>>> Actual
>>>>>>>>> duration was 23617ms.
>>>>>>>>> 
>>>>>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied
>> out
>>>>> of 0,
>>>>>>>>> remaining: 0 incoming schema BatchSchema [....,
>>>>> selectionVector=TWO_BYTE]
>>>>>>>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing
>>>>> fragment
>>>>>>>>> manager : ...uuid...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> For the same query on a working node:
>>>>>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
>>>>>>>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to
>> report:
>>>>>>>>> RUNNING
>>>>>>>>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
>>>>>>>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on
>> server125:31010
>>>>>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied
>> out
>>>>> of 0,
>>>>>>>>> remaining: 0 incoming schema BatchSchema [....,
>>>>> selectionVector=TWO_BYTE]
>>>>>>>>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>>>>>>>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied
>> out
>>>>> of 0,
>>>>>>>>> remaining: 0 incoming schema BatchSchema [....,
>>>>> selectionVector=TWO_BYTE]
>>>>>>>>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
>>>>>>>>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row
>>>>>>>>> groups(0) in file `/data/drill/tmp/PARQUET_
>> EMPLOYEE/0_0_14.parquet
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> We are investigating what could cause that 30-second gap for that
>>>>>>>>> fragment.
>>>>>>>>> 
>>>>>>>>> If you have any ideas, let us know.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Francois
>>>>>>>>> 
>>>>>>>>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <
>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Sudheesh,
>>>>>>>>>> 
>>>>>>>>>> If I add a selection filter so that no rows are returned, the same
>>>>>>>>>> problem occurs. I also simplified the query to include only a few
>>>>>>>>>> integer columns.
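>>>>>>>>>>
>>>>>>>>>> For example, something along these lines (column names and path are
>>>>>>>>>> placeholders, not our real schema):
>>>>>>>>>>
>>>>>>>>>> SELECT int_col_a, int_col_b
>>>>>>>>>> FROM hdfs.`/path/to/data_repo`
>>>>>>>>>> WHERE int_col_a = -1;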
>>>>>>>>>> 
>>>>>>>>>> That particular data repo is 200+ billion records spread over ~50,000
>>>>>>>>>> Parquet files.
>>>>>>>>>> 
>>>>>>>>>> We have other CSV data repos that are 100x smaller that do not trigger
>>>>>>>>>> this issue.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>>>>>>>>>> There is also a bizarre case where the node that is reported as lost is
>>>>>>>>>> the node itself.
>>>>>>>>>> Yes, the stack trace is from the ticket. It did occur once or twice (in
>>>>>>>>>> the many, many attempts) that it was the node itself.
>>>>>>>>>> 
>>>>>>>>>> + Is there a spike in memory usage of the Drillbit that is the Foreman
>>>>>>>>>> for the query (process memory, not just heap)?
>>>>>>>>>> We don't notice any unusual spike; each node gets busy in the same
>>>>>>>>>> range when a query is running.
>>>>>>>>>> 
>>>>>>>>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap and did not
>>>>>>>>>> see any improvement.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> We will update from 1.7 to 1.8 before going ahead with more
>>>>>>>>>> investigation.
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <
>>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Francois,
>>>>>>>>>>> 
>>>>>>>>>>> A simple query with only projections is not an “ideal” use case, since
>>>>>>>>>>> Drill is bound by how fast the client can consume records. There are
>>>>>>>>>>> 1000 scanners sending data to 1 client (vs. far fewer scanners sending
>>>>>>>>>>> data in the 12-node case).
>>>>>>>>>>> 
>>>>>>>>>>> This might increase the load on the Drillbit that is the Foreman for
>>>>>>>>>>> the query. In the query profile, the scanners should be spending a lot
>>>>>>>>>>> more time “waiting” to send records to the client (via the root
>>>>>>>>>>> fragment).
>>>>>>>>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>>>>>>>>>>> There is also a bizarre case where the node that is reported as lost is
>>>>>>>>>>> the node itself.
>>>>>>>>>>> + Is there a spike in memory usage of the Drillbit that is the Foreman
>>>>>>>>>>> for the query (process memory, not just heap)?
>>>>>>>>>>> 
>>>>>>>>>>> Regarding the warnings ...
>>>>>>>>>>> 
>>>>>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode
>> REQUEST of
>>>>>>>>>>> rpc type
>>>>>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> RPC type 6 is a cancellation request; DRILL-4766 [2], which is resolved
>>>>>>>>>>> in the latest version of Drill, should help in this case. So, as Chun
>>>>>>>>>>> suggested, upgrade the cluster to the latest version of Drill.
>>>>>>>>>>> 
>>>>>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode
>> RESPONSE of
>>>>>>>>>>> rpc type
>>>>>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
>>>>>>>>>>> 
>>>>>>>>>>> I am surprised that responses are taking that long to handle.
>>>>>>>>>>> + Are both messages on the same Drillbit?
>>>>>>>>>>> 
>>>>>>>>>>> The other warnings can be ignored.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Sudheesh
>>>>>>>>>>> 
>>>>>>>>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test
>>>>>>>>>>> environments :)
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766
>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot <
>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Sudheesh,
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> + Does the query involve any aggregations or filters? Or is this a
>>>>>>>>>>>> select query with only projections?
>>>>>>>>>>>> Simple query with only projections.
>>>>>>>>>>>> 
>>>>>>>>>>>> + Any suspicious timings in the query profile?
>>>>>>>>>>>> Nothing especially different from our working queries on our small
>>>>>>>>>>>> cluster.
>>>>>>>>>>>> 
>>>>>>>>>>>> + Any suspicious warning messages in the logs around the time of
>>>>>>>>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
>>>>>>>>>>>> Especially this one (“..” are placeholders):
>>>>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
>>>>>>>>>>>> duration was ..ms.
>>>>>>>>>>>> 
>>>>>>>>>>>> Well, we do see this warning on the failing node (on my last test); I
>>>>>>>>>>>> found this WARNING in our logs for the past month on pretty much every
>>>>>>>>>>>> node I checked.
>>>>>>>>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode
>> REQUEST of
>>>>>>>>>>> rpc type
>>>>>>>>>>>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>>>>>>>>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode
>> RESPONSE of
>>>>>>>>>>> rpc type
>>>>>>>>>>>> 1 took longer than 500 ms. Actual Duration was 981ms.
>>>>>>>>>>>> 
>>>>>>>>>>>> We really appreciate your help. I will dig into the source code to
>>>>>>>>>>>> find when and why this error happens.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Francois
>>>>>>>>>>>> 
>>>>>>>>>>>> P.S.:
>>>>>>>>>>>> We do see this also:
>>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not
>>>>> found
>>>>>>>>>>> in the
>>>>>>>>>>>> work bus.
>>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not
>>>>> found
>>>>>>>>>>> in the
>>>>>>>>>>>> work bus.
>>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not
>>>>> found
>>>>>>>>>>> in the
>>>>>>>>>>>> work bus.
>>>>>>>>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
>>>>>>>>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not
>>>>> found
>>>>>>>>>>> in the
>>>>>>>>>>>> work bus.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <
>>>>>>>>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Francois,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> More questions..
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + Can you share the query profile?
>>>>>>>>>>>>>> I will sum it up:
>>>>>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>>>>>>>>>>> Scan is done on 13862 parquet files spread over 1000 fragments.
>>>>>>>>>>>>> Fragments are spread across 215 nodes.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> + Does the query involve any aggregations or filters? Or is
>> this a
>>>>>>>>>>> select
>>>>>>>>>>>>> query with only projections?
>>>>>>>>>>>>> + Any suspicious timings in the query profile?
>>>>>>>>>>>>> + Any suspicious warning messages in the logs around the time of
>>>>>>>>>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
>>>>>>>>>>>>> Especially this one (“..” are placeholders):
>>>>>>>>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
>>>>>>>>>>>>> duration was ..ms.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you,
>>>>>>>>>>>>> Sudheesh
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <
>>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Sudheesh,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + How many zookeeper servers in the quorum?
>>>>>>>>>>>>>> The quorum has 3 servers, everything looks healthy
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens?
>> Any
>>>>> other
>>>>>>>>>>>>>> applications running on that node? How many threads is the
>> Drill
>>>>>>>>>>> process
>>>>>>>>>>>>>> using?
>>>>>>>>>>>>>> The load on the failing node (8 cores) is 14 when Drill is running,
>>>>>>>>>>>>>> which is nothing out of the ordinary according to our admin.
>>>>>>>>>>>>>> HBase is also running.
>>>>>>>>>>>>>> planner.width.max_per_node is set to 8.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + When running the same query on 12 nodes, is the data size the
>>>>>>>>>>>>>> same?
>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + Can you share the query profile?
>>>>>>>>>>>>>> I will sum it up:
>>>>>>>>>>>>>> It is a select on 18 columns: 9 string, 9 integers.
>>>>>>>>>>>>>> Scan is done on 13862 parquet files spread over 1000 fragments.
>>>>>>>>>>>>>> Fragments are spread across 215 nodes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We are in the process of increasing our ZooKeeper session timeout
>>>>>>>>>>>>>> config to see if it helps.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Francois
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <
>>>>>>>>>>> skat...@maprtech.com <mailto:skat...@maprtech.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Francois,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Few questions:
>>>>>>>>>>>>>>> + How many zookeeper servers in the quorum?
>>>>>>>>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens?
>> Any
>>>>>>>>>>> other
>>>>>>>>>>>>>>> applications running on that node? How many threads is the
>> Drill
>>>>>>>>>>> process
>>>>>>>>>>>>>>> using?
>>>>>>>>>>>>>>> + When running the same query on 12 nodes, is the data size the
>>>>>>>>>>>>>>> same?
>>>>>>>>>>>>>>> + Can you share the query profile?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This may not be the right thing to do, but for now, if the cluster
>>>>>>>>>>>>>>> is heavily loaded, increase the ZK timeout.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you,
>>>>>>>>>>>>>>> Sudheesh
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <
>>>>>>>>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We are running 1.7.
>>>>>>>>>>>>>>>> The logs were taken from the JIRA ticket.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We will try out 1.8 soon.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <
>>>>> cch...@maprtech.com <mailto:cch...@maprtech.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Looks like you are running 1.5. I believe there was some work done
>>>>>>>>>>>>>>>>> in that area, and the newer releases should behave better.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <
>>>>>>>>>>>>> fmetho...@gmail.com <mailto:fmetho...@gmail.com>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We are trying to find a solution/workaround to this issue:
>>>>>>>>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>>>>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>>   at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
>>>>>>>>>>>>>>>>>> ForemanException: One or more nodes lost connectivity during query
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Has anyone experienced this issue?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It happens when running a query involving many parquet files on a
>>>>>>>>>>>>>>>>>> cluster of 200 nodes. The same query on a smaller cluster of 12
>>>>>>>>>>>>>>>>>> nodes runs fine.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It is not caused by garbage collection (checked on both the ZK
>>>>>>>>>>>>>>>>>> node and the involved drillbit).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Negotiated max session timeout is 40 seconds.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The sequence seems to be:
>>>>>>>>>>>>>>>>>> - A Drill query begins, using an existing ZK session.
>>>>>>>>>>>>>>>>>> - The Drill ZK session times out (perhaps it was writing something
>>>>>>>>>>>>>>>>>>   that took too long).
>>>>>>>>>>>>>>>>>> - Drill attempts to renew the session.
>>>>>>>>>>>>>>>>>> - Drill believes that the write operation failed, so it attempts
>>>>>>>>>>>>>>>>>>   to re-create the ZK node, which triggers another exception.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We are open to any suggestions. We will report any findings.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Francois
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>> 
