Re: ZK lost connectivity issue on large cluster

2016-10-28 Thread Padma Penumarthy
Hi Francois,

Thank you for the picture and the info you provided.
We will keep you updated and let you know when we make changes in a future
release.

Thanks,
Padma


> On Oct 26, 2016, at 6:06 PM, François Méthot  wrote:
> 
> Hi,
> 
> Sorry it took so long; I lost the original picture and had to go to an MS Paint
> training. Here we go:
> 
> https://github.com/fmethot/imagine/blob/master/affinity_factor.png
> 
> 
> 
> On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam 
> wrote:
> 
>> The mailing list does not seem to allow for images. Can you put the image
>> elsewhere (Github or Dropbox), and reply with a link to it?
>> 
>> - Sudheesh
>> 
>>> On Oct 19, 2016, at 5:37 PM, François Méthot 
>> wrote:
>>> 
>>> We had problem on the 220 nodes cluster. No problem on the 12 nodes
>> cluster.
>>> 
>>> I agree that the data may not be distributed evenly. It would be a long
>> and tedious process for me to produce a report.
>>> 
>>> Here is a drawing of the fragments overview before and after the
>> change of the affinity factor on a sample query run on the 220 nodes
>> cluster.  max_width_per_node=8 on both, but it turned out to be irrelevant
>> to the issue.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Before: SYSTEM ERROR: ForemanException: One or more nodes lost
>> connectivity during query.  Identified nodes were [server121:31010].
>>> 
>>> After: error is gone
>>> 
>>> Before: low disk io, high network io on the bottom part of the graph
>>> after : high disk io, low network io on the bottom part of the graph
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <
>> ppenumar...@maprtech.com > wrote:
>>> Hi Francois,
>>> 
>>> It would be good to understand how increasing affinity_factor helped in
>> your case
>>> so we can better document and also use that knowledge to improve things
>> in future release.
>>> 
>>> If you have two clusters,  it is not clear whether you had the problem
>> on 12 node cluster
>>> or 220 node cluster or both. Is the dataset same on both ? Is
>> max_width_per_node=8 in both clusters ?
>>> 
>>> Increasing affinity factor will lower remote reads  by scheduling more
>> fragments/doing more work
>>> on nodes which have data available locally.  So, there seem to be some
>> kind of non uniform
>>> data distribution for sure. It would be good if you can provide more
>> details i.e. how the data is
>>> distributed in the cluster and how the load on the nodes changed when
>> affinity factor was increased.
>>> 
>>> Thanks,
>>> Padma
>>> 
>>> 
 On Oct 14, 2016, at 6:45 PM, François Méthot > > wrote:
 
 We have  a 12 nodes cluster and a 220 nodes cluster, but they do not
>> talk
 to each other. So Padma's analysis do not apply but thanks for your
 comments. Our goal had been to run Drill on the 220 nodes cluster
>> after it
 proved worthy of it on the small cluster.
 
 planner.width.max_per_node was eventually reduced to 2 when we were
>> trying
 to figure this out, it would still fail. After we figured out the
 affinity_factor, we put it back to its original value and it would work
 fine.
 
 
 
 Sudheesh: Indeed, The Zk/drill services use the same network on our
>> bigger
 cluster.
 
 potential improvements:
 - planner.affinity_factor should be better documented.
 - When ZK disconnected, the running queries systematically failed.
>> When we
 disabled the ForemanException thrown in the QueryManager.
 drillbitUnregistered method, most of our query started to run
>> successfully,
 we would sometime get Drillbit Disconnected error within the rpc work
>> bus.
 It did confirm that we still had something on our network going on,
>> but it
 also showed that the RPC bus between drillbits was more resilient to
 network hiccup. I could not prove it, but I think under certain
>> condition,
 the ZK session gets recreated, which cause a Query Manager unregistered
 (query fail) and register call right after, but the RPC
 bus  would remains connected.
 
 
 We really appreciate your feedback and we hope to contribute to this
>> great
 project in the future.
 Thanks
 Francois
 
 
 
 
 
 
 On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
>> ppenumar...@maprtech.com >
 wrote:
 
> 
> Seems like you have 215 nodes, but the data for your query is there on
> only 12 nodes.
> Drill tries to distribute the scan fragments across the cluster more
> uniformly (trying to utilize all CPU resources).
> That is why you have lot of remote reads going on and increasing
>> affinity
> factor eliminates running scan
> fragments on the other (215-12) nodes.
> 
> you also mentioned planner.width.max_per_node is set to 

Re: ZK lost connectivity issue on large cluster

2016-10-20 Thread Sudheesh Katkam
The mailing list does not seem to allow for images. Can you put the image 
elsewhere (Github or Dropbox), and reply with a link to it?

- Sudheesh

> On Oct 19, 2016, at 5:37 PM, François Méthot  wrote:
> 
> We had problem on the 220 nodes cluster. No problem on the 12 nodes cluster.
> 
> I agree that the data may not be distributed evenly. It would be a long and 
> tedious process for me to produce a report. 
> 
> Here is a drawing of the fragments overview before and after the change of
> the affinity factor on a sample query run on the 220 nodes cluster.
> max_width_per_node=8 on both, but it turned out to be irrelevant to the issue.
> 
> 
> 
> 
> 
> 
> Before: SYSTEM ERROR: ForemanException: One or more nodes lost connectivity
> during query.  Identified nodes were [server121:31010].
> 
> After: error is gone
> 
> Before: low disk io, high network io on the bottom part of the graph
> after : high disk io, low network io on the bottom part of the graph
> 
> 
> 
> 
> 
> 
> 
> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy  > wrote:
> Hi Francois,
> 
> It would be good to understand how increasing affinity_factor helped in your 
> case
> so we can better document and also use that knowledge to improve things in 
> future release.
> 
> If you have two clusters,  it is not clear whether you had the problem on 12 
> node cluster
> or 220 node cluster or both. Is the dataset same on both ? Is 
> max_width_per_node=8 in both clusters ?
> 
> Increasing affinity factor will lower remote reads  by scheduling more 
> fragments/doing more work
> on nodes which have data available locally.  So, there seem to be some kind 
> of non uniform
> data distribution for sure. It would be good if you can provide more details 
> i.e. how the data is
> distributed in the cluster and how the load on the nodes changed when 
> affinity factor was increased.
> 
> Thanks,
> Padma
> 
> 
> > On Oct 14, 2016, at 6:45 PM, François Méthot  > > wrote:
> >
> > We have  a 12 nodes cluster and a 220 nodes cluster, but they do not talk
> > to each other. So Padma's analysis do not apply but thanks for your
> > comments. Our goal had been to run Drill on the 220 nodes cluster after it
> > proved worthy of it on the small cluster.
> >
> > planner.width.max_per_node was eventually reduced to 2 when we were trying
> > to figure this out, it would still fail. After we figured out the
> > affinity_factor, we put it back to its original value and it would work
> > fine.
> >
> >
> >
> > Sudheesh: Indeed, The Zk/drill services use the same network on our bigger
> > cluster.
> >
> > potential improvements:
> > - planner.affinity_factor should be better documented.
> > - When ZK disconnected, the running queries systematically failed. When we
> > disabled the ForemanException thrown in the QueryManager.
> > drillbitUnregistered method, most of our query started to run successfully,
> > we would sometime get Drillbit Disconnected error within the rpc work bus.
> > It did confirm that we still had something on our network going on, but it
> > also showed that the RPC bus between drillbits was more resilient to
> > network hiccup. I could not prove it, but I think under certain condition,
> > the ZK session gets recreated, which cause a Query Manager unregistered
> > (query fail) and register call right after, but the RPC
> > bus  would remains connected.
> >
> >
> > We really appreciate your feedback and we hope to contribute to this great
> > project in the future.
> > Thanks
> > Francois
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy  > >
> > wrote:
> >
> >>
> >> Seems like you have 215 nodes, but the data for your query is there on
> >> only 12 nodes.
> >> Drill tries to distribute the scan fragments across the cluster more
> >> uniformly (trying to utilize all CPU resources).
> >> That is why you have lot of remote reads going on and increasing affinity
> >> factor eliminates running scan
> >> fragments on the other (215-12) nodes.
> >>
> >> you also mentioned planner.width.max_per_node is set to 8.
> >> So, with increased affinity factor,  you have 8 scan fragments doing a lot
> >> more work on these 12 nodes.
> >> Still, you got 10X improvement. Seems like your network is the obvious
> >> bottleneck. Is it a 10G or 1G ?
> >>
> >> Also, increasing affinity factor helped in your case because there is no
> >> data on other nodes.
> >> But, if you have data non uniformly distributed across more nodes, you
> >> might still have the problem.
> >>
> >> Thanks,
> >> Padma
> >>
> >>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam  >>> >
> >> wrote:
> >>>
> >>> Hi Francois,
> >>>
> >>> Thank you for posting your findings! Glad to see a 10X improvement.
> >>>
> >>> By 

Re: ZK lost connectivity issue on large cluster

2016-10-19 Thread François Méthot
We had the problem on the 220 nodes cluster. No problem on the 12 nodes cluster.

I agree that the data may not be distributed evenly. It would be a long and
tedious process for me to produce a report.

Here is a drawing of the fragments overview before and after the change
of the affinity factor on a sample query run on the 220 nodes
cluster.  max_width_per_node=8
on both, but it turned out to be irrelevant to the issue.



[image: Inline image 1]


Before: SYSTEM ERROR: ForemanException: One or more nodes lost
connectivity during query. Identified nodes were [server121:31010].

After: error is gone

Before: low disk io, high network io on the bottom part of the graph
after : high disk io, low network io on the bottom part of the graph







On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy  wrote:

> Hi Francois,
>
> It would be good to understand how increasing affinity_factor helped in
> your case
> so we can better document and also use that knowledge to improve things in
> future release.
>
> If you have two clusters,  it is not clear whether you had the problem on
> 12 node cluster
> or 220 node cluster or both. Is the dataset same on both ? Is
> max_width_per_node=8 in both clusters ?
>
> Increasing affinity factor will lower remote reads  by scheduling more
> fragments/doing more work
> on nodes which have data available locally.  So, there seem to be some
> kind of non uniform
> data distribution for sure. It would be good if you can provide more
> details i.e. how the data is
> distributed in the cluster and how the load on the nodes changed when
> affinity factor was increased.
>
> Thanks,
> Padma
>
>
> > On Oct 14, 2016, at 6:45 PM, François Méthot 
> wrote:
> >
> > We have  a 12 nodes cluster and a 220 nodes cluster, but they do not talk
> > to each other. So Padma's analysis do not apply but thanks for your
> > comments. Our goal had been to run Drill on the 220 nodes cluster after
> it
> > proved worthy of it on the small cluster.
> >
> > planner.width.max_per_node was eventually reduced to 2 when we were
> trying
> > to figure this out, it would still fail. After we figured out the
> > affinity_factor, we put it back to its original value and it would work
> > fine.
> >
> >
> >
> > Sudheesh: Indeed, The Zk/drill services use the same network on our
> bigger
> > cluster.
> >
> > potential improvements:
> > - planner.affinity_factor should be better documented.
> > - When ZK disconnected, the running queries systematically failed. When
> we
> > disabled the ForemanException thrown in the QueryManager.
> > drillbitUnregistered method, most of our query started to run
> successfully,
> > we would sometime get Drillbit Disconnected error within the rpc work
> bus.
> > It did confirm that we still had something on our network going on, but
> it
> > also showed that the RPC bus between drillbits was more resilient to
> > network hiccup. I could not prove it, but I think under certain
> condition,
> > the ZK session gets recreated, which cause a Query Manager unregistered
> > (query fail) and register call right after, but the RPC
> > bus  would remains connected.
> >
> >
> > We really appreciate your feedback and we hope to contribute to this
> great
> > project in the future.
> > Thanks
> > Francois
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
> ppenumar...@maprtech.com>
> > wrote:
> >
> >>
> >> Seems like you have 215 nodes, but the data for your query is there on
> >> only 12 nodes.
> >> Drill tries to distribute the scan fragments across the cluster more
> >> uniformly (trying to utilize all CPU resources).
> >> That is why you have lot of remote reads going on and increasing
> affinity
> >> factor eliminates running scan
> >> fragments on the other (215-12) nodes.
> >>
> >> you also mentioned planner.width.max_per_node is set to 8.
> >> So, with increased affinity factor,  you have 8 scan fragments doing a
> lot
> >> more work on these 12 nodes.
> >> Still, you got 10X improvement. Seems like your network is the obvious
> >> bottleneck. Is it a 10G or 1G ?
> >>
> >> Also, increasing affinity factor helped in your case because there is no
> >> data on other nodes.
> >> But, if you have data non uniformly distributed across more nodes, you
> >> might still have the problem.
> >>
> >> Thanks,
> >> Padma
> >>
> >>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam 
> >> wrote:
> >>>
> >>> Hi Francois,
> >>>
> >>> Thank you for posting your findings! Glad to see a 10X improvement.
> >>>
> >>> By increasing affinity factor, looks like Drill’s parallelizer is
> forced
> >> to assign fragments on nodes with data i.e. with high favorability for
> data
> >> locality.
> >>>
> >>> Regarding the random disconnection, I agree with your guess that the
> >> network bandwidth is being used up by remote reads which causes lags in
> >> drillbit to ZooKeeper heartbeats (since these services use the same
> >> 

Re: ZK lost connectivity issue on large cluster

2016-10-17 Thread Padma Penumarthy
Hi Francois,

It would be good to understand how increasing affinity_factor helped in your
case, so we can document it better and also use that knowledge to improve
things in a future release.

If you have two clusters, it is not clear whether you had the problem on the
12 node cluster or the 220 node cluster, or both. Is the dataset the same on
both? Is max_width_per_node=8 in both clusters?

Increasing the affinity factor will lower remote reads by scheduling more
fragments/doing more work on nodes which have the data available locally.
So, there seems to be some kind of non-uniform data distribution for sure.
It would be good if you could provide more details, i.e. how the data is
distributed in the cluster and how the load on the nodes changed when the
affinity factor was increased.

Thanks,
Padma


> On Oct 14, 2016, at 6:45 PM, François Méthot  wrote:
> 
> We have  a 12 nodes cluster and a 220 nodes cluster, but they do not talk
> to each other. So Padma's analysis do not apply but thanks for your
> comments. Our goal had been to run Drill on the 220 nodes cluster after it
> proved worthy of it on the small cluster.
> 
> planner.width.max_per_node was eventually reduced to 2 when we were trying
> to figure this out, it would still fail. After we figured out the
> affinity_factor, we put it back to its original value and it would work
> fine.
> 
> 
> 
> Sudheesh: Indeed, The Zk/drill services use the same network on our bigger
> cluster.
> 
> potential improvements:
> - planner.affinity_factor should be better documented.
> - When ZK disconnected, the running queries systematically failed. When we
> disabled the ForemanException thrown in the QueryManager.
> drillbitUnregistered method, most of our query started to run successfully,
> we would sometime get Drillbit Disconnected error within the rpc work bus.
> It did confirm that we still had something on our network going on, but it
> also showed that the RPC bus between drillbits was more resilient to
> network hiccup. I could not prove it, but I think under certain condition,
> the ZK session gets recreated, which cause a Query Manager unregistered
> (query fail) and register call right after, but the RPC
> bus  would remains connected.
> 
> 
> We really appreciate your feedback and we hope to contribute to this great
> project in the future.
> Thanks
> Francois
> 
> 
> 
> 
> 
> 
> On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy 
> wrote:
> 
>> 
>> Seems like you have 215 nodes, but the data for your query is there on
>> only 12 nodes.
>> Drill tries to distribute the scan fragments across the cluster more
>> uniformly (trying to utilize all CPU resources).
>> That is why you have lot of remote reads going on and increasing affinity
>> factor eliminates running scan
>> fragments on the other (215-12) nodes.
>> 
>> you also mentioned planner.width.max_per_node is set to 8.
>> So, with increased affinity factor,  you have 8 scan fragments doing a lot
>> more work on these 12 nodes.
>> Still, you got 10X improvement. Seems like your network is the obvious
>> bottleneck. Is it a 10G or 1G ?
>> 
>> Also, increasing affinity factor helped in your case because there is no
>> data on other nodes.
>> But, if you have data non uniformly distributed across more nodes, you
>> might still have the problem.
>> 
>> Thanks,
>> Padma
>> 
>>> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam 
>> wrote:
>>> 
>>> Hi Francois,
>>> 
>>> Thank you for posting your findings! Glad to see a 10X improvement.
>>> 
>>> By increasing affinity factor, looks like Drill’s parallelizer is forced
>> to assign fragments on nodes with data i.e. with high favorability for data
>> locality.
>>> 
>>> Regarding the random disconnection, I agree with your guess that the
>> network bandwidth is being used up by remote reads which causes lags in
>> drillbit to ZooKeeper heartbeats (since these services use the same
>> network)? Maybe others can comment here.
>>> 
>>> Thank you,
>>> Sudheesh
>>> 
 On Oct 12, 2016, at 6:06 PM, François Méthot 
>> wrote:
 
 Hi,
 
 We finally got rid of this error. We have tried many, many things  (like
 modifying drill to ignore the error!), it ultimately came down to this
 change:
 
 from default
 planner.affinity_factor=1.2
 to
 planner.affinity_factor=100
 
 Basically this encourages fragment to only care about locally stored
>> files.
 We looked at the code that used that property and figured that 100 would
 have strong impact.
 
 What led us to this property is the fact that 1/4 of our fragments would
 take a lot more time to complete their scan, up to  10x the time of the
 fastest nodes.  On the slower nodes, Cloudera Manager would show very
>> low
 Disk IOPS with high Network IO compare to our faster nodes. We had
>> noticed
 that before but figured it would be some optimization to be done later
>> 

Re: ZK lost connectivity issue on large cluster

2016-10-14 Thread François Méthot
We have a 12 node cluster and a 220 node cluster, but they do not talk
to each other, so Padma's analysis does not apply; thanks for your
comments anyway. Our goal had been to run Drill on the 220 node cluster
after it proved itself on the small cluster.

planner.width.max_per_node was eventually reduced to 2 when we were trying
to figure this out; it would still fail. After we figured out the
affinity_factor, we put it back to its original value and it would work
fine.
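
For reference, that kind of change is just a standard Drill option statement;
a minimal sketch, assuming the usual ALTER SESSION syntax and that 8 was the
original value on this cluster:

   -- lower per-node parallelism while troubleshooting
   ALTER SESSION SET `planner.width.max_per_node` = 2;
   -- restore the original value afterwards
   ALTER SESSION SET `planner.width.max_per_node` = 8;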



Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger
cluster.

Potential improvements:
- planner.affinity_factor should be better documented.
- When ZK disconnected, the running queries systematically failed. When we
disabled the ForemanException thrown in the QueryManager.drillbitUnregistered
method, most of our queries started to run successfully; we would sometimes
get a Drillbit Disconnected error within the RPC work bus. It did confirm
that we still had something going on on our network, but it also showed that
the RPC bus between drillbits was more resilient to network hiccups. I could
not prove it, but I think that under certain conditions the ZK session gets
recreated, which causes a QueryManager unregister (query fails) and a
register call right after, while the RPC bus remains connected.


We really appreciate your feedback and we hope to contribute to this great
project in the future.
Thanks
Francois






On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy 
wrote:

>
> Seems like you have 215 nodes, but the data for your query is there on
> only 12 nodes.
> Drill tries to distribute the scan fragments across the cluster more
> uniformly (trying to utilize all CPU resources).
> That is why you have lot of remote reads going on and increasing affinity
> factor eliminates running scan
> fragments on the other (215-12) nodes.
>
> you also mentioned planner.width.max_per_node is set to 8.
> So, with increased affinity factor,  you have 8 scan fragments doing a lot
> more work on these 12 nodes.
> Still, you got 10X improvement. Seems like your network is the obvious
> bottleneck. Is it a 10G or 1G ?
>
> Also, increasing affinity factor helped in your case because there is no
> data on other nodes.
> But, if you have data non uniformly distributed across more nodes, you
> might still have the problem.
>
> Thanks,
> Padma
>
> > On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam 
> wrote:
> >
> > Hi Francois,
> >
> > Thank you for posting your findings! Glad to see a 10X improvement.
> >
> > By increasing affinity factor, looks like Drill’s parallelizer is forced
> to assign fragments on nodes with data i.e. with high favorability for data
> locality.
> >
> > Regarding the random disconnection, I agree with your guess that the
> network bandwidth is being used up by remote reads which causes lags in
> drillbit to ZooKeeper heartbeats (since these services use the same
> network)? Maybe others can comment here.
> >
> > Thank you,
> > Sudheesh
> >
> >> On Oct 12, 2016, at 6:06 PM, François Méthot 
> wrote:
> >>
> >> Hi,
> >>
> >> We finally got rid of this error. We have tried many, many things  (like
> >> modifying drill to ignore the error!), it ultimately came down to this
> >> change:
> >>
> >> from default
> >> planner.affinity_factor=1.2
> >>  to
> >> planner.affinity_factor=100
> >>
> >> Basically this encourages fragment to only care about locally stored
> files.
> >> We looked at the code that used that property and figured that 100 would
> >> have strong impact.
> >>
> >> What led us to this property is the fact that 1/4 of our fragments would
> >> take a lot more time to complete their scan, up to  10x the time of the
> >> fastest nodes.  On the slower nodes, Cloudera Manager would show very
> low
> >> Disk IOPS with high Network IO compare to our faster nodes. We had
> noticed
> >> that before but figured it would be some optimization to be done later
> when
> >> more pressing issue would be fixed, like Zk disconnection and OOM. We
> were
> >> desperate and decided to fix anything that  would look unusual.
> >>
> >> After this change, query ran up to 10x faster.
> >>
> >> We no longer get random disconnection between node and Zookeeper.
> >>
> >> We are still wondering why exactly. Network should not be a bottleneck.
> >> Could high network traffic between a Drillbit and HDFS causes the
> Drillbit
> >> to timeout with zookeeper?
> >>
> >>
> >> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
> >> wrote:
> >>
> >>> After the 30 seconds gap, all the Drill nodes receives the following:
> >>>
> >>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit
> set
> >>> changed. Now includes 220 total bits. New Active drill bits
> >>> ...faulty node is not on the list...
> >>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit
> set
> >>> changed. Now includes 221 total bits. New Active 

Re: ZK lost connectivity issue on large cluster

2016-10-14 Thread Padma Penumarthy

Seems like you have 215 nodes, but the data for your query is there on only 12
nodes.
Drill tries to distribute the scan fragments across the cluster more uniformly
(trying to utilize all CPU resources).
That is why you have a lot of remote reads going on, and increasing the
affinity factor eliminates running scan fragments on the other (215-12) nodes.

You also mentioned planner.width.max_per_node is set to 8.
So, with the increased affinity factor, you have 8 scan fragments doing a lot
more work on these 12 nodes.
Still, you got a 10X improvement. Seems like your network is the obvious
bottleneck. Is it 10G or 1G?

Also, increasing the affinity factor helped in your case because there is no
data on the other nodes.
But if you have data non-uniformly distributed across more nodes, you might
still have the problem.
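
As a side note, the current values of these planner options can be checked
with a query against the sys.options system table (a minimal sketch; the
exact columns vary by Drill version):

   -- inspect planner options such as affinity_factor and width.max_per_node
   SELECT * FROM sys.options WHERE name LIKE 'planner.%';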

Thanks,
Padma

> On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam  wrote:
> 
> Hi Francois,
> 
> Thank you for posting your findings! Glad to see a 10X improvement.
> 
> By increasing affinity factor, looks like Drill’s parallelizer is forced to 
> assign fragments on nodes with data i.e. with high favorability for data 
> locality.
> 
> Regarding the random disconnection, I agree with your guess that the network 
> bandwidth is being used up by remote reads which causes lags in drillbit to 
> ZooKeeper heartbeats (since these services use the same network)? Maybe 
> others can comment here.
> 
> Thank you,
> Sudheesh
> 
>> On Oct 12, 2016, at 6:06 PM, François Méthot  wrote:
>> 
>> Hi,
>> 
>> We finally got rid of this error. We have tried many, many things  (like
>> modifying drill to ignore the error!), it ultimately came down to this
>> change:
>> 
>> from default
>> planner.affinity_factor=1.2
>>  to
>> planner.affinity_factor=100
>> 
>> Basically this encourages fragment to only care about locally stored files.
>> We looked at the code that used that property and figured that 100 would
>> have strong impact.
>> 
>> What led us to this property is the fact that 1/4 of our fragments would
>> take a lot more time to complete their scan, up to  10x the time of the
>> fastest nodes.  On the slower nodes, Cloudera Manager would show very low
>> Disk IOPS with high Network IO compare to our faster nodes. We had noticed
>> that before but figured it would be some optimization to be done later when
>> more pressing issue would be fixed, like Zk disconnection and OOM. We were
>> desperate and decided to fix anything that  would look unusual.
>> 
>> After this change, query ran up to 10x faster.
>> 
>> We no longer get random disconnection between node and Zookeeper.
>> 
>> We are still wondering why exactly. Network should not be a bottleneck.
>> Could high network traffic between a Drillbit and HDFS causes the Drillbit
>> to timeout with zookeeper?
>> 
>> 
>> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
>> wrote:
>> 
>>> After the 30 seconds gap, all the Drill nodes receives the following:
>>> 
>>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
>>> changed. Now includes 220 total bits. New Active drill bits
>>> ...faulty node is not on the list...
>>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
>>> changed. Now includes 221 total bits. New Active drill bits
>>> ...faulty node is back on the list...
>>> 
>>> 
>>> So the faulty Drill node get unregistered and registered right after.
>>> 
>>> Drill is using the low level API for registering and unregistering, and
>>> the only place with unregistering occurs is when the drillbit is closed at
>>> shutdown.
>>> 
>>> That particular drillbit is still up and running after those log, it could
>>> not have trigger the unregistering process through a shutdown.
>>> 
>>> 
>>> 
>>> 
>>> Would you have an idea what else could cause a Drillbit to be unregistered
>>> from the DiscoveryService and registered again right after?
>>> 
>>> 
>>> 
>>> We are using Zookeeper 3.4.5
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
>>> wrote:
>>> 
 Hi,
 
 We have switched to 1.8 and we are still getting node disconnection.
 
 We did many tests, we thought initially our stand alone parquet converter
 was generating parquet files with problematic data (like 10K characters
 string), but we were able to reproduce it with employee data from the
 tutorial.
 
 For example,  we duplicated the Drill Tutorial "Employee" data to reach
 500 M records spread over 130 parquet files.
 Each files is ~60 MB.
 
 
 We ran this query over and over on 5 different sessions using a script:
  select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
 not exist%';
 
  Query return no rows and would take ~35 to 45 seconds to return.
 
 Leaving the script running on each node, we eventually hit the 

Re: ZK lost connectivity issue on large cluster

2016-10-14 Thread Chun Chang
I am surprised that it's not the default.

On Fri, Oct 14, 2016 at 11:18 AM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> Thank you for posting your findings! Glad to see a 10X improvement.
>
> By increasing affinity factor, looks like Drill’s parallelizer is forced
> to assign fragments on nodes with data i.e. with high favorability for data
> locality.
>
> Regarding the random disconnection, I agree with your guess that the
> network bandwidth is being used up by remote reads which causes lags in
> drillbit to ZooKeeper heartbeats (since these services use the same
> network)? Maybe others can comment here.
>
> Thank you,
> Sudheesh
>
> > On Oct 12, 2016, at 6:06 PM, François Méthot 
> wrote:
> >
> > Hi,
> >
> > We finally got rid of this error. We have tried many, many things  (like
> > modifying drill to ignore the error!), it ultimately came down to this
> > change:
> >
> > from default
> > planner.affinity_factor=1.2
> >   to
> > planner.affinity_factor=100
> >
> > Basically this encourages fragment to only care about locally stored
> files.
> > We looked at the code that used that property and figured that 100 would
> > have strong impact.
> >
> > What led us to this property is the fact that 1/4 of our fragments would
> > take a lot more time to complete their scan, up to  10x the time of the
> > fastest nodes.  On the slower nodes, Cloudera Manager would show very low
> > Disk IOPS with high Network IO compare to our faster nodes. We had
> noticed
> > that before but figured it would be some optimization to be done later
> when
> > more pressing issue would be fixed, like Zk disconnection and OOM. We
> were
> > desperate and decided to fix anything that  would look unusual.
> >
> > After this change, query ran up to 10x faster.
> >
> > We no longer get random disconnection between node and Zookeeper.
> >
> > We are still wondering why exactly. Network should not be a bottleneck.
> > Could high network traffic between a Drillbit and HDFS causes the
> Drillbit
> > to timeout with zookeeper?
> >
> >
> > On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
> > wrote:
> >
> >> After the 30 seconds gap, all the Drill nodes receives the following:
> >>
> >> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit
> set
> >> changed. Now includes 220 total bits. New Active drill bits
> >> ...faulty node is not on the list...
> >> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit
> set
> >> changed. Now includes 221 total bits. New Active drill bits
> >> ...faulty node is back on the list...
> >>
> >>
> >> So the faulty Drill node get unregistered and registered right after.
> >>
> >> Drill is using the low level API for registering and unregistering, and
> >> the only place with unregistering occurs is when the drillbit is closed
> at
> >> shutdown.
> >>
> >> That particular drillbit is still up and running after those log, it
> could
> >> not have trigger the unregistering process through a shutdown.
> >>
> >>
> >>
> >>
> >> Would you have an idea what else could cause a Drillbit to be
> unregistered
> >> from the DiscoveryService and registered again right after?
> >>
> >>
> >>
> >> We are using Zookeeper 3.4.5
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> We have switched to 1.8 and we are still getting node disconnection.
> >>>
> >>> We did many tests, we thought initially our stand alone parquet
> converter
> >>> was generating parquet files with problematic data (like 10K characters
> >>> string), but we were able to reproduce it with employee data from the
> >>> tutorial.
> >>>
> >>> For example,  we duplicated the Drill Tutorial "Employee" data to reach
> >>> 500 M records spread over 130 parquet files.
> >>> Each files is ~60 MB.
> >>>
> >>>
> >>> We ran this query over and over on 5 different sessions using a script:
> >>>   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
> >>> not exist%';
> >>>
> >>>   Query return no rows and would take ~35 to 45 seconds to return.
> >>>
> >>> Leaving the script running on each node, we eventually hit the "nodes
> >>> lost connectivity during query" error.
> >>>
> >>> On the node that failed,
> >>>
> >>>   We see those log:
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
> >>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
> >>> RUNNING
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
> >>>
> >>> <--- 30 seconds gap for that fragment --->
> >>>
> >>> 2016-09-26 20:37:09,976 [BitServer-2] WARN o.a.d.exec.rpc.control.
> ControlServer
> >>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
> >>> duration was 23617ms.
> >>>
> >>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> 

Re: ZK lost connectivity issue on large cluster

2016-10-14 Thread Sudheesh Katkam
Hi Francois,

Thank you for posting your findings! Glad to see a 10X improvement.

By increasing the affinity factor, it looks like Drill’s parallelizer is
forced to assign fragments on nodes with data, i.e. with high favorability
for data locality.

Regarding the random disconnection, I agree with your guess that the network
bandwidth is being used up by remote reads, which causes lags in drillbit to
ZooKeeper heartbeats (since these services use the same network)? Maybe others
can comment here.

Thank you,
Sudheesh

> On Oct 12, 2016, at 6:06 PM, François Méthot  wrote:
> 
> Hi,
> 
> We finally got rid of this error. We have tried many, many things  (like
> modifying drill to ignore the error!), it ultimately came down to this
> change:
> 
> from default
> planner.affinity_factor=1.2
>   to
> planner.affinity_factor=100
> 
> Basically this encourages fragment to only care about locally stored files.
> We looked at the code that used that property and figured that 100 would
> have strong impact.
> 
> What led us to this property is the fact that 1/4 of our fragments would
> take a lot more time to complete their scan, up to  10x the time of the
> fastest nodes.  On the slower nodes, Cloudera Manager would show very low
> Disk IOPS with high Network IO compare to our faster nodes. We had noticed
> that before but figured it would be some optimization to be done later when
> more pressing issue would be fixed, like Zk disconnection and OOM. We were
> desperate and decided to fix anything that  would look unusual.
> 
> After this change, query ran up to 10x faster.
> 
> We no longer get random disconnection between node and Zookeeper.
> 
> We are still wondering why exactly. Network should not be a bottleneck.
> Could high network traffic between a Drillbit and HDFS causes the Drillbit
> to timeout with zookeeper?
> 
> 
> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
> wrote:
> 
>> After the 30 seconds gap, all the Drill nodes receives the following:
>> 
>> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
>> changed. Now includes 220 total bits. New Active drill bits
>> ...faulty node is not on the list...
>> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
>> changed. Now includes 221 total bits. New Active drill bits
>> ...faulty node is back on the list...
>> 
>> 
>> So the faulty Drill node get unregistered and registered right after.
>> 
>> Drill is using the low level API for registering and unregistering, and
>> the only place with unregistering occurs is when the drillbit is closed at
>> shutdown.
>> 
>> That particular drillbit is still up and running after those log, it could
>> not have trigger the unregistering process through a shutdown.
>> 
>> 
>> 
>> 
>> Would you have an idea what else could cause a Drillbit to be unregistered
>> from the DiscoveryService and registered again right after?
>> 
>> 
>> 
>> We are using Zookeeper 3.4.5
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
>> wrote:
>> 
>>> Hi,
>>> 
>>> We have switched to 1.8 and we are still getting node disconnection.
>>> 
>>> We did many tests, we thought initially our stand alone parquet converter
>>> was generating parquet files with problematic data (like 10K characters
>>> string), but we were able to reproduce it with employee data from the
>>> tutorial.
>>> 
>>> For example,  we duplicated the Drill Tutorial "Employee" data to reach
>>> 500 M records spread over 130 parquet files.
>>> Each files is ~60 MB.
>>> 
>>> 
>>> We ran this query over and over on 5 different sessions using a script:
>>>   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
>>> not exist%';
>>> 
>>>   Query return no rows and would take ~35 to 45 seconds to return.
>>> 
>>> Leaving the script running on each node, we eventually hit the "nodes
>>> lost connectivity during query" error.
>>> 
>>> On the node that failed,
>>> 
>>>   We see those log:
>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
>>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
>>> RUNNING
>>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
>>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>>> 
>>> <--- 30 seconds gap for that fragment --->
>>> 
>>> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
>>> o.a.d.exec.rpc.control.ControlServer
>>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
>>> duration was 23617ms.
>>> 
>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>>> remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
>>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
>>> manager : ...uuid...
>>> 
>>> 
>>> 
>>> For the same query on a 

Re: ZK lost connectivity issue on large cluster

2016-10-12 Thread François Méthot
Hi,

 We finally got rid of this error. We tried many, many things (like
modifying Drill to ignore the error!); it ultimately came down to this
change:

 from default
planner.affinity_factor=1.2
   to
planner.affinity_factor=100

Basically this encourages fragments to only care about locally stored files.
We looked at the code that uses that property and figured that 100 would
have a strong impact.
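
For reference, the change itself is a one-line option statement; a minimal
sketch, assuming the standard ALTER SYSTEM syntax (ALTER SESSION can be used
instead to try it out for a single session first):

   -- strongly favor scheduling scan fragments on nodes that hold the data
   ALTER SYSTEM SET `planner.affinity_factor` = 100.0;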

What led us to this property is the fact that 1/4 of our fragments would
take a lot more time to complete their scan, up to 10x the time of the
fastest nodes.  On the slower nodes, Cloudera Manager would show very low
disk IOPS with high network IO compared to our faster nodes. We had noticed
that before, but figured it would be some optimization to be done later when
more pressing issues were fixed, like ZK disconnections and OOM. We were
desperate and decided to fix anything that looked unusual.

After this change, queries ran up to 10x faster.

We no longer get random disconnections between nodes and ZooKeeper.

We are still wondering why exactly. The network should not be a bottleneck.
Could high network traffic between a Drillbit and HDFS cause the Drillbit
to time out with ZooKeeper?


On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
wrote:

> After the 30 seconds gap, all the Drill nodes receives the following:
>
> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
> changed. Now includes 220 total bits. New Active drill bits
> ...faulty node is not on the list...
> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
> changed. Now includes 221 total bits. New Active drill bits
> ...faulty node is back on the list...
>
>
> So the faulty Drill node get unregistered and registered right after.
>
> Drill is using the low level API for registering and unregistering, and
> the only place with unregistering occurs is when the drillbit is closed at
> shutdown.
>
> That particular drillbit is still up and running after those log, it could
> not have trigger the unregistering process through a shutdown.
>
>
>
>
> Would you have an idea what else could cause a Drillbit to be unregistered
> from the DiscoveryService and registered again right after?
>
>
>
> We are using Zookeeper 3.4.5
>
>
>
>
>
>
>
>
>
>
> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
> wrote:
>
>> Hi,
>>
>>  We have switched to 1.8 and we are still getting node disconnection.
>>
>> We did many tests, we thought initially our stand alone parquet converter
>> was generating parquet files with problematic data (like 10K characters
>> string), but we were able to reproduce it with employee data from the
>> tutorial.
>>
>> For example,  we duplicated the Drill Tutorial "Employee" data to reach
>> 500 M records spread over 130 parquet files.
>> Each files is ~60 MB.
>>
>>
>> We ran this query over and over on 5 different sessions using a script:
>>select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
>> not exist%';
>>
>>Query return no rows and would take ~35 to 45 seconds to return.
>>
>> Leaving the script running on each node, we eventually hit the "nodes
>> lost connectivity during query" error.
>>
>> On the node that failed,
>>
>>We see those log:
>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
>> RUNNING
>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>>
>> <--- 30 seconds gap for that fragment --->
>>
>> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
>> o.a.d.exec.rpc.control.ControlServer
>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
>> duration was 23617ms.
>>
>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>> remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
>> manager : ...uuid...
>>
>>
>>
>> For the same query on a working node:
>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
>> RUNNING
>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>> remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>> remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
>> o.a.d.e.s.p.c.ParquetRecordReader - Read 

Re: ZK lost connectivity issue on large cluster

2016-09-30 Thread François Méthot
After the 30-second gap, all the Drill nodes receive the following:

2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
changed. Now includes 220 total bits. New Active drill bits
...faulty node is not on the list...
2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
changed. Now includes 221 total bits. New Active drill bits
...faulty node is back on the list...


So the faulty Drill node gets unregistered and registered right after.

Drill is using the low-level API for registering and unregistering, and the
only place where unregistering occurs is when the drillbit is closed at
shutdown.

That particular drillbit is still up and running after those logs, so it could
not have triggered the unregistering process through a shutdown.




Would you have an idea what else could cause a Drillbit to be unregistered
from the DiscoveryService and registered again right after?
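
Side note: the set of drillbits currently registered can also be checked from
SQL, which makes it easy to spot an unregister/re-register from the cluster's
point of view. A minimal sketch, assuming the sys.drillbits system table is
available in this Drill version:

   -- list the drillbits the Foreman currently sees as registered
   SELECT * FROM sys.drillbits;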



We are using Zookeeper 3.4.5










On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
wrote:

> Hi,
>
>  We have switched to 1.8 and we are still getting node disconnection.
>
> We did many tests, we thought initially our stand alone parquet converter
> was generating parquet files with problematic data (like 10K characters
> string), but we were able to reproduce it with employee data from the
> tutorial.
>
> For example,  we duplicated the Drill Tutorial "Employee" data to reach
> 500 M records spread over 130 parquet files.
> Each files is ~60 MB.
>
>
> We ran this query over and over on 5 different sessions using a script:
>select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
> not exist%';
>
>Query return no rows and would take ~35 to 45 seconds to return.
>
> Leaving the script running on each node, we eventually hit the "nodes lost
> connectivity during query" error.
>
> On the node that failed,
>
>We see those log:
> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO 
> o.a.d.e.w.f.FragmentStatusReporter
> - ...uuid...:1:10: State to report: RUNNING
> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>
> <--- 30 seconds gap for that fragment --->
>
> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
> o.a.d.exec.rpc.control.ControlServer
> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
> duration was 23617ms.
>
> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG 
> o.a.d.exec.rpc.control.WorkEventBus
> - Cancelling and removing fragment manager : ...uuid...
>
>
>
> For the same query on a working node:
> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO 
> o.a.d.e.w.f.FragmentStatusReporter
> - ...uuid...:1:2: State to report: RUNNING
> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.s.p.c.ParquetRecordReader
> - Read 87573 records out of row groups(0) in file `/data/drill/tmp/PARQUET_
> EMPLOYEE/0_0_14.parquet
>
>
>
>
> We are investigating what could get cause that 30 seconds gap for that
> fragment.
>
> Any idea let us know
>
> Thanks
> Francois
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot 
> wrote:
>
>> Hi Sudheesh,
>>
>>   If I add selection filter so that no row are returned, the same problem
>> occur. I also simplified the query to include only few integer columns.
>>
>> That particular data repo is ~200+ Billions records spread over ~50 000
>> parquet files.
>>
>> We have other CSV data repo that are 100x smaller that does not trigger
>> this issue.
>>
>>
>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>> There is also a bizarre case where the node that is reported as lost is the
>> node itself.
>> Yes, the stack trace is from the ticket, It did occurred once or twice
>> (in the many many attempts) that it was the node itself.
>>
>> + Is there a spike in memory usage of the Drillbit this is the Foreman
>> for the query (process memory, not just heap)?
>> We don't notice any unusual spike, each nodes gets busy in the same range
>> when query is running.
>>
>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see any
>> improvement.
>>
>>
>> We will update from 1.7 to 

Re: ZK lost connectivity issue on large cluster

2016-09-28 Thread François Méthot
Hi,

 We have switched to 1.8 and we are still getting node disconnections.

We did many tests. We initially thought our stand-alone Parquet converter
was generating Parquet files with problematic data (like 10K character
strings), but we were able to reproduce it with the employee data from the
tutorial.

For example, we duplicated the Drill tutorial "Employee" data to reach 500
M records spread over 130 Parquet files.
Each file is ~60 MB.
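
For anyone trying to reproduce this, here is a minimal sketch of how such a
Parquet copy of the tutorial data can be produced. It assumes the bundled
cp.`employee.json` sample and a writable hdfs.tmp workspace, and is not
necessarily the exact commands we used:

   -- write the sample employee data out as Parquet
   ALTER SESSION SET `store.format` = 'parquet';
   CREATE TABLE hdfs.tmp.`PARQUET_EMPLOYEE` AS
   SELECT * FROM cp.`employee.json`;

Duplicating/unioning the source a few times gets the row count up to the
volume described above.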


We ran this query over and over on 5 different sessions using a script:
   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
not exist%';

   The query returns no rows and would take ~35 to 45 seconds to return.

Leaving the script running on each node, we eventually hit the "nodes lost
connectivity during query" error.

On the node that failed,

   We see these logs:
2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
RUNNING
2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010

<--- 30 seconds gap for that fragment --->

2016-09-26 20:37:09,976 [BitServer-2] WARN
o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
2 took longer then 500 ms. Actual duration was 23617ms.

2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
manager : ...uuid...



For the same query on a working node:
2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
RUNNING
2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row groups(0)
in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet




We are investigating what could cause that 30-second gap for that
fragment.

Any ideas, let us know.

Thanks
Francois





















On Mon, Sep 19, 2016 at 2:59 PM, François Méthot 
wrote:

> Hi Sudheesh,
>
>   If I add selection filter so that no row are returned, the same problem
> occur. I also simplified the query to include only few integer columns.
>
> That particular data repo is ~200+ Billions records spread over ~50 000
> parquet files.
>
> We have other CSV data repo that are 100x smaller that does not trigger
> this issue.
>
>
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
> There is also a bizarre case where the node that is reported as lost is the
> node itself.
> Yes, the stack trace is from the ticket, It did occurred once or twice (in
> the many many attempts) that it was the node itself.
>
> + Is there a spike in memory usage of the Drillbit this is the Foreman for
> the query (process memory, not just heap)?
> We don't notice any unusual spike, each nodes gets busy in the same range
> when query is running.
>
> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see any
> improvement.
>
>
> We will update from 1.7 to 1.8 before going ahead with more investigation.
>
> Thanks a lot.
>
>
>
>
>
>
>
>
>
>
> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam 
> wrote:
>
>> Hi Francois,
>>
>> A simple query with only projections is not an “ideal” use case, since
>> Drill is bound by how fast the client can consume records. There are 1000
>> scanners sending data to 1 client (vs far fewer scanners sending data in
>> the 12 node case).
>>
>> This might increase the load on the Drillbit that is the Foreman for the
>> query. In the query profile, the scanners should be spending a lot more
>> time “waiting” to send records to the client (via root fragment).
>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>> There is also a bizarre case where the node that is reported as lost is the
>> node itself.
>> + Is there a spike in memory usage of the Drillbit this is the Foreman
>> for the query (process memory, not just heap)?
>>
>> Regarding the warnings ...
>>
>> > 2016-09-19 13:31:56,866 [BitServer-7] WARN
>> > o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
>> type
>> > 6 took longer than 500 ms. Actual Duration was 16053ms.
>>
>>
>> RPC type 6 is a cancellation request; DRILL-4766 [2] should 

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread François Méthot
Hi Sudheesh,

  If I add a selection filter so that no rows are returned, the same problem
occurs. I also simplified the query to include only a few integer columns.

That particular data repo is ~200+ billion records spread over ~50 000
Parquet files.

We have other CSV data repos that are 100x smaller and do not trigger
this issue.


+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
There is also a bizarre case where the node that is reported as lost is the
node itself.
Yes, the stack trace is from the ticket. It did occur once or twice (in
the many, many attempts) that it was the node itself.

+ Is there a spike in memory usage of the Drillbit that is the Foreman for
the query (process memory, not just heap)?
We don't notice any unusual spike; each node gets busy in the same range
when the query is running.

I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see any
improvement.


We will update from 1.7 to 1.8 before going ahead with more investigation.

Thanks a lot.










On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> A simple query with only projections is not an “ideal” use case, since
> Drill is bound by how fast the client can consume records. There are 1000
> scanners sending data to 1 client (vs far fewer scanners sending data in
> the 12 node case).
>
> This might increase the load on the Drillbit that is the Foreman for the
> query. In the query profile, the scanners should be spending a lot more
> time “waiting” to send records to the client (via root fragment).
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
> There is also a bizarre case where the node that is reported as lost is the
> node itself.
> + Is there a spike in memory usage of the Drillbit this is the Foreman for
> the query (process memory, not just heap)?
>
> Regarding the warnings ...
>
> > 2016-09-19 13:31:56,866 [BitServer-7] WARN
> > o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> type
> > 6 took longer than 500 ms. Actual Duration was 16053ms.
>
>
> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this
> case, which is resolved in the latest version of Drill. So as Chun
> suggested, upgrade the cluster to the latest version of Drill.
>
> > 2016-09-19 14:15:33,357 [BitServer-4] WARN
> > o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc
> type
> > 1 took longer than 500 ms. Actual Duration was 981ms.
>
> I am surprised that responses are taking that long to handle.
> + Are both messages on the same Drillbit?
>
> The other warnings can be ignored.
>
> Thank you,
> Sudheesh
>
> [1] I just realized that atsqa4-133.qa.lab is in one of our test
> environments :)
> [2] https://issues.apache.org/jira/browse/DRILL-4766
>
> > On Sep 19, 2016, at 9:15 AM, François Méthot 
> wrote:
> >
> > Hi Sudheesh,
> >
> >
> > + Does the query involve any aggregations or filters? Or is this a select
> > query with only projections?
> > Simple query with only projections
> >
> > + Any suspicious timings in the query profile?
> > Nothing specially different than our working query on our small cluster.
> >
> > + Any suspicious warning messages in the logs around the time of failure
> on
> > any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> > (“..” are place holders):
> >  Message of mode .. of rpc type .. took longer than ..ms.  Actual
> duration
> > was ..ms.
> >
> > Well we do see this warning on the failing node (on my last test), I
> found
> > this WARNING in our log for the past month for pretty much every node I
> > checked.
> > 2016-09-19 13:31:56,866 [BitServer-7] WARN
> > o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> type
> > 6 took longer than 500 ms. Actual Duration was 16053ms.
> > 2016-09-19 14:15:33,357 [BitServer-4] WARN
> > o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc
> type
> > 1 took longer than 500 ms. Actual Duration was 981ms.
> >
> > We really appreciate your help. I will dig in the source code for when
> and
> > why this error happen.
> >
> >
> > Francois
> >
> > P.S.:
> > We do see this also:
> > 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in
> the
> > work bus.
> > 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in
> the
> > work bus.
> > 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in
> the
> > work bus.
> > 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in
> the
> > work bus.
> >
> >
> >
> >
> > On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam 
> > wrote:
> >
> >> Hi Francois,
> >>
> 

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread Sudheesh Katkam
One more interesting thing, and another guess at resolving the problem:

> P.S.:
> We do see this also:
> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in the
> work bus.

+ Are there 220 nodes in your cluster?

If that is the case and these fragments are scanners, they seem to be assigned
in a round-robin fashion (and not based on data locality?). If so, please
upgrade to the latest version, where DRILL-4446 [1] is resolved.
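
On the first question, a quick way to confirm how many Drillbits are actually
registered with ZooKeeper is to ask Drill itself. A minimal sketch, assuming the
sys.drillbits system table is available in your version:

  -- Lists the Drillbits currently registered in ZooKeeper; the row count
  -- should match the expected 220 nodes if registration is healthy.
  SELECT * FROM sys.drillbits;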

Thank you,
Sudheesh

[1] https://issues.apache.org/jira/browse/DRILL-4446 


> On Sep 19, 2016, at 10:19 AM, Sudheesh Katkam  wrote:
> 
> Hi Francois,
> 
> A simple query with only projections is not an “ideal” use case, since Drill 
> is bound by how fast the client can consume records. There are 1000 scanners 
> sending data to 1 client (vs far fewer scanners sending data in the 12 node 
> case).
> 
> This might increase the load on the Drillbit that is the Foreman for the 
> query. In the query profile, the scanners should be spending a lot more time 
> “waiting” to send records to the client (via root fragment).
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There 
> is also a bizarre case where the node that is reported as lost is the node 
> itself.
> + Is there a spike in memory usage of the Drillbit this is the Foreman for 
> the query (process memory, not just heap)?
> 
> Regarding the warnings ...
> 
>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
>> 6 took longer than 500 ms. Actual Duration was 16053ms.
> 
> 
> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this 
> case, which is resolved in the latest version of Drill. So as Chun suggested, 
> upgrade the cluster to the latest version of Drill.
> 
>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
>> 1 took longer than 500 ms. Actual Duration was 981ms.
> 
> I am surprised that responses are taking that long to handle.
> + Are both messages on the same Drillbit?
> 
> The other warnings can be ignored.
> 
> Thank you,
> Sudheesh
> 
> [1] I just realized that atsqa4-133.qa.lab is in one of our test environments 
> :)
> [2] https://issues.apache.org/jira/browse/DRILL-4766 
> 
> 
>> On Sep 19, 2016, at 9:15 AM, François Méthot  wrote:
>> 
>> Hi Sudheesh,
>> 
>> 
>> + Does the query involve any aggregations or filters? Or is this a select
>> query with only projections?
>> Simple query with only projections
>> 
>> + Any suspicious timings in the query profile?
>> Nothing specially different than our working query on our small cluster.
>> 
>> + Any suspicious warning messages in the logs around the time of failure on
>> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
>> (“..” are place holders):
>>  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
>> was ..ms.
>> 
>> Well we do see this warning on the failing node (on my last test), I found
>> this WARNING in our log for the past month for pretty much every node I
>> checked.
>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
>> 6 took longer than 500 ms. Actual Duration was 16053ms.
>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
>> 1 took longer than 500 ms. Actual Duration was 981ms.
>> 
>> We really appreciate your help. I will dig in the source code for when and
>> why this error happen.
>> 
>> 
>> Francois
>> 
>> P.S.:
>> We do see this also:
>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in the
>> work bus.
>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
>> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in the
>> work bus.
>> 
>> 
>> 
>> 
>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam 

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread Sudheesh Katkam
Hi Francois,

A simple query with only projections is not an “ideal” use case, since Drill is 
bound by how fast the client can consume records. There are 1000 scanners 
sending data to 1 client (vs far fewer scanners sending data in the 12 node 
case).

This might increase the load on the Drillbit that is the Foreman for the query. 
In the query profile, the scanners should be spending a lot more time “waiting” 
to send records to the client (via root fragment).
+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There 
is also a bizarre case where the node that is reported as lost is the node 
itself.
+ Is there a spike in memory usage of the Drillbit that is the Foreman for the 
query (process memory, not just heap)?

Regarding the warnings ...

> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
> 6 took longer than 500 ms. Actual Duration was 16053ms.


RPC type 6 is a cancellation request; DRILL-4766 [2], which is resolved in the 
latest version of Drill, should help in this case. So, as Chun suggested, upgrade 
the cluster to the latest version of Drill.

> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
> 1 took longer than 500 ms. Actual Duration was 981ms.

I am surprised that responses are taking that long to handle.
+ Are both messages on the same Drillbit?

The other warnings can be ignored.

Thank you,
Sudheesh

[1] I just realized that atsqa4-133.qa.lab is in one of our test environments :)
[2] https://issues.apache.org/jira/browse/DRILL-4766 


> On Sep 19, 2016, at 9:15 AM, François Méthot  wrote:
> 
> Hi Sudheesh,
> 
> 
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> Simple query with only projections
> 
> + Any suspicious timings in the query profile?
> Nothing specially different than our working query on our small cluster.
> 
> + Any suspicious warning messages in the logs around the time of failure on
> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> (“..” are place holders):
>  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
> was ..ms.
> 
> Well we do see this warning on the failing node (on my last test), I found
> this WARNING in our log for the past month for pretty much every node I
> checked.
> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
> 6 took longer than 500 ms. Actual Duration was 16053ms.
> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
> 1 took longer than 500 ms. Actual Duration was 981ms.
> 
> We really appreciate your help. I will dig in the source code for when and
> why this error happen.
> 
> 
> Francois
> 
> P.S.:
> We do see this also:
> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in the
> work bus.
> 
> 
> 
> 
> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam 
> wrote:
> 
>> Hi Francois,
>> 
>> More questions..
>> 
>>> + Can you share the query profile?
>>>  I will sum it up:
>>> It is a select on 18 columns: 9 string, 9 integers.
>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>> Fragments are spread accross 215 nodes.
>> 
>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>> 
>> + Does the query involve any aggregations or filters? Or is this a select
>> query with only projections?
>> + Any suspicious timings in the query profile?
>> + Any suspicious warning messages in the logs around the time of failure
>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
>> (“..” are place holders):
>>  Message of mode .. of rpc type .. took longer than ..ms.  Actual
>> duration was ..ms.
>> 
>> Thank you,
>> Sudheesh
>> 
>>> On Sep 15, 2016, at 11:27 AM, François Méthot 
>> wrote:
>>> 
>>> Hi Sudheesh,
>>> 
>>> + How many zookeeper servers in the quorum?
>>> The quorum has 3 servers, everything looks healthy
>>> 
>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>> applications running on that node? How many threads is the Drill process
>>> using?
>>> The load on the failing node(8 cores) is 14, when Drill is running. Which
>>> is nothing out of the ordinary according to our 

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread François Méthot
Hi Sudheesh,


+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?
Simple query with only projections

+ Any suspicious timings in the query profile?
> Nothing especially different from our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one
(“..” are placeholders):
  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
was ..ms.

We do see this warning on the failing node (on my last test); I found
this WARNING in our logs over the past month on pretty much every node I
checked.
2016-09-19 13:31:56,866 [BitServer-7] WARN
o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
6 took longer than 500 ms. Actual Duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN
o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
1 took longer than 500 ms. Actual Duration was 981ms.

We really appreciate your help. I will dig into the source code to find out
when and why this error happens.


Francois

P.S.:
We do see this also:
2016-09-19 14:48:23,444 [drill-executor-9] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in the
work bus.




On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> More questions..
>
> > + Can you share the query profile?
> >   I will sum it up:
> >  It is a select on 18 columns: 9 string, 9 integers.
> >  Scan is done on 13862 parquet files spread  on 1000 fragments.
> >  Fragments are spread accross 215 nodes.
>
> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> + Any suspicious timings in the query profile?
> + Any suspicious warning messages in the logs around the time of failure
> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> (“..” are place holders):
>   Message of mode .. of rpc type .. took longer than ..ms.  Actual
> duration was ..ms.
>
> Thank you,
> Sudheesh
>
> > On Sep 15, 2016, at 11:27 AM, François Méthot 
> wrote:
> >
> > Hi Sudheesh,
> >
> > + How many zookeeper servers in the quorum?
> > The quorum has 3 servers, everything looks healthy
> >
> > + What is the load on atsqa4-133.qa.lab when this happens? Any other
> > applications running on that node? How many threads is the Drill process
> > using?
> > The load on the failing node(8 cores) is 14, when Drill is running. Which
> > is nothing out of the ordinary according to our admin.
> > HBase is also running.
> > planner.width.max_per_node is set to 8
> >
> > + When running the same query on 12 nodes, is the data size same?
> > Yes
> >
> > + Can you share the query profile?
> >   I will sum it up:
> >  It is a select on 18 columns: 9 string, 9 integers.
> >  Scan is done on 13862 parquet files spread  on 1000 fragments.
> >  Fragments are spread accross 215 nodes.
> >
> >
> > We are in process of increasing our Zookeeper session timeout config to
> see
> > if it helps.
> >
> > thanks
> >
> > Francois
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam 
> > wrote:
> >
> >> Hi Francois,
> >>
> >> Few questions:
> >> + How many zookeeper servers in the quorum?
> >> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> >> applications running on that node? How many threads is the Drill process
> >> using?
> >> + When running the same query on 12 nodes, is the data size same?
> >> + Can you share the query profile?
> >>
> >> This may not be the right thing to do, but for now, If the cluster is
> >> heavily loaded, increase the zk timeout.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >>> On Sep 14, 2016, at 11:53 AM, François Méthot 
> >> wrote:
> >>>
> >>> We are running 1.7.
> >>> The log were taken from the jira tickets.
> >>>
> >>> We will try out 1.8 soon.
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang 
> wrote:
> >>>
>  Looks like you are running 1.5. I believe there are some work done in
> >> that
>  area and the newer release should behave better.
> 
>  On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <
> fmetho...@gmail.com>
>  wrote:
> 
> > Hi,
> >
> > We are trying to find a solution/workaround to issue:
> >
> > 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] 

Re: ZK lost connectivity issue on large cluster

2016-09-18 Thread Sudheesh Katkam
Hi Francois,

More questions..

> + Can you share the query profile?
>   I will sum it up:
>  It is a select on 18 columns: 9 string, 9 integers.
>  Scan is done on 13862 parquet files spread  on 1000 fragments.
>  Fragments are spread accross 215 nodes.

So ~5 leaf fragments (or scanners) per Drillbit seems fine.
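
That per-node fan-out is bounded by planner.width.max_per_node, reported as 8 
elsewhere in this thread. While diagnosing, it can be inspected and lowered for a 
single session rather than system-wide; a minimal sketch, assuming Drill's 
standard option syntax and the sys.options table, with 4 as an arbitrary example 
value:

  -- Check the current setting, then lower it for this session only while
  -- investigating; reset it or close the session afterwards.
  SELECT name, num_val FROM sys.options WHERE name = 'planner.width.max_per_node';
  ALTER SESSION SET `planner.width.max_per_node` = 4;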

+ Does the query involve any aggregations or filters? Or is this a select query 
with only projections?
+ Any suspicious timings in the query profile?
+ Any suspicious warning messages in the logs around the time of failure on any 
of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one (“..” are 
placeholders):
  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration was 
..ms.

Thank you,
Sudheesh

> On Sep 15, 2016, at 11:27 AM, François Méthot  wrote:
> 
> Hi Sudheesh,
> 
> + How many zookeeper servers in the quorum?
> The quorum has 3 servers, everything looks healthy
> 
> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> applications running on that node? How many threads is the Drill process
> using?
> The load on the failing node(8 cores) is 14, when Drill is running. Which
> is nothing out of the ordinary according to our admin.
> HBase is also running.
> planner.width.max_per_node is set to 8
> 
> + When running the same query on 12 nodes, is the data size same?
> Yes
> 
> + Can you share the query profile?
>   I will sum it up:
>  It is a select on 18 columns: 9 string, 9 integers.
>  Scan is done on 13862 parquet files spread  on 1000 fragments.
>  Fragments are spread accross 215 nodes.
> 
> 
> We are in process of increasing our Zookeeper session timeout config to see
> if it helps.
> 
> thanks
> 
> Francois
> 
> 
> 
> 
> 
> 
> 
> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam 
> wrote:
> 
>> Hi Francois,
>> 
>> Few questions:
>> + How many zookeeper servers in the quorum?
>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>> applications running on that node? How many threads is the Drill process
>> using?
>> + When running the same query on 12 nodes, is the data size same?
>> + Can you share the query profile?
>> 
>> This may not be the right thing to do, but for now, If the cluster is
>> heavily loaded, increase the zk timeout.
>> 
>> Thank you,
>> Sudheesh
>> 
>>> On Sep 14, 2016, at 11:53 AM, François Méthot 
>> wrote:
>>> 
>>> We are running 1.7.
>>> The log were taken from the jira tickets.
>>> 
>>> We will try out 1.8 soon.
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang  wrote:
>>> 
 Looks like you are running 1.5. I believe there are some work done in
>> that
 area and the newer release should behave better.
 
 On Wed, Sep 14, 2016 at 11:43 AM, François Méthot 
 wrote:
 
> Hi,
> 
> We are trying to find a solution/workaround to issue:
> 
> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> One more more nodes lost connectivity during query.  Identified nodes
> were [atsqa4-133.qa.lab:31010].
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> ForemanException: One more more nodes lost connectivity during query.
> Identified nodes were [atsqa4-133.qa.lab:31010].
>   at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
> close(Foreman.java:746)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> processEvent(Foreman.java:858)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> processEvent(Foreman.java:790)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> moveToState(Foreman.java:792)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.moveToState(
> Foreman.java:909)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.access$2700(
> Foreman.java:110)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman$StateListener.
> moveToState(Foreman.java:1183)
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> 
> 
> DRILL-4325  
> ForemanException:
> One or more nodes lost connectivity during query
> 
> 
> 
> Any one experienced this issue ?
> 
> It happens when running query involving many parquet files on a cluster
 of
> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
> 
> It is not caused by garbage collection, (checked 

Re: ZK lost connectivity issue on large cluster

2016-09-14 Thread Sudheesh Katkam
Hi Francois,

Few questions:
+ How many zookeeper servers in the quorum?
+ What is the load on atsqa4-133.qa.lab when this happens? Any other 
applications running on that node? How many threads is the Drill process using?
+ When running the same query on 12 nodes, is the data size same?
+ Can you share the query profile?

This may not be the right thing to do, but for now, if the cluster is heavily 
loaded, increase the ZooKeeper session timeout.
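
The timeout is a boot-time setting (drill.exec.zk.timeout in drill-override.conf, 
if I have the key right), so it takes a Drillbit restart rather than an ALTER 
SYSTEM. A minimal sketch for confirming what the Drillbits are actually running 
with, assuming the sys.boot system table is present in your version:

  -- Show the ZooKeeper-related boot options in effect (e.g. the session
  -- timeout, in milliseconds). Column layout may vary between releases.
  SELECT * FROM sys.boot WHERE name LIKE '%zk%';

Whatever the client asks for is also capped by the quorum's negotiated maximum 
session timeout, reported as 40 seconds elsewhere in this thread.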

Thank you,
Sudheesh

> On Sep 14, 2016, at 11:53 AM, François Méthot  wrote:
> 
> We are running 1.7.
> The log were taken from the jira tickets.
> 
> We will try out 1.8 soon.
> 
> 
> 
> 
> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang  wrote:
> 
>> Looks like you are running 1.5. I believe there are some work done in that
>> area and the newer release should behave better.
>> 
>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot 
>> wrote:
>> 
>>> Hi,
>>> 
>>>  We are trying to find a solution/workaround to issue:
>>> 
>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
>>> One more more nodes lost connectivity during query.  Identified nodes
>>> were [atsqa4-133.qa.lab:31010].
>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>> ForemanException: One more more nodes lost connectivity during query.
>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
>>> close(Foreman.java:746)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>> processEvent(Foreman.java:858)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>> processEvent(Foreman.java:790)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>> moveToState(Foreman.java:792)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman.moveToState(
>>> Foreman.java:909)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman.access$2700(
>>> Foreman.java:110)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>at org.apache.drill.exec.work.foreman.Foreman$StateListener.
>>> moveToState(Foreman.java:1183)
>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>> 
>>> 
>>> DRILL-4325  
>>> ForemanException:
>>> One or more nodes lost connectivity during query
>>> 
>>> 
>>> 
>>> Any one experienced this issue ?
>>> 
>>> It happens when running query involving many parquet files on a cluster
>> of
>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
>>> 
>>> It is not caused by garbage collection, (checked on both ZK node and the
>>> involved drill bit).
>>> 
>>> Negotiated max session timeout is 40 seconds.
>>> 
>>> The sequence seems:
>>> - Drill Query begins, using an existing ZK session.
>>> - Drill Zk session timeouts
>>>  - perhaps it was writing something that took too long
>>> - Drill attempts to renew session
>>>   - drill believes that the write operation failed, so it attempts
>> to
>>> re-create the zk node, which trigger another exception.
>>> 
>>> We are open to any suggestion. We will report any finding.
>>> 
>>> Thanks
>>> Francois
>>> 
>>