Thanks for the points, Jacques and Yuliya. The list Jacques summarized should be a good guideline for dealing with the variety of ZooKeeper disconnection problems.
Regarding the initial target (5b), I would like to propose the following: after a session timeout, the drillbit should return to the state in which it seeks its initial connection to ZooKeeper. Meanwhile, the queries for which this drillbit is the foreman should be failed. Instead of setting the timeout at the global level, we should add another parameter in drill-module.conf to control this value.

On Sun, Nov 8, 2015 at 12:39 PM, yuliya Feldman <[email protected]> wrote:

> Did not notice your reply :)
> Yes - I agree with Jacques - we should consider the variety of scenarios
> here.
> Thanks,
> Yuliya
>
> From: Jacques Nadeau <[email protected]>
> To: dev <[email protected]>
> Sent: Sunday, November 8, 2015 11:56 AM
> Subject: Re: Zookeeper down before query starts/after query finishes
>
> I think we need to talk through a couple of different scenarios and decide
> on Drill behavior in each.
>
> Client Based
> 1) Initial connection to ZK from client fails
> 2) Client loses ZK Connection
>    a) Reconnects within session timeout
>    b) Cannot reconnect within session timeout (loses session)
> 3) ZK Connection gets reconnected with new session (2b)
>
> Drillbit Based
> 4) Drillbit initial connection fails to complete
> 5) Drillbit loses connection
>    a) reconnects within session timeout
>    b) cannot reconnect within session timeout (loses session)
> 6) Drillbit reestablishes connection after timeout (5b)
>
> It seems like your initial proposal is entirely focused on item (5b) in the
> list above. However, the code change affects all items 1-6. I think it
> would be worthwhile to come up with clear definition of desired behavior
> for all items 1-6. I also think the behavior in 2b should probably be very
> different than in 5b.
>
> Note, I'm not suggesting that this initial fix needs to resolve all items
> to the desired behavior. However, it is hard to review the patch without
> measuring against what our target is across the items. My hope out of this
> is a clear framework to review the patch as well as a number of jiras to
> resolve issues across each of these issues where there are gaps.
>
> thanks!
> jacques
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sun, Nov 8, 2015 at 9:36 AM, Hsuan Yi Chu <[email protected]> wrote:
>
> > I just submitted a pull request to address DRILL-3751, which focuses on
> > the scenario where query already finishes and zookeeper dies. So Foreman
> > cannot delete the profiles of running queries in zookeeper.
> >
> > I think in this case, after a few retries, Foreman can assume Zookeeper
> > is down. And, this query is assumed to fail since client might not be
> > able to receive the result (see the behavior in DRILL-3751
> > <https://issues.apache.org/jira/browse/DRILL-3751>).
> >
> > Does this make sense?
> >
> > On Fri, Nov 6, 2015 at 10:43 AM, Hsuan Yi Chu <[email protected]> wrote:
> >
> > > My understanding is :
> > > Before query starts/After query finishes, Foreman will put/delete
> > > running query profiles in zookeeper.
> > >
> > > However, if zookeeper is down before the put/delete is successful,
> > > Drill would be blocked at the put/delete operation.
> > >
> > > See https://issues.apache.org/jira/browse/DRILL-3751
> > >
> > > I think it is not quite right to let Drill just wait for Zookeeper to
> > > respond. Does it make sense to use "time-out" here?
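
To make the (5b) proposal at the top of this message more concrete, here is a rough Java sketch of the behavior I have in mind. It is not Drill code: the class name, the state names, and the sessionTimeoutMs value (standing in for the proposed new drill-module.conf parameter) are all hypothetical, and the clock is passed in explicitly so the logic can be followed without a real ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names exist in Drill.
class DrillbitZkSessionSketch {

    // Illustrative states for the drillbit's view of its ZooKeeper connection.
    enum State { SEEKING_INITIAL_CONNECTION, CONNECTED, DISCONNECTED }

    private State state = State.SEEKING_INITIAL_CONNECTION;
    // Assumed to be read from a new drill-module.conf entry (name not decided).
    private final long sessionTimeoutMs;
    private long disconnectedAtMs = -1;
    // Queries for which this drillbit acts as foreman.
    private final List<String> runningQueries = new ArrayList<>();
    private final List<String> failedQueries = new ArrayList<>();

    DrillbitZkSessionSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Connection established, or re-established within the timeout (case 5a).
    void onConnected() {
        state = State.CONNECTED;
        disconnectedAtMs = -1;
    }

    // Connection lost; start counting toward the session timeout.
    void onDisconnected(long nowMs) {
        if (state == State.CONNECTED) {
            state = State.DISCONNECTED;
            disconnectedAtMs = nowMs;
        }
    }

    void startQuery(String queryId) {
        runningQueries.add(queryId);
    }

    // Called periodically. Case 5b: once the timeout has elapsed, fail the
    // foreman's queries and go back to seeking an initial connection.
    void tick(long nowMs) {
        if (state == State.DISCONNECTED
                && nowMs - disconnectedAtMs >= sessionTimeoutMs) {
            failedQueries.addAll(runningQueries);
            runningQueries.clear();
            state = State.SEEKING_INITIAL_CONNECTION;
        }
    }

    State state() { return state; }
    List<String> runningQueries() { return runningQueries; }
    List<String> failedQueries() { return failedQueries; }
}
```

With a 5000 ms timeout, a disconnect followed by a reconnect 2000 ms later leaves running queries untouched (5a), while a disconnect that lasts the full 5000 ms fails them and puts the drillbit back in the initial-connection state (5b). What items 1-4 and 6 should do is left out on purpose, since that is still under discussion.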
