Thanks for the points, Jacques and Yuliya.

The list Jacques summarized should be a good guideline for dealing with the
variety of ZooKeeper disconnection problems.

Regarding the initial target (5b), I would like to propose that, after the
session times out, the drillbit return to the state in which it seeks an
initial connection to ZooKeeper, and that the queries for which this drillbit
is the foreman be failed in the meantime. Instead of setting the timeout at
the global level, we should add a separate parameter in drill-module.conf to
control this value.
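
To make the proposed (5b) behavior concrete, here is a rough Java sketch of
the state transitions. It is only an illustration under my own assumptions:
the class, state, and method names are made up and are not taken from the
Drill code base, and the timeout passed to the constructor stands in for the
new drill-module.conf parameter, whose name is not decided here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in Drill. */
final class DrillbitZkStateSketch {

    /** Connection states assumed for this sketch. */
    enum State { SEEKING_INITIAL_CONNECTION, CONNECTED, SUSPENDED }

    // Assumed to be read from a (not yet named) drill-module.conf parameter.
    private final long sessionTimeoutMs;
    // Queries for which this drillbit acts as foreman.
    private final Set<String> foremanQueries = new LinkedHashSet<>();
    private final List<String> failedQueries = new ArrayList<>();
    private State state = State.SEEKING_INITIAL_CONNECTION;
    private long disconnectedAtMs = -1;

    DrillbitZkStateSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Initial connection, or a reconnect within the session timeout (5a). */
    void onConnected() {
        state = State.CONNECTED;
        disconnectedAtMs = -1;
    }

    void registerQuery(String queryId) {
        foremanQueries.add(queryId);
    }

    /** The connection dropped; start counting towards the session timeout. */
    void onDisconnected(long nowMs) {
        if (state == State.CONNECTED) {
            state = State.SUSPENDED;
            disconnectedAtMs = nowMs;
        }
    }

    /** Periodic check; the caller supplies the current time in milliseconds. */
    void onTick(long nowMs) {
        boolean sessionLost = state == State.SUSPENDED
                && nowMs - disconnectedAtMs >= sessionTimeoutMs;
        if (sessionLost) {
            // Case 5b: fail the queries this foreman owns, then go back to
            // the state in which the drillbit seeks an initial connection.
            failedQueries.addAll(foremanQueries);
            foremanQueries.clear();
            state = State.SEEKING_INITIAL_CONNECTION;
            disconnectedAtMs = -1;
        }
    }

    State state() {
        return state;
    }

    List<String> failedQueries() {
        return new ArrayList<>(failedQueries);
    }
}
```

The time is passed in explicitly so that the transitions are easy to test. A
reconnect before the timeout (5a) leaves the running queries untouched; only
the timeout path (5b) fails them and resets the drillbit.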

On Sun, Nov 8, 2015 at 12:39 PM, yuliya Feldman <[email protected]> wrote:

> Did not notice your reply :)
> Yes - I agree with Jacques - we should consider a variety of scenarios
> here.
> Thanks, Yuliya
>       From: Jacques Nadeau <[email protected]>
>  To: dev <[email protected]>
>  Sent: Sunday, November 8, 2015 11:56 AM
>  Subject: Re: Zookeeper down before query starts/after query finishes
>
> I think we need to talk through a couple of different scenarios and decide
> on Drill behavior in each.
>
> Client Based
> 1) Initial connection to ZK from client fails
> 2) Client loses ZK Connection
>   a) Reconnects within session timeout
>   b) Cannot reconnect within session timeout (loses session)
> 3) ZK Connection gets reconnected with new session (2b)
>
> Drillbit Based
> 4) Drillbit initial connection fails to complete
> 5) Drillbit loses connection
>   a) reconnects within session timeout
>   b) cannot reconnect within session timeout (loses session)
> 6) Drillbit reestablishes connection after timeout (5b)
>
> It seems like your initial proposal is entirely focused on item (5b) in the
> list above. However, the code change affects all items 1-6. I think it
> would be worthwhile to come up with a clear definition of desired behavior
> for all items 1-6. I also think the behavior in 2b should probably be very
> different than in 5b.
>
> Note, I'm not suggesting that this initial fix needs to resolve all items
> to the desired behavior. However, it is hard to review the patch without
> measuring against what our target is across the items. My hope out of this
> is a clear framework to review the patch as well as a number of jiras to
> resolve the gaps across each of these items.
>
> thanks!
> jacques
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
>
> On Sun, Nov 8, 2015 at 9:36 AM, Hsuan Yi Chu <[email protected]> wrote:
>
> > I just submitted a pull request to address DRILL-3751, which focuses on
> > the scenario where the query has already finished and ZooKeeper dies, so
> > Foreman cannot delete the profiles of running queries in ZooKeeper.
> >
> > I think in this case, after a few retries, Foreman can assume ZooKeeper
> > is down, and the query is assumed to have failed since the client might
> > not be able to receive the result (see the behavior in DRILL-3751
> > <https://issues.apache.org/jira/browse/DRILL-3751>).
> >
> > Does this make sense?
> >
> >
> > On Fri, Nov 6, 2015 at 10:43 AM, Hsuan Yi Chu <[email protected]> wrote:
> >
> > > My understanding is:
> > > Before a query starts and after it finishes, Foreman puts/deletes the
> > > running query profile in ZooKeeper.
> > >
> > > However, if ZooKeeper goes down before the put/delete succeeds, Drill
> > > is blocked at the put/delete operation.
> > >
> > > See https://issues.apache.org/jira/browse/DRILL-3751
> > >
> > > I think it is not quite right to let Drill just wait for Zookeeper to
> > > respond. Does it make sense to use "time-out" here?
> > >
> > >
> > >
> >
>
>
>
>
