I think we need to talk through a couple of different scenarios and decide on Drill behavior in each.
Client Based
1) Initial connection to ZK from client fails
2) Client loses ZK connection
   a) Reconnects within session timeout
   b) Cannot reconnect within session timeout (loses session)
3) ZK connection gets reconnected with new session (2b)

Drillbit Based
4) Drillbit initial connection fails to complete
5) Drillbit loses connection
   a) Reconnects within session timeout
   b) Cannot reconnect within session timeout (loses session)
6) Drillbit reestablishes connection after timeout (5b)

It seems like your initial proposal is entirely focused on item (5b) in the list above. However, the code change affects all items 1-6. I think it would be worthwhile to come up with a clear definition of the desired behavior for all items 1-6. I also think the behavior in 2b should probably be very different from that in 5b.

Note, I'm not suggesting that this initial fix needs to bring all items to the desired behavior. However, it is hard to review the patch without measuring it against what our target is across the items. My hope out of this is a clear framework to review the patch, as well as a number of JIRAs to close the gaps wherever the current behavior for an item falls short.

thanks!
jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Nov 8, 2015 at 9:36 AM, Hsuan Yi Chu <[email protected]> wrote:

> I just submitted a pull request to address DRILL-3751, which focuses on the
> scenario where query already finishes and zookeeper dies. So Foreman cannot
> delete the profiles of running queries in zookeeper.
>
> I think in this case, after a few retries, Foreman can assume Zookeeper is
> down. And, this query is assumed to fail since client might not be able to
> receive the result (see the behavior in DRILL-3751
> <https://issues.apache.org/jira/browse/DRILL-3751>).
>
> Does this make sense?
>
>
> On Fri, Nov 6, 2015 at 10:43 AM, Hsuan Yi Chu <[email protected]> wrote:
>
> > My understanding is :
> > Before query starts/After query finishes, Foreman will put/delete running
> > query profiles in zookeeper.
> >
> > However, if zookeeper is down before the put/delete is successful, Drill
> > would be blocked at the put/delete operation.
> >
> > See https://issues.apache.org/jira/browse/DRILL-3751
> >
> > I think it is not quite right to let Drill just wait for Zookeeper to
> > respond. Does it make sense to use "time-out" here?
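
For reference, the "time-out plus a few retries" idea discussed in this thread could look roughly like the sketch below. It is only an illustration using the Java standard library: the class name BoundedRetry, the method runWithTimeout, and its parameters are hypothetical and are not taken from Drill, the DRILL-3751 patch, or any ZooKeeper client API. The blocking put/delete is represented by an arbitrary Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Drill: bounds a blocking operation
// (for example a profile put/delete) with a per-attempt timeout and a retry cap.
class BoundedRetry {

    /**
     * Runs op up to maxAttempts times, waiting at most timeoutMs per attempt.
     * Returns true as soon as one attempt completes normally, and false if
     * every attempt times out or throws, or if the caller is interrupted.
     */
    static boolean runWithTimeout(Callable<Void> op, long timeoutMs, int maxAttempts) {
        // Daemon threads so a stuck attempt cannot keep the JVM alive.
        ExecutorService executor = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "bounded-retry");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> future = executor.submit(op);
                try {
                    future.get(timeoutMs, TimeUnit.MILLISECONDS);
                    return true; // the operation finished within the timeout
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the stuck attempt, then retry
                } catch (ExecutionException e) {
                    // the operation itself failed; fall through and retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
            return false; // caller may now treat the backing service as down
        } finally {
            executor.shutdownNow();
        }
    }
}
```

What the caller does when this returns false is exactly the per-scenario policy question raised above (for example, 2b versus 5b); the sketch deliberately leaves that decision out.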
