In reality, if you cannot connect to ZK (and ConnectionLoss is a client-side
error), it means either a network problem on the client node itself or a
problem with the ZK quorum. In those situations, unless you (eventually)
receive "Session Expiration" or "Connection reestablished", you don't know what
is going on. It would probably be prudent to time out if, after ConnectionLoss,
you have heard nothing back from the ZK server for longer than the ZK client
timeout (30 sec. by default, I think).
And again, it will depend on the client: in your example it is a good idea to
fail; in some other cases it may be a good idea to wait (e.g. if you are
dealing with non-idempotent operations).
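
A minimal sketch of the timeout idea above, in plain Java with no ZooKeeper
dependency. The class name ConnectionLossWatchdog, its methods, and the grace
period are illustrative assumptions, not Drill or ZooKeeper APIs; the caller
would feed it from whatever connection-state callback its client library
provides.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: decides when to stop waiting after a connection loss. */
final class ConnectionLossWatchdog {
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    private final long graceNanos;
    private State state = State.CONNECTED;
    private long lostAtNanos;

    ConnectionLossWatchdog(long grace, TimeUnit unit) {
        this.graceNanos = unit.toNanos(grace);
    }

    /** Call when the client reports ConnectionLoss; remembers the first loss only. */
    synchronized void onConnectionLoss(long nowNanos) {
        if (state == State.CONNECTED) {
            state = State.DISCONNECTED;
            lostAtNanos = nowNanos;
        }
    }

    /** Call when the client reports that the connection is back. */
    synchronized void onReconnected() {
        state = State.CONNECTED;
    }

    /** Call when the client reports session expiration. */
    synchronized void onSessionExpired() {
        state = State.EXPIRED;
    }

    /** True if the session expired or we have been disconnected past the grace period. */
    synchronized boolean shouldGiveUp(long nowNanos) {
        switch (state) {
            case EXPIRED:
                return true;
            case DISCONNECTED:
                return nowNanos - lostAtNanos > graceNanos;
            default:
                return false;
        }
    }
}
```

A caller would pass System.nanoTime() as the clock and poll shouldGiveUp()
while blocked. Whether to fail or keep waiting once it returns true is still
the client's decision, as noted above.
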
From: Hsuan Yi Chu <[email protected]>
To: [email protected]
Sent: Sunday, November 8, 2015 9:36 AM
Subject: Re: Zookeeper down before query starts/after query finishes
I just submitted a pull request to address DRILL-3751, which focuses on the
scenario where the query has already finished and ZooKeeper dies, so the
Foreman cannot delete the profiles of running queries in ZooKeeper.
I think in this case, after a few retries, the Foreman can assume ZooKeeper is
down, and the query is considered failed, since the client might not be able
to receive the result (see the behavior in DRILL-3751
<https://issues.apache.org/jira/browse/DRILL-3751>).
Does this make sense?
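
To make the "few retries, then assume ZooKeeper is down" idea concrete, here is
a rough sketch using only java.util.concurrent. BoundedRetry and tryWithTimeout
are hypothetical names, not what the pull request does; the operation is
whatever put/delete call would otherwise block.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounded retries, each attempt limited by a timeout. */
final class BoundedRetry {
    /**
     * Runs op up to maxAttempts times. Returns true as soon as one attempt
     * completes, false if every attempt fails or times out.
     */
    static boolean tryWithTimeout(Callable<Void> op, int maxAttempts,
                                  long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-retry");
            t.setDaemon(true); // a stuck attempt must not keep the JVM alive
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> future = executor.submit(op);
                try {
                    future.get(timeout, unit);
                    return true;
                } catch (TimeoutException e) {
                    // Interrupt the attempt; this only helps if op honors interrupts.
                    future.cancel(true);
                } catch (ExecutionException e) {
                    // The attempt itself failed; fall through and retry.
                } catch (InterruptedException e) {
                    future.cancel(true);
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return false; // caller may now treat the store as unavailable
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Retrying like this is only safe when the operation is idempotent, such as
deleting a profile that may already be gone.
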
On Fri, Nov 6, 2015 at 10:43 AM, Hsuan Yi Chu <[email protected]> wrote:
> My understanding is:
> Before a query starts and after it finishes, the Foreman puts/deletes the
> running query profile in ZooKeeper.
>
> However, if ZooKeeper goes down before the put/delete succeeds, Drill
> blocks on the put/delete operation.
>
> See https://issues.apache.org/jira/browse/DRILL-3751
>
> I think it is not quite right to let Drill just wait for ZooKeeper to
> respond. Does it make sense to use a timeout here?