I doubt that we do periodic cluster discovery. The current model seems to
rely on a static set of nodes that is queried once at bootstrap. Also, the
"graceful" behavior described in the report basically tells us that we are
not fault-tolerant. Dynamic cluster discovery is fairly easy to implement
via watchers or the like. Making Drill fault-tolerant, however, needs
serious discussion, and it matters a great deal for long-running queries.
As we mature, I hope to see more discussion around these issues, because
failures do happen.
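
To make the watcher idea concrete, here is a minimal Java sketch. It is not
Drill's code and it does not use the ZooKeeper API: Registry, ClusterView,
and the node names are hypothetical, with Registry acting as an in-memory
stand-in for the coordination service. The point is only that each node's
view of the cluster is refreshed by a callback on every membership change,
so a node that drops out and later re-registers shows up again without a
restart.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

public class MembershipSketch {

    /** Hypothetical in-memory stand-in for a coordination service. */
    static class Registry {
        private final Set<String> live = new TreeSet<>();
        private final List<Consumer<Set<String>>> watchers = new ArrayList<>();

        /** Registers a watcher and immediately hands it the current node set. */
        synchronized void watch(Consumer<Set<String>> watcher) {
            watchers.add(watcher);
            watcher.accept(snapshot());
        }

        /** Adds a node; watchers are notified only if the set changed. */
        synchronized void register(String node) {
            if (live.add(node)) {
                notifyWatchers();
            }
        }

        /** Removes a node; watchers are notified only if the set changed. */
        synchronized void unregister(String node) {
            if (live.remove(node)) {
                notifyWatchers();
            }
        }

        private Set<String> snapshot() {
            return Collections.unmodifiableSet(new TreeSet<>(live));
        }

        private void notifyWatchers() {
            Set<String> current = snapshot();
            for (Consumer<Set<String>> watcher : watchers) {
                watcher.accept(current);
            }
        }
    }

    /** A node's view of the cluster, refreshed by the watcher rather than read once. */
    static class ClusterView {
        private volatile Set<String> endpoints = Collections.emptySet();

        ClusterView(Registry registry) {
            registry.watch(current -> endpoints = current);
        }

        Set<String> endpoints() {
            return endpoints;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("node-1");
        registry.register("node-2");

        ClusterView view = new ClusterView(registry);
        System.out.println(view.endpoints()); // [node-1, node-2]

        registry.unregister("node-2");        // node-2 loses its session
        System.out.println(view.endpoints()); // [node-1]

        registry.register("node-2");          // node-2 reconnects and re-registers
        System.out.println(view.endpoints()); // [node-1, node-2]
    }
}
```

In a real deployment the re-registration step is the hard part: the node
itself has to notice that its session was lost and register again, which
this sketch simply assumes happens.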

Regards.
-Hanifi

On Tue, Mar 24, 2015 at 5:22 PM, Ramana Inukonda Nagaraj (JIRA) <
[email protected]> wrote:

> Ramana Inukonda Nagaraj created DRILL-2550:
> ----------------------------------------------
>
>              Summary: Drillbit disconnect from ZK results in drillbit
> being lost until restart
>                  Key: DRILL-2550
>                  URL: https://issues.apache.org/jira/browse/DRILL-2550
>              Project: Apache Drill
>           Issue Type: Bug
>           Components: Execution - Flow
>             Reporter: Ramana Inukonda Nagaraj
>             Assignee: Chris Westin
>             Priority: Minor
>
>
> Not quite sure if this is an issue or even if it's important; maybe someone
> can think of a situation where this might be a bigger issue.
>
> Steps taken to recreate:
> 1. Start up drillbits on multiple nodes. (They all come up and form an
> 8-node cluster.)
> 2. Start executing a long-running query.
> 3. Use tcpkill to kill all connections between one node and ZooKeeper port
> 5181.
> Drill seems to behave very gracefully here - I see a nice error message
> saying Query failed: ForemanException: One more more nodes lost
> connectivity during query. Identified node was atsqa6c61.qa.lab
>
> However, once I start allowing connections back the node is not brought
> back as part of the cluster until a drillbit restart.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
