If we're going to go in that direction, it'd be nice to do it in a way that
allows not only for a node to rejoin the cluster, but also for new nodes to
be added while the cluster is already running. The two cases seem like they
shouldn't be very different.

That's still a ways off from recovering lost work done for queries with
fragments running on a node that goes down, though.
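
To make the rejoin/add case a bit more concrete, here is a rough sketch in
plain Java. It is not Drill's actual code, and MembershipTracker and its
Listener interface are made-up names. The assumption is that some watcher
(ZooKeeper or otherwise) hands us the full list of registered nodes every time
membership changes; the tracker diffs that against what it saw last time, so a
node that comes back and a node that is brand new both go through the same
"joined" path.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Drill): tracks live nodes across snapshots. */
class MembershipTracker {
    /** Illustrative callback interface for membership changes. */
    interface Listener {
        void nodeJoined(String node);
        void nodeLeft(String node);
    }

    private final Set<String> live = new HashSet<>();
    private final List<Listener> listeners = new ArrayList<>();

    synchronized void addListener(Listener listener) {
        listeners.add(listener);
    }

    /** Call with the complete set of registered nodes each time the watcher fires. */
    synchronized void update(Collection<String> snapshot) {
        Set<String> current = new HashSet<>(snapshot);

        // Nodes in the new snapshot that we did not know about: new or rejoining.
        Set<String> joined = new HashSet<>(current);
        joined.removeAll(live);

        // Nodes we knew about that are missing from the new snapshot.
        Set<String> left = new HashSet<>(live);
        left.removeAll(current);

        live.clear();
        live.addAll(current);

        for (Listener listener : listeners) {
            for (String node : joined) {
                listener.nodeJoined(node);
            }
            for (String node : left) {
                listener.nodeLeft(node);
            }
        }
    }

    /** Returns a copy of the nodes currently considered live. */
    synchronized Set<String> liveNodes() {
        return new HashSet<>(live);
    }
}
```

A query-side listener would fail (or someday recover) fragments in nodeLeft,
and the planner would simply start seeing the extra node after nodeJoined.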

On Tue, Mar 24, 2015 at 5:46 PM, Hanifi Gunes <[email protected]> wrote:

> I doubt that we do periodic cluster discovery. The current model seems to
> rely on a static set of nodes that is queried while bootstrapping. Also, the
> "graceful" behavior described above basically tells us that we are not
> fault-tolerant. Dynamic cluster discovery is fairly easy to implement via
> watchers or the like. However, making Drill fault-tolerant needs some
> serious discussion, which is quite important for long-running queries. As
> we mature, I hope to see more discussion around these issues, because
> failures do happen.
>
> Regards.
> -Hanifi
>
> On Tue, Mar 24, 2015 at 5:22 PM, Ramana Inukonda Nagaraj (JIRA) <
> [email protected]> wrote:
>
> > Ramana Inukonda Nagaraj created DRILL-2550:
> > ----------------------------------------------
> >
> >              Summary: Drillbit disconnect from ZK results in drillbit
> > being lost until restart
> >                  Key: DRILL-2550
> >                  URL: https://issues.apache.org/jira/browse/DRILL-2550
> >              Project: Apache Drill
> >           Issue Type: Bug
> >           Components: Execution - Flow
> >             Reporter: Ramana Inukonda Nagaraj
> >             Assignee: Chris Westin
> >             Priority: Minor
> >
> >
> > Not quite sure if this is an issue or even if it's important; maybe
> > someone can think of a situation where this might be a bigger issue.
> >
> > Steps taken to recreate:
> > 1. Start up drillbits on multiple nodes. (They all come up and form an
> > 8-node cluster.)
> > 2. Start executing a long-running query.
> > 3. Use TCPKILL to kill all connections between one node and ZooKeeper
> > port 5181.
> > Drill seems to behave very gracefully here - I see a nice error message
> > saying Query failed: ForemanException: One more more nodes lost
> > connectivity during query. Identified node was atsqa6c61.qa.lab
> >
> > However, once I start allowing connections again, the node is not brought
> > back into the cluster until the drillbit is restarted.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
> >
>
