If we're going to go in that direction, it'd be nice to do it in a way that
allows not only for a node to rejoin the cluster, but for new nodes to be
added while it is already running. Seems like it shouldn't be so different.
- Agreed. This would give us the ability to scale elastically up and down,
allowing online node updates without taking the service down.
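
A rough sketch of the membership side of this, in Python for brevity. It
assumes a watcher callback hands us the full current list of registered
nodes each time it fires. ClusterView, on_children_changed and
diff_membership are hypothetical names for illustration, not existing Drill
code.

```python
def diff_membership(known, current):
    """Return (joined, left) between two snapshots of registered nodes."""
    known, current = set(known), set(current)
    return current - known, known - current


class ClusterView:
    """Tracks the live node set (hypothetical, not a Drill class).

    on_children_changed is what a coordinator watcher would call with the
    latest list of registered nodes.
    """

    def __init__(self, initial=()):
        self.nodes = set(initial)

    def on_children_changed(self, current):
        joined, left = diff_membership(self.nodes, current)
        self.nodes = set(current)
        return joined, left


view = ClusterView(["node-a", "node-b"])
# node-a drops out and a brand-new node-c registers.
joined, left = view.on_children_changed(["node-b", "node-c"])
# node-a reconnects later; it is handled exactly like a new node.
rejoined, gone = view.on_children_changed(["node-a", "node-b", "node-c"])
```

A rejoining node and a newly added node take the same path here, which is
why the two cases should not be so different.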

That's still a ways off from recovering lost work done for queries with
fragments running on a node that goes down, though.
- I can see some trivial cases where we only need to re-assign the
computation to another node and do not need to replay data from
upstream, e.g. when a leaf fragment fails. It gets more complicated
once a fragment's lineage comes into the picture.
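
To make the trivial case concrete, here is a minimal sketch in Python. It
assumes each fragment records which fragments feed it, that a leaf fragment
can simply be re-run from its source, and that downstream can discard any
partial output from the failed attempt. Fragment, plan_recovery and the
round-robin placement are hypothetical illustrations, not Drill's scheduler.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Fragment:
    fragment_id: str
    node: str
    # Ids of fragments feeding this one; empty for a leaf fragment.
    upstream: list = field(default_factory=list)


def plan_recovery(fragments, failed_node, survivors):
    """Split the failed node's fragments into two groups.

    Leaf fragments are reassigned round-robin to surviving nodes.
    Fragments with upstream inputs are only reported, since restarting
    them would require replaying data along their lineage.
    """
    if not survivors:
        raise ValueError("no surviving nodes to reassign to")
    targets = cycle(sorted(survivors))
    reassigned, needs_replay = {}, []
    for frag in fragments:
        if frag.node != failed_node:
            continue  # fragments on healthy nodes are untouched
        if not frag.upstream:
            reassigned[frag.fragment_id] = next(targets)
        else:
            needs_replay.append(frag.fragment_id)
    return reassigned, needs_replay


fragments = [
    Fragment("scan-0", "n1"),
    Fragment("scan-1", "n1"),
    Fragment("join-0", "n1", upstream=["scan-0"]),
    Fragment("scan-2", "n2"),
]
reassigned, needs_replay = plan_recovery(fragments, "n1", ["n3", "n2"])
```

Everything that lands in needs_replay is where the lineage discussion
starts.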

Good food for thought.

-Hanifi

On Tue, Mar 24, 2015 at 6:01 PM, Chris Westin <[email protected]> wrote:

> If we're going to go in that direction, it'd be nice to do it in a way that
> allows not only for a node to rejoin the cluster, but for new nodes to be
> added while it is already running. Seems like it shouldn't be so different.
>
> That's still a ways off from recovering lost work done for queries with
> fragments running on a node that goes down, though.
>
> On Tue, Mar 24, 2015 at 5:46 PM, Hanifi Gunes <[email protected]> wrote:
>
> > I doubt if we do periodic cluster discovery. Current model seems to rely on
> > static set of nodes that is queried while bootstrapping. Also the
> > "graceful" behavior described above basically tells that we are not
> > fault-tolerant. Dynamic cluster discovery is somewhat easy to implement via
> > watchers or suchlike. However making Drill fault-tolerant needs some
> > serious discussion, which is quite important for long running queries. As
> > we mature, I hope to see more discussions coming around these issues
> > because failures do happen.
> >
> > Regards.
> > -Hanifi
> >
> > On Tue, Mar 24, 2015 at 5:22 PM, Ramana Inukonda Nagaraj (JIRA) <
> > [email protected]> wrote:
> >
> > > Ramana Inukonda Nagaraj created DRILL-2550:
> > > ----------------------------------------------
> > >
> > >              Summary: Drillbit disconnect from ZK results in drillbit
> > > being lost until restart
> > >                  Key: DRILL-2550
> > >                  URL: https://issues.apache.org/jira/browse/DRILL-2550
> > >              Project: Apache Drill
> > >           Issue Type: Bug
> > >           Components: Execution - Flow
> > >             Reporter: Ramana Inukonda Nagaraj
> > >             Assignee: Chris Westin
> > >             Priority: Minor
> > >
> > >
> > > Not quite sure if this is an issue or even if its important- maybe someone
> > > can think of a situation where this might be a bigger issue.
> > >
> > > Steps taken to recreate:
> > > 1. Startup drillbits on multiple nodes. (They all come up and form a 8
> > > node cluster)
> > > 2. Start executing a long running query.
> > > 3. Use TCPKILL to kill all connections between one node and zookeeper port
> > > 5181.
> > > Drill seems to behave very gracefully here - I see a nice error message
> > > saying Query failed: ForemanException: One more more nodes lost
> > > connectivity during query. Identified node was atsqa6c61.qa.lab
> > >
> > > However, once I start allowing connections back the node is not brought
> > > back as part of the cluster until a drillbit restart.
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian JIRA
> > > (v6.3.4#6332)
> > >
> >
>
