If we're going to go in that direction, it'd be nice to do it in a way that allows not only for a node to rejoin the cluster, but also for new nodes to be added while the cluster is already running. Seems like the two cases shouldn't be so different.
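To make that concrete, here's a rough sketch of why rejoining and adding could share one code path. It is not Drill's code and it doesn't use the ZooKeeper API; `ClusterRegistry` and `Node` are hypothetical names, and the registry is just an in-memory stand-in for a coordination service that notifies watchers whenever the member set changes.

```python
from typing import Callable, FrozenSet, List, Set

Watcher = Callable[[FrozenSet[str]], None]


class ClusterRegistry:
    """Hypothetical in-memory stand-in for a coordination service."""

    def __init__(self) -> None:
        self._members: Set[str] = set()
        self._watchers: List[Watcher] = []

    def watch(self, callback: Watcher) -> None:
        # Deliver the current view right away, then every later change.
        self._watchers.append(callback)
        callback(frozenset(self._members))

    def register(self, name: str) -> None:
        if name not in self._members:
            self._members.add(name)
            self._notify()

    def unregister(self, name: str) -> None:
        # Models a node's entry disappearing when its connection is lost.
        if name in self._members:
            self._members.discard(name)
            self._notify()

    def _notify(self) -> None:
        snapshot = frozenset(self._members)
        for callback in self._watchers:
            callback(snapshot)


class Node:
    """Hypothetical cluster member that tracks the live member set."""

    def __init__(self, name: str, registry: ClusterRegistry) -> None:
        self.name = name
        self.registry = registry
        self.active: FrozenSet[str] = frozenset()
        registry.watch(self._on_change)

    def _on_change(self, members: FrozenSet[str]) -> None:
        self.active = members

    def join(self) -> None:
        # Same call for a first start, a reconnect, or a brand-new node.
        self.registry.register(self.name)


registry = ClusterRegistry()
a, b = Node("node-a", registry), Node("node-b", registry)
a.join()
b.join()

registry.unregister("node-b")   # node-b loses its connection
after_loss = sorted(a.active)

b.join()                        # node-b reconnects, no restart needed
after_rejoin = sorted(a.active)

c = Node("node-c", registry)    # a new node shows up while running
c.join()
after_add = sorted(a.active)
```

In this sketch the other nodes only ever react to membership change notifications, so a reconnecting node and a newly added node look the same to them.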
That's still a ways off from recovering lost work done for queries with fragments running on a node that goes down, though.

On Tue, Mar 24, 2015 at 5:46 PM, Hanifi Gunes <[email protected]> wrote:

> I doubt if we do periodic cluster discovery. Current model seems to rely on
> static set of nodes that is queried while bootstrapping. Also the
> "graceful" behavior described above basically tells that we are not
> fault-tolerant. Dynamic cluster discovery is somewhat easy to implement via
> watchers or suchlike. However making Drill fault-tolerant needs some
> serious discussion, which is quite important for long running queries. As
> we mature, I hope to see more discussions coming around these issues
> because failures do happen.
>
> Regards.
> -Hanifi
>
> On Tue, Mar 24, 2015 at 5:22 PM, Ramana Inukonda Nagaraj (JIRA) <
> [email protected]> wrote:
>
> > Ramana Inukonda Nagaraj created DRILL-2550:
> > ----------------------------------------------
> >
> >     Summary: Drillbit disconnect from ZK results in drillbit
> >              being lost until restart
> >     Key: DRILL-2550
> >     URL: https://issues.apache.org/jira/browse/DRILL-2550
> >     Project: Apache Drill
> >     Issue Type: Bug
> >     Components: Execution - Flow
> >     Reporter: Ramana Inukonda Nagaraj
> >     Assignee: Chris Westin
> >     Priority: Minor
> >
> >
> > Not quite sure if this is an issue or even if its important- maybe someone
> > can think of a situation where this might be a bigger issue.
> >
> > Steps taken to recreate:
> > 1. Startup drillbits on multiple nodes. (They all come up and form a 8
> > node cluster)
> > 2. Start executing a long running query.
> > 3. Use TCPKILL to kill all connections between one node and zookeeper port
> > 5181.
> > Drill seems to behave very gracefully here - I see a nice error message
> > saying Query failed: ForemanException: One more more nodes lost
> > connectivity during query. Identified node was atsqa6c61.qa.lab
> >
> > However, once I start allowing connections back the node is not brought
> > back as part of the cluster until a drillbit restart.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
