> If we're going to go in that direction, it'd be nice to do it in a way that
> allows not only for a node to rejoin the cluster, but for new nodes to be
> added while it is already running. Seems like it shouldn't be so different.

Agree. This will give us the ability to scale up and down elastically, allowing online node updates without the service going down.
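To make the rejoin/add-while-running idea concrete, here is a rough sketch of watcher-style membership tracking. It is only an illustration under stated assumptions: `MembershipRegistry` is an in-memory stand-in for a coordination service, not ZooKeeper's or Curator's API, and `ClusterView` and the node names are hypothetical, not Drill code.

```python
from typing import Callable, List, Set


class MembershipRegistry:
    """In-memory stand-in for a coordination service (hypothetical, not ZooKeeper's API)."""

    def __init__(self) -> None:
        self._members: Set[str] = set()
        self._watchers: List[Callable[[Set[str]], None]] = []

    def watch(self, callback: Callable[[Set[str]], None]) -> None:
        # Register a watcher and hand it the current membership right away.
        self._watchers.append(callback)
        callback(set(self._members))

    def register(self, node: str) -> None:
        if node not in self._members:
            self._members.add(node)
            self._notify()

    def unregister(self, node: str) -> None:
        if node in self._members:
            self._members.discard(node)
            self._notify()

    def _notify(self) -> None:
        # Every watcher receives its own snapshot of the membership set.
        for callback in self._watchers:
            callback(set(self._members))


class ClusterView:
    """Keeps the set of active nodes current instead of reading it once at bootstrap."""

    def __init__(self, registry: MembershipRegistry) -> None:
        self.active: Set[str] = set()
        registry.watch(self._on_change)

    def _on_change(self, members: Set[str]) -> None:
        self.active = members


registry = MembershipRegistry()
view = ClusterView(registry)
registry.register("node-a")
registry.register("node-b")
registry.unregister("node-b")  # node-b loses its connection to the registry
registry.register("node-b")    # connectivity restored: node-b rejoins without a restart
registry.register("node-c")    # a new node is added while the cluster is running
print(sorted(view.active))
```

In this sketch, rejoining and adding a node take the same code path, which is the sense in which the two cases should not be so different. It says nothing about recovering work that was in flight on a node that dropped out.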
> That's still a ways off from recovering lost work done for queries with
> fragments running on a node that goes down, though.

I can see some trivial cases where we only need to reassign computation to another node and do not need to replay data from upstream, such as when a leaf fragment fails. It gets more complicated when a fragment's lineage comes into the picture. Good food for thought.

-Hanifi

On Tue, Mar 24, 2015 at 6:01 PM, Chris Westin <[email protected]> wrote:

> If we're going to go in that direction, it'd be nice to do it in a way that
> allows not only for a node to rejoin the cluster, but for new nodes to be
> added while it is already running. Seems like it shouldn't be so different.
>
> That's still a ways off from recovering lost work done for queries with
> fragments running on a node that goes down, though.
>
> On Tue, Mar 24, 2015 at 5:46 PM, Hanifi Gunes <[email protected]> wrote:
>
> > I doubt if we do periodic cluster discovery. Current model seems to rely on
> > static set of nodes that is queried while bootstrapping. Also the
> > "graceful" behavior described above basically tells that we are not
> > fault-tolerant. Dynamic cluster discovery is somewhat easy to implement via
> > watchers or suchlike. However making Drill fault-tolerant needs some
> > serious discussion, which is quite important for long running queries. As
> > we mature, I hope to see more discussions coming around these issues
> > because failures do happen.
> >
> > Regards.
> > -Hanifi
> >
> > On Tue, Mar 24, 2015 at 5:22 PM, Ramana Inukonda Nagaraj (JIRA) <[email protected]> wrote:
> >
> > > Ramana Inukonda Nagaraj created DRILL-2550:
> > > ----------------------------------------------
> > >
> > > Summary: Drillbit disconnect from ZK results in drillbit being lost until restart
> > > Key: DRILL-2550
> > > URL: https://issues.apache.org/jira/browse/DRILL-2550
> > > Project: Apache Drill
> > > Issue Type: Bug
> > > Components: Execution - Flow
> > > Reporter: Ramana Inukonda Nagaraj
> > > Assignee: Chris Westin
> > > Priority: Minor
> > >
> > >
> > > Not quite sure if this is an issue or even if its important- maybe someone
> > > can think of a situation where this might be a bigger issue.
> > >
> > > Steps taken to recreate:
> > > 1. Startup drillbits on multiple nodes. (They all come up and form a 8 node cluster)
> > > 2. Start executing a long running query.
> > > 3. Use TCPKILL to kill all connections between one node and zookeeper port 5181.
> > > Drill seems to behave very gracefully here - I see a nice error message
> > > saying Query failed: ForemanException: One more more nodes lost
> > > connectivity during query. Identified node was atsqa6c61.qa.lab
> > >
> > > However, once I start allowing connections back the node is not brought
> > > back as part of the cluster until a drillbit restart.
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian JIRA
> > > (v6.3.4#6332)
