Hi Luke,

How does this pseudocode interact with regular master operations (ALTER TABLE, DROP TABLE, etc.)? Are those operations suspended whenever there is a recovery in progress, or would you go deeper and check whether each operation is actually affected by the recovery, letting it proceed otherwise?
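For the latter, I'm picturing something along these lines (a minimal sketch only; the Operation type and must_wait_for_recovery are made-up names, not actual Hypertable code):

#include <set>
#include <string>
#include <vector>

// Illustrative only -- not actual Hypertable types.
struct Operation {
  std::vector<std::string> participating_servers;  // range servers the op touches
};

// True if the operation has to block behind the recovery, i.e. it
// touches at least one range server that is currently being recovered.
bool must_wait_for_recovery(const Operation &op,
                            const std::set<std::string> &down_range_servers) {
  for (const std::string &server : op.participating_servers)
    if (down_range_servers.count(server))
      return true;
  return false;
}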
Also, conceptually, a DAG is a simplified finite state machine (no loops), so I suspect the main difference in Doug's proposal is that the dependencies are formalized and abstracted into the ODG instead of being implicit via state transitions (i.e. a collection of if-then-elses).

-Sanjit

On Mon, Aug 3, 2009 at 10:42 AM, Luke <[email protected]> wrote:
>
> I still don't see why the ODG design simplifies things. The argument for the ODG would be more convincing if you could walk me through a simple example that demonstrates its necessity and/or simplicity. For example, the recovery handling can be implemented straightforwardly with this rough pseudocode:
>
> recover(range_server) {
>   if (down_range_servers.size() == 0)
>     start_recovery(range_server);
>   else
>     restart_recovery(range_server);
> }
>
> start_recovery(range_server) {
>   down_range_servers.add(range_server);
>   rid = get_recovery_id();
>   mml->log_start_recovery(rid, down_range_servers);
>   do_recovery(rid);
> }
>
> restart_recovery(range_server) {
>   down_range_servers.add(range_server);
>   old_rid = current_recovery_id;
>   rid = get_recovery_id();
>   mml->log_restart_recovery(old_rid, rid, down_range_servers);
>   do_recovery(rid);
> }
>
> do_recovery(rid) {
>   [root_range, meta_ranges, user_ranges] = get_down_ranges(rid);
>
>   if (root_range)
>     recover_range(rid, root_range);
>
>   parallel foreach(range : meta_ranges)
>     recover_range(rid, range);
>
>   parallel foreach(range : user_ranges)
>     recover_range(rid, range);
> }
>
> recover_range(rid, range) {
>   check_recovery_state(rid);
>   dest = range_scheduler->get_range_server(rid, range);
>   replay_load_range(dest, range);
>   replay_commit_log(dest, range);
>   replay_commit(dest);
>   mml->log_done_recover_range(rid, range);
> }
>
> check_recovery_state(rid) {
>   if (rid != current_recovery_id)
>     throw RecoveryCancelled("recovery restarted");
> }
>
> I just don't see a DAG approach getting any simpler: you'd have to create a separate class and set up dependencies explicitly for each task; you'd need to traverse the entire graph for every event/interrupt; and you'd probably need to cancel a subtree explicitly for any event/interrupt. In the simple approach above, the dependencies are implicitly expressed in the running stack, and an exception can abort the entire stack as needed when a restart is detected.
>
> For tasks like move and split, if the destination range server is not available, or the task can't finish because related metadata is not available, simply cancel the task and the originating range server can retry later (the RSML will ensure durability of the transaction).
>
> On Sun, Aug 2, 2009 at 12:36 AM, Doug Judd <[email protected]> wrote:
> > The way I arrived at this design was via an attempt to simplify things. The Master is essentially an interrupt driven system. It sits idle until it gets asynchronously interrupted by a user command, RangeServer failure, or split report. Each one of these interrupts creates some operation that needs to be performed. Operations that are in-progress, or future operations, may need to be halted and made dependent on this new operation (e.g. RangeServer recovery). This is what prompted the dependency graph idea, which leads to the following server design:
> >
> > while (true) {
> >   wait_for_interrupt()
> >   pause_execution_engine()
> >   modify_dependency_graph()
> >   restart_execution_engine()
> > }
> >
> > It doesn't get more clean and elegant than that. All prior design attempts led to a big pile of if statements and while loops. I can't think of a more simple design than this.
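> >
> > To make that loop a bit more concrete, here is a rough sketch of the engine's event loop (Master, DependencyGraph, Interrupt and the pause/restart methods are illustrative names only, not a worked-out implementation):
> >
> > #include <condition_variable>
> > #include <mutex>
> > #include <queue>
> >
> > // Illustrative only -- none of these are actual Hypertable classes.
> > struct Interrupt {
> >   enum Type { USER_COMMAND, SERVER_FAILURE, SPLIT_REPORT } type;
> > };
> >
> > struct DependencyGraph {
> >   // Splice in the new operation and re-point dependent nodes at it.
> >   void add_operation(const Interrupt &intr) { /* ... */ }
> > };
> >
> > class Master {
> > public:
> >   // Called by the event handlers (command arrival, server failure, ...).
> >   void deliver(const Interrupt &intr) {
> >     std::lock_guard<std::mutex> lock(m_mutex);
> >     m_interrupts.push(intr);
> >     m_cond.notify_one();
> >   }
> >
> >   void run() {
> >     for (;;) {
> >       Interrupt intr = wait_for_interrupt();
> >       pause_execution_engine();      // worker threads stop picking up nodes
> >       m_graph.add_operation(intr);   // structural change to the ODG
> >       restart_execution_engine();    // workers resume in topological order
> >     }
> >   }
> >
> > private:
> >   Interrupt wait_for_interrupt() {
> >     std::unique_lock<std::mutex> lock(m_mutex);
> >     m_cond.wait(lock, [this] { return !m_interrupts.empty(); });
> >     Interrupt intr = m_interrupts.front();
> >     m_interrupts.pop();
> >     return intr;
> >   }
> >   void pause_execution_engine()   { m_paused = true;  }  // real version would drain workers
> >   void restart_execution_engine() { m_paused = false; }
> >
> >   std::mutex m_mutex;
> >   std::condition_variable m_cond;
> >   std::queue<Interrupt> m_interrupts;
> >   DependencyGraph m_graph;
> >   bool m_paused = false;
> > };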
>
> Even the simplest task would require a separate class and explicit dependency setup in the ODG design. And if a task is not immutable like a typical map-reduce job, i.e. if it depends on master state (e.g. recovery, or any nonblocking version of the commands), the application logic becomes much more obfuscated.
>
> IMHO, a potentially better abstraction is a finite state machine. You could specify/declare a state machine in a simple language, e.g.:
>
> event: range_server_down(range_server):
>   state DEFAULT: { start_recovery(); } -> RECOVERING1
>   state RECOVERING1: { restart_recovery(); }
>   ...
>
> and a compiler could validate it and generate an FSM that you instantiate and basically drive with:
>
> while ((event = wait_for_event())) {
>   fsm->handle(event);
> }
>
> But I think the hand-coded simple approach is easy to implement and understand, and probably more appropriate for the first release.
>
> __Luke
>
> >
> > - Doug
> >
> > On Sat, Aug 1, 2009 at 2:07 PM, Luke <[email protected]> wrote:
> >>
> >> A DAG is only required when you want to do transactions with dynamic dependencies (only known at runtime) in parallel. Let's examine the examples:
> >>
> >> On Sat, Aug 1, 2009 at 10:29 AM, Doug Judd <[email protected]> wrote:
> >> > I agree that for the current version of the ALTER TABLE command, you could probably implement it in such a way that you wouldn't need this mechanism. However, there are several other situations that come to mind where this isn't the case:
> >> >
> >> > 1. We've talked about adding a 'force' option which would cause a major compaction to occur on each range after its schema was altered, so that the space taken up by dropped columns could get immediately freed up.
> >> > 2. Tyler has been asking for the ability to delete certain portions of a table (e.g. drop all rows less than a certain value, or all cells before some timestamp). To do this we're going to add a COMPACT command (see issue 307).
> >> > 3. At some point we'll be adding a SNAPSHOT command. Since CellStores can be shared, cheap table snapshots can be made by first performing a compaction and then copying the METADATA entries for the table being snapshotted.
> >> >
> >> > In all three of these situations, if a RangeServer goes down in the middle of the operation, when one of the participating ranges in the operation is recovered on a new RangeServer, it won't have enough state to finish the operation.
> >>
> >> So far all of these commands can be implemented as a single transaction. Let's use compact as an example:
> >>
> >> 1. The master writes begin compact(list of ranges) -> id in the MML.
> >> 2. For each finished range, the master gets an ack and writes done_compact(range) in the MML.
> >> 3. If any of the range servers dies, the master can restart the transaction: restart compact(remaining ranges) -> id in the MML.
> >> 4. Steps 2-3 can happen multiple times, when multiple range servers die.
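> >>
> >> A rough sketch of the master side of that transaction (the MetaLog/CompactTransaction names and the log_* methods are made up for illustration; they are not the actual MML interface):
> >>
> >> #include <set>
> >> #include <string>
> >> #include <utility>
> >>
> >> // Illustrative stand-ins only.
> >> struct Range {
> >>   std::string table, end_row;
> >>   bool operator<(const Range &other) const {
> >>     return table != other.table ? table < other.table : end_row < other.end_row;
> >>   }
> >> };
> >>
> >> struct MetaLog {   // stands in for the MML
> >>   void log_begin_compact(int id, const std::set<Range> &ranges) {}
> >>   void log_done_compact(int id, const Range &range) {}
> >>   void log_restart_compact(int old_id, int new_id, const std::set<Range> &remaining) {}
> >> };
> >>
> >> class CompactTransaction {
> >> public:
> >>   CompactTransaction(MetaLog *mml, std::set<Range> ranges)
> >>     : m_mml(mml), m_remaining(std::move(ranges)) {
> >>     m_mml->log_begin_compact(m_id, m_remaining);            // step 1
> >>   }
> >>   void range_compacted(const Range &range) {                // step 2: ack from a range server
> >>     m_mml->log_done_compact(m_id, range);
> >>     m_remaining.erase(range);
> >>   }
> >>   void range_server_died() {                                // steps 3-4: restart with what's left
> >>     int old_id = m_id;
> >>     m_mml->log_restart_compact(old_id, ++m_id, m_remaining);
> >>   }
> >>   bool complete() const { return m_remaining.empty(); }
> >> private:
> >>   MetaLog *m_mml;
> >>   std::set<Range> m_remaining;
> >>   int m_id = 0;
> >> };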
> >> > The place where maintaining a complex dependency graph is probably the most important is in the RangeServer recovery process itself. As Sanjit pointed out, the recovery process needs to happen in roughly three stages: ROOT range recovery, METADATA range recovery, and lastly USER range recovery. If you drill down into the RangeServer recovery process, it can be broken down into the following four dependent steps:
> >> >
> >> > 1. Replay the RSML and assign all the ranges to new RangeServers
> >> > 2. Invoke RangeServer::replay_load_range() for each range assigned to a RangeServer
> >> > 3. Replay commit log fragments
> >> > 4. Invoke RangeServer::replay_commit() on each participating RangeServer
> >> >
> >> > At any point in this process a RangeServer could go down, and if it is one of the RangeServers participating in the recovery, then things get complicated. As a general rule, you can't carry out step #4 for USER ranges unless the METADATA ranges are available, and you can't carry out step #4 for METADATA ranges until the ROOT range is available.
> >>
> >> Since we only handle one recovery transaction at a time, with the 4 steps being required sub-transactions that are already known:
> >>
> >> 1. When a range server dies, the master writes begin recover(dead_range_server1) -> id in the MML.
> >> 2. For each range in dead_range_server1's metalog, the master invokes replay_load_range and writes done_replay_load_range(range) in the MML.
> >> 3. If another range server dies, the master writes restart recovery(dead_range_server1, dead_range_server2) -> id in the MML. The master can then compute the set of ranges that still need to be recovered by examining the done_replay_load_range entries so far and the ranges in both dead range servers' metalogs, and basically does step 2 again.
> >> 4. If replay_load_range finishes, the master can proceed to the next steps. Any range server death during the following steps would cause a restart (step 3).
> >> 5. The above steps can happen multiple times, hopefully with fewer ranges to recover each time (otherwise it's a cascading failure that would take down the whole cluster; we'll probably need to send out alerts when the rate of death accelerates).
> >>
> >> > Ideally the recovery of each USER range should only be dependent on the recovery of the METADATA range that holds its corresponding entry. Getting the recovery process implemented optimally will be important for maintaining good 99th percentile latency in large (thousands of nodes) deployments running on a cloud provider like EC2.
> >>
> >> This is the only place a DAG is useful (but not required) so far. And you don't need to persist it in the MML: you can reconstruct the DAG by scanning the MML if the master dies, because you know the dependencies beforehand (meta before user).
> >>
> >> > Given all these dependencies, a directed acyclic graph seems like a clean and natural representation.
> >>
> >> A DAG is a natural representation for parallel transactions with dependencies, but it really doesn't buy you much at this stage, mostly because the dependencies are not created dynamically by the user at runtime (like in general map/reduce jobs). The bulk of the work is to get the recovery transaction right. I'd be very happy if we can do the parallel replay_load_range and replay_commit_log like the Baidu guys did. Fine-grained parallel range recovery (multiple (meta, user) transactions) is nice to have and a little premature for the first release, IMHO.
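> >>
> >> For example, a restarted master could rebuild the in-flight recovery state with one pass over the MML, roughly like this (the entry layout and names are made up, and I'm assuming the begin/restart entries record the full remaining set):
> >>
> >> #include <set>
> >> #include <string>
> >> #include <vector>
> >>
> >> // Illustrative sketch only -- not the real MML format.
> >> struct MmlEntry {
> >>   enum Type { BEGIN_RECOVER, RESTART_RECOVER, DONE_REPLAY_LOAD_RANGE } type;
> >>   int recovery_id;
> >>   std::set<std::string> ranges;   // ranges named by this entry
> >> };
> >>
> >> // Returns the ranges that still need replay_load_range for the latest
> >> // recovery id; the stage ordering (root, then meta, then user) is
> >> // known beforehand, so nothing else needs to be persisted.
> >> std::set<std::string> remaining_ranges(const std::vector<MmlEntry> &entries,
> >>                                        int &latest_id) {
> >>   std::set<std::string> remaining;
> >>   for (const MmlEntry &entry : entries) {
> >>     switch (entry.type) {
> >>     case MmlEntry::BEGIN_RECOVER:
> >>     case MmlEntry::RESTART_RECOVER:
> >>       latest_id = entry.recovery_id;
> >>       remaining = entry.ranges;   // begin/restart entries carry the remaining set
> >>       break;
> >>     case MmlEntry::DONE_REPLAY_LOAD_RANGE:
> >>       if (entry.recovery_id == latest_id)
> >>         for (const std::string &range : entry.ranges)
> >>           remaining.erase(range);
> >>       break;
> >>     }
> >>   }
> >>   return remaining;
> >> }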
> >>
> >> __Luke
> >>
> >> >
> >> > - Doug
> >> >
> >> > On Fri, Jul 31, 2009 at 3:49 PM, Luke <[email protected]> wrote:
> >> >>
> >> >> On Fri, Jul 31, 2009 at 3:30 PM, Doug Judd <[email protected]> wrote:
> >> >> > The two examples that I gave are just two of many possible scenarios:
> >> >> >
> >> >> > - What if you are in the middle of DROP TABLE and a range server goes down, and then in the middle of that recovery another range server goes down, etc.?
> >> >>
> >> >> It would still work if the schema is updated in hyperspace and on the remaining range servers, as long as the destination range servers are doing the right thing: ignoring ranges that belong to a nonexistent table. The range server can cache negative results of schema queries for a while to avoid hitting hyperspace too much on unseen tables (range servers should do that already, to avoid an accidental ddos on hyperspace).
> >> >>
> >> >> > - What if someone does an ALTER TABLE and then, before it is complete, one of the participating range servers goes down, and then before the range server recovers, one of the range servers participating in the recovery goes down, and then someone does a DROP TABLE ...
> >> >>
> >> >> It really doesn't matter: hyperspace keeps the schema and the remaining range servers get the memo, so the alter table can finish on the remaining range servers. Same thing for the drop table. The key is to populate the correct schema lazily when the ranges are recovered.
> >> >>
> >> >> > - What if the Master is in the middle of a MOVE operation, and then the range server that the range is being moved to goes down before it reports back, then recovery starts, and then someone does an ALTER TABLE?
> >> >> >
> >> >> > - What if there are 16 overlapping range server failures?
> >> >>
> >> >> As long as the operations are idempotent, all you need to do is update the schema on hyperspace and the remaining range servers atomically.
> >> >>
> >> >> > Trying to program for all of the scenarios would be a total nightmare without some sort of dependency abstraction. Just because these scenarios are rare doesn't mean we can ignore them. We have to come up with a clean design that covers all the possible scenarios.
> >> >>
> >> >> The cleanest design doesn't have to handle these dependencies if possible. The ODG and related stuff would be a debugging nightmare.
> >> >>
> >> >> > This isn't over-engineering. It is clean engineering. There is no way to cleanly design the Master without some sort of dependency abstraction.
> >> >>
> >> >> I still haven't seen a compelling example yet. I'll change my mind when I see it.
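> >> >>
> >> >> Re the negative caching of schema lookups I mentioned above, a rough sketch of what I mean (made-up names, arbitrary TTL, not actual code):
> >> >>
> >> >> #include <chrono>
> >> >> #include <map>
> >> >> #include <string>
> >> >>
> >> >> // Illustrative only.
> >> >> class NegativeSchemaCache {
> >> >> public:
> >> >>   // Remember that hyperspace just told us 'table' does not exist.
> >> >>   void remember_missing(const std::string &table) {
> >> >>     m_missing[table] = std::chrono::steady_clock::now();
> >> >>   }
> >> >>   // True if we recently learned the table doesn't exist, so the range
> >> >>   // server can ignore the range without asking hyperspace again.
> >> >>   bool known_missing(const std::string &table) const {
> >> >>     auto it = m_missing.find(table);
> >> >>     return it != m_missing.end() &&
> >> >>            std::chrono::steady_clock::now() - it->second < m_ttl;
> >> >>   }
> >> >> private:
> >> >>   std::chrono::seconds m_ttl{60};  // arbitrary TTL for the sketch
> >> >>   std::map<std::string, std::chrono::steady_clock::time_point> m_missing;
> >> >> };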
> >> >>
> >> >> __Luke
> >> >>
> >> >> >
> >> >> > - Doug
> >> >> >
> >> >> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
> >> >> >>
> >> >> >> The Master is getting more and more like a workqueue and a jobtracker :) It seems advantageous to actually create a separate, general server to manage all the tasks, which could be used to schedule map/reduce tasks in the future as well.
> >> >> >>
> >> >> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd <[email protected]> wrote:
> >> >> >> > The Master is responsible for orchestrating recovery from RangeServer failures as well as carrying out meta operations in response to commands such as CREATE TABLE, ALTER TABLE, and DROP TABLE. These meta operations are relatively straightforward except in the face of RangeServer failure. When this happens, any in-progress meta operation that is dependent on the failed RangeServer needs to block until the RangeServer has been recovered. If another RangeServer that is involved in the recovery goes down, there is now another recovery operation that needs to get carried out. The Master can quickly start building up a fairly complex set of operation dependencies.
> >> >> >> >
> >> >> >> > The Master is also responsible for moving ranges from one RangeServer to another when load across the RangeServers gets out of balance. If a MOVE RANGE operation is in progress when, say, an ALTER TABLE request arrives, and the range being moved is part of the table specified in the ALTER TABLE request, then the ALTER TABLE operation needs to wait until the MOVE RANGE operation is complete before it can continue. Also, if two ALTER TABLE requests arrive at the Master at the same time, then they should get carried out in sequential order, with one of the ALTER TABLE operations depending on the completion of the other.
> >> >> >>
> >> >> >> I'm not sure about this particular case. For an alter table while ranges are being split/moved, it seems to me that it works as long as you update the schema in hyperspace and on the range servers atomically: the split/moved ranges on the new destination server will get the right schema. Two alter tables can also overlap in many cases, as long as the schema updates on hyperspace/range servers are atomic. For cases where alter tables on the same table need to be sequenced, it's actually not too much to ask the application to do the sequencing, as alter table is not really a frequent operation (otherwise, they should go with a generic column family and go nuts on qualifiers).
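> >> >> >>
> >> >> >> Roughly what I have in mind (a sketch only; Schema, HyperspaceClient and load_range here are hypothetical stand-ins, not the real interfaces):
> >> >> >>
> >> >> >> #include <cstdint>
> >> >> >> #include <map>
> >> >> >> #include <memory>
> >> >> >> #include <mutex>
> >> >> >> #include <string>
> >> >> >>
> >> >> >> // Illustrative only -- not the real RangeServer/Hyperspace interfaces.
> >> >> >> struct Schema { uint32_t generation; /* column families, etc. */ };
> >> >> >>
> >> >> >> struct HyperspaceClient {
> >> >> >>   // Stub: reads the current schema for the table from hyperspace.
> >> >> >>   std::shared_ptr<Schema> read_schema(const std::string &table) {
> >> >> >>     return std::make_shared<Schema>(Schema{1});
> >> >> >>   }
> >> >> >> };
> >> >> >>
> >> >> >> class RangeServer {
> >> >> >> public:
> >> >> >>   // Called whenever a range is loaded here (split, move, or recovery).
> >> >> >>   // Because the schema is (re)fetched at load time, a range that lands
> >> >> >>   // here after an ALTER TABLE automatically sees the post-alter schema.
> >> >> >>   void load_range(const std::string &table) {
> >> >> >>     std::lock_guard<std::mutex> lock(m_mutex);
> >> >> >>     std::shared_ptr<Schema> &cached = m_schemas[table];
> >> >> >>     std::shared_ptr<Schema> latest = m_hyperspace.read_schema(table);
> >> >> >>     if (!cached || latest->generation > cached->generation)
> >> >> >>       cached = latest;
> >> >> >>     // ... create the Range object against 'cached' ...
> >> >> >>   }
> >> >> >> private:
> >> >> >>   std::mutex m_mutex;
> >> >> >>   HyperspaceClient m_hyperspace;
> >> >> >>   std::map<std::string, std::shared_ptr<Schema>> m_schemas;
> >> >> >> };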
> >> >> >> > To handle these dependencies, I propose designing the Master as an execution engine for a directed acyclic graph of operations, or operation dependency graph (ODG). Each node in the graph would represent an operation (e.g. ALTER TABLE, RECOVER RangeServer) and would contain dynamic state. Execution threads would carry out the operations by picking up nodes from the graph in topological sort order. When a RangeServer dies, the ODG execution engine would pause, a new "RECOVER RangeServer" node would get created, and the ODG would get modified to include this new node. All of the existing nodes that were dependent on that RangeServer would become dependent on this new RECOVER RangeServer node. At this point the ODG execution engine would be restarted.
> >> >> >>
> >> >> >> The same alter table arguments apply here as well. You can let the alter table proceed on hyperspace and the remaining range servers; the recovered ranges would get the right schema. Otherwise, an alter table command could take a long time (up to a few minutes) while one of the range servers is being recovered.
> >> >> >>
> >> >> >> > The Master Meta Log (MML) would essentially persist any changes to the ODG, both node state as well as structural graph changes. When the Master fails and a new one comes up, it would replay the MML to reconstruct the ODG, after which it could continue execution.
> >> >> >> >
> >> >> >> > Thoughts?
> >> >> >>
> >> >> >> It seems to me that an ODG is not absolutely required for normal Hypertable operations. I'd like to avoid over-engineering (if possible) for the first release.
> >> >> >>
> >> >> >> __Luke
