Hi Sanjit,

Here's the first half of my feedback. I'll get back to you on the RangeServer changes after I think about it a little more.

- Master needs to obtain an exclusive lock on the /servers/rsXX file in Hyperspace before it proceeds with recovery.

- Once recovery is complete, the commit log and/or Hyperspace entries should be marked "recovered" before the exclusive lock is relinquished, so that if the server gets restarted it will know that it has been removed from service and can fail without trying to load ranges that have been moved to other servers in the system.

- Each receiving RangeServer should have only a single recovery log for each range class being recovered (i.e. ROOT, METADATA, SYSTEM, USER).

- When a "player" finishes replaying its fragment, it should send over a flag indicating that it is complete. I think the "player" (not the receiving RangeServer) should report back to the master that it has finished a fragment replay. If a player dies and there is some question about whether or not the fragment got fully played, then when the new player tries to replay the fragment, the receiving RangeServer can tell the player that it has already received the fragment.

- I recommend naming the sub-operation something like RecoverServerCommitLog and passing the information about the specific type of commit log (path, etc.) into the constructor. Derivation is typically used for situations where you have special logic that needs to be added to extend a base class. I don't think that's the case here.
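Something along these lines, as a sketch only (the class and field names here are made up, not actual Hypertable code):

    // One sub-operation class, parameterized through its constructor,
    // instead of four derived classes.
    #include <string>
    #include <utility>
    #include <vector>

    enum class RangeClass { ROOT, METADATA, SYSTEM, USER };

    class RecoverServerCommitLog {
    public:
      // Everything that varies between the four recovery passes comes in
      // through the constructor: the range class being recovered, the
      // commit log path, and the operations this pass depends on.
      RecoverServerCommitLog(RangeClass range_class, std::string log_path,
                             std::vector<std::string> dependencies)
          : m_range_class(range_class), m_log_path(std::move(log_path)),
            m_dependencies(std::move(dependencies)) {}

      void execute() {
        // The common replay logic lives here once, driven by
        // m_range_class and m_log_path; no per-class overrides needed.
      }

    private:
      RangeClass m_range_class;
      std::string m_log_path;
      std::vector<std::string> m_dependencies;
    };

    // The four passes then become four instances, not four classes, e.g.:
    //   RecoverServerCommitLog(RangeClass::ROOT, "<root log path>", {});
    //   RecoverServerCommitLog(RangeClass::METADATA, "<metadata log path>",
    //                          {"RecoverServerRoot"});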
- Doug

On Thu, Sep 1, 2011 at 4:03 PM, Sanjit Jhala <[email protected]> wrote:

> Since all data in Hypertable is persisted in an underlying DFS (with
> replication), when a RangeServer dies its state can be recovered from the
> filesystem. Here is a design proposal for RangeServer failover:
>
> *High Level Failover Algorithm*
>
> 1. Master receives a server-left notification for RangeServer X and waits
> for some time, after which it declares the server dead and starts recovery.
> 2. Master looks at the RangeServer MetaLog (RSML) and Master MetaLog (MML)
> and figures out which ranges were on the failed RS and in what state.
> 3. Master looks at X's CommitLog (CL) fragments to see which RangeServers
> have local copies. Master assigns CL "players" biased towards RangeServers
> with a local copy of the fragment.
> 4. Master re-assigns ranges (round robin for now).
> 5. Master sends the lists of ranges and new locations to players and
> issues play.
> 6. Players replay CL fragments to the new range locations. Say we have
> ranges R1 .. RM and players P1 .. PN. For each replayed fragment, the
> writes for range Ri from player Pj are stored in a CellCache RiPj only.
> Once the RangeServer receives all data from Pj for range Ri, it writes the
> entire contents of the CellCache RiPj to a recovery log under
> /servers/rsY/recovery_rsX/range_i, merges RiPj into a CellCache for Ri,
> and deletes the CellCache RiPj.
> 7. The destination RangeServer tells the master it has committed the data
> from Pj in its recovery logs.
> 8. When the Master knows that all data for a range has been committed, it
> tells the destination RangeServer to flip the range live.
> 9. The RangeServer links its range recovery log for Ri into its CL, flips
> the CellCache for Ri live, schedules a major compaction for Ri, and sends
> confirmation to the Master. If the range was in the middle of a split, the
> new location reads the split log and proceeds with the split.
> 10. Steps 5-9 are repeated for the Root, Metadata, System and User ranges
> (in that order) until all ranges are recovered.
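To check my understanding of the phantom CellCache flow in steps 6-9, here is a rough sketch of the receiving side. All types and names below are made up for illustration; they are not our actual classes:

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Cell { std::string key, value; };

    // Simplified stand-in for a CellCache.
    struct CellCache {
      std::vector<Cell> cells;
      void add(Cell c) { cells.push_back(std::move(c)); }
      void merge(CellCache &&other) {
        for (auto &c : other.cells) cells.push_back(std::move(c));
      }
    };

    class PhantomRange {
    public:
      // One phantom CellCache per (range, player fragment) pair; eos
      // marks the last batch of updates for this fragment.
      void phantom_update(int fragment_id, std::vector<Cell> updates,
                          bool eos) {
        CellCache &cache = m_fragment_caches[fragment_id];
        for (auto &c : updates) cache.add(std::move(c));
        if (eos) {
          // Step 6: one write + sync of the whole fragment cache to the
          // recovery log, then merge into the range's pending cache.
          append_to_recovery_log(cache);
          m_range_cache.merge(std::move(cache));
          m_fragment_caches.erase(fragment_id);
          // Step 7: the destination server would now report the
          // committed fragment back to the master (not shown).
        }
      }

    private:
      void append_to_recovery_log(const CellCache &cache) {
        // Placeholder for the single DFS write + sync to
        // /servers/rsY/recovery_rsX/range_i.
        std::printf("sync %zu cells to recovery log\n", cache.cells.size());
      }

      std::map<int, CellCache> m_fragment_caches; // keyed by fragment id
      CellCache m_range_cache; // merged cache that goes live in step 9
    };

If that matches what you have in mind, the flip-live in step 9 then amounts to linking the recovery log into the CL and swapping m_range_cache in.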
> *Master Changes*
>
> Master will have a RecoverServer operation with 4 sub-operations:
>
> 1. RecoverServerRoot (obstructions: RecoverServerRoot/Root)
> 2. RecoverServerMetadata (dependencies: RecoverServerRoot; obstructions:
> RecoverServerMetadata)
> 3. RecoverServerSystem (dependencies: RecoverServerRoot,
> RecoverServerMetadata; obstructions: RecoverServerSystem)
> 4. RecoverServerUser (dependencies: RecoverServerRoot,
> RecoverServerMetadata, RecoverServerSystem; obstructions:
> RecoverServerUser)
>
> The logic for the "execute" step is the same for all four and can live in
> a base class called RecoverServerBase. Meta operations such as create
> table/alter table will be dependent on RecoverServer operations.
>
> Steps 1-4 above are done in the RecoverServer operation. As part of step
> 4, the RecoverServer operation creates the 4 sub-operations to recover the
> root, metadata, system and user ranges respectively, which are
> dependencies for the overall RecoverServer operation.
>
> *Range Server changes*
>
> New commands/APIs:
>
> 1. play_fragment(failed server id (X) + fragment id, mapping of ranges to
> new locations): the RangeServer starts reading this fragment and plays the
> updates to the destination RangeServers. [Maybe buffer 200K per call, with
> cumulative as well as per-range buffer limits.] If a send fails, it stops
> sending updates to the failed range and continues.
>
> 2. cancel_play(failed server id X + fragment id, locations): the master
> will call this method to inform the player not to send any updates to a
> location. This will be called in case one of the destination RangeServers
> dies during recovery.
>
> 3. phantom_fragment_update(table, range, fragment, update_list, eos):
> receive updates and write them to the phantom CellCache. When eos==true,
> append the CellCache to the recovery log in one write + sync.
>
> 4. phantom_fragment_cancel(...): called by the master in case a player
> dies and the CellCaches from Pj need to be tossed away.
>
> No changes are needed for the RSML since the recovered range is either in
> a phantom state or a live state. If it's in the phantom state and the
> RangeServer dies, then the master reassigns the recovery ranges to a new
> location and replays the CL fragments from the beginning.
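Just so we are talking about the same shapes, here are the four proposed calls written out as declarations. Only the call shapes come from the list above; the parameter types are my guesses:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>
    #include <vector>

    struct RangeSpec {
      std::string table, start_row, end_row;
      bool operator<(const RangeSpec &o) const {
        return std::tie(table, start_row, end_row) <
               std::tie(o.table, o.start_row, o.end_row);
      }
    };

    struct Cell { std::string key, value; };

    class RangeServer {
    public:
      // 1. Start replaying fragment fragment_id of the failed server,
      //    pushing updates to the new owners in range_locations.
      void play_fragment(const std::string &failed_server,
                         int32_t fragment_id,
                         const std::map<RangeSpec, std::string> &range_locations);

      // 2. Tell a player to stop sending updates to these locations
      //    (e.g. a destination died mid-recovery).
      void cancel_play(const std::string &failed_server, int32_t fragment_id,
                       const std::vector<std::string> &locations);

      // 3. Destination side: buffer updates in the phantom CellCache for
      //    (range, fragment); on eos, one write + sync to the recovery log.
      void phantom_fragment_update(const std::string &table,
                                   const RangeSpec &range,
                                   int32_t fragment_id,
                                   const std::vector<Cell> &updates,
                                   bool eos);

      // 4. Discard the phantom CellCaches received from a dead player.
      void phantom_fragment_cancel(const std::string &table,
                                   const RangeSpec &range,
                                   int32_t fragment_id);
    };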
> *Recovery failures:*
>
> - If a destination RangeServer fails, potentially all players have to
> replay to the new destination (all play operations get serialized behind
> the root, metadata and system replays). Players inform the master of any
> failed range updates, and the master will later tell the player to replay
> the fragment either to the same or to another RangeServer. The master
> maintains maps of (X, fragment id) --> players and (X, range) --> new
> location.
> - If a player dies, then the master re-assigns a new player. R1Pj .. RMPj
> are tossed away and the new player replays the fragment.
>
> Any thoughts?
> -Sanjit