Hi Sanjit,

Here's some feedback about the RangeServer changes.
1. I think you'll probably need a phantom "load" to load each range and a
"commit" to flip the phantom ranges live. You'll have to handle a possible
race condition with the commit API: if the Master issues commit, but the
RangeServer dies before it sends a response back, there will need to be
some way to determine whether or not the phantom ranges got flipped live.

2. I recommend dropping the word "fragment" from the API names for the
receiving RangeServer. Conceptually, the APIs don't deal with fragments,
they just load ranges and receive updates. For example:

  phantom_load
  phantom_update
  phantom_cancel
  phantom_commit

3. There's another race condition that you'll need to address. To flip a
set of phantom ranges live, the RangeServer needs to 1) write the ranges
to the RSML, and 2) link the recovery log into the Commit log. A simple
approach might be to link the recovery log first and then write the RSML.

- Doug

On Thu, Sep 1, 2011 at 4:03 PM, Sanjit Jhala <[email protected]> wrote:
> Since all data in Hypertable is persisted in an underlying DFS (with
> replication), when a RangeServer dies its state can be recovered from the
> filesystem. Here is a design proposal for RangeServer failover:
>
> *High Level Failover Algorithm*
>
>  1. Master receives a server-left notification for RangeServer X, waits
>     for some time, after which it declares the server dead and starts
>     recovery.
>  2. Master looks at the RangeServer MetaLog (RSML) and Master MetaLog
>     (MML) and figures out which ranges were on the failed RangeServer
>     and in what state.
>  3. Master looks at X's CommitLog (CL) fragments to see which
>     RangeServers have local copies. Master assigns CL "players," biased
>     toward RangeServers with a local copy of the fragment.
>  4. Master re-assigns ranges (round robin for now).
>  5. Master sends lists of ranges and new locations to the players and
>     issues play.
>  6. Players replay CL fragments to the new range locations. Say we have
>     ranges R1 .. RM and players P1 .. PN.
>     For each recovered fragment RiPj, all writes are stored in a
>     CellCache only. Once the RangeServer receives all data from Pj for
>     range Ri, it writes the entire contents of the CellCache RiPj to the
>     recovery log under /servers/rsY/recovery_rsX/range_i, merges RiPj
>     into a CellCache for Ri, and deletes the CellCache RiPj.
>  7. The destination RangeServer tells the Master it has committed data
>     from Pj in its recovery logs.
>  8. When the Master knows that all data for a range has been committed,
>     it tells the destination RangeServer to flip the range live.
>  9. The RangeServer links its range recovery log for Ri into its CL,
>     flips the CellCache for Ri live, schedules a major compaction for
>     Ri, and sends confirmation to the Master. If the range was in the
>     middle of a split, the new location reads the split log and proceeds
>     with the split.
> 10. Steps 5-9 are repeated for the Root, Metadata, System and User
>     ranges (in that order) until all ranges are recovered.
>
> *Master Changes*
>
> Master will have a RecoverServer operation with 4 sub-operations:
>
>  1. RecoverServerRoot (obstructions RecoverServerRoot/Root)
>  2. RecoverServerMetadata (dependencies RecoverServerRoot; obstructions
>     RecoverServerMetadata)
>  3. RecoverServerSystem (dependencies RecoverServerRoot,
>     RecoverServerMetadata; obstructions RecoverServerSystem)
>  4. RecoverServerUser (dependencies RecoverServerRoot,
>     RecoverServerMetadata, RecoverServerSystem; obstructions
>     RecoverServerUser)
>
> The logic for the "execute" step is the same for all four and can live
> in a base class called RecoverServerBase. Meta operations such as create
> table/alter table will be dependent on RecoverServer operations.
>
> Steps 1-4 of the high level algorithm above are done in the
> RecoverServer operation.
> As part of step 4, the RecoverServer operation creates 4 sub-operations
> to recover the root, metadata, system and user ranges respectively,
> which are dependencies for the overall RecoverServer operation.
>
> *Range Server changes*
>
> New commands/APIs:
>
> 1. play_fragment(failed server id (X) + fragment id, mapping of ranges
> to new locations): the RangeServer starts reading this fragment and
> plays updates to the destination RangeServers. [Maybe buffer 200K per
> call, or cumulative as well as per-range buffer limits.] If a send
> fails, it stops sending updates to the failed range and continues.
>
> 2. cancel_play(failed server id X + fragment id, locations): the Master
> will call this method to inform the player not to send any updates to a
> location. This will be called in case one of the destination
> RangeServers dies during recovery.
>
> 3. phantom_fragment_update(table, range, fragment, update_list, eos):
> receive updates and write them to a phantom CellCache. When eos==true,
> append the CellCache out to the recovery log in one write + sync.
>
> 4. phantom_fragment_cancel(...): called by the Master in case a player
> dies and the CellCaches from Pj need to be tossed away.
>
> No changes are needed for the RSML since the recovered range is either
> in a phantom state or a live state. If it's in the phantom state and the
> RangeServer dies, then the Master reassigns the recovery ranges to a new
> location and replays the CL fragments from the beginning.
>
> *Recovery failures:*
>
>  - If a destination RangeServer fails, potentially all players have to
>    replay to a new destination (all play operations get serialized
>    behind the root, metadata and system replays). Players inform the
>    Master of any failed range updates, and the Master will later tell
>    the player to replay the fragment either to the same or another
>    RangeServer. Master maintains maps of (X, fragment id) --> players
>    and (X, range) --> new location.
>  - If a player dies, then the Master re-assigns a new player. R1Pj ..
>    RMPj are tossed away and the new player replays the fragment.
>
> Any thoughts?
> -Sanjit
>
> --
> You received this message because you are subscribed to the Google
> Groups "Hypertable Development" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/hypertable-dev?hl=en.
