Since all data in Hypertable is persisted in an underlying DFS (with
replication), when a RangeServer dies, its state can be recovered from the
filesystem. Here is a design proposal for RangeServer failover:

*High Level Failover Algorithm*


   1. Master receives a server-left notification for RangeServer X and
   waits for a grace period, after which it declares the server dead and
   starts recovery
   2. Master reads the RangeServer MetaLog (RSML) and Master MetaLog (MML)
   and figures out which ranges were on the failed RangeServer and in what
   state
   3. Master looks at X's CommitLog (CL) fragments to see which RangeServers
   have local copies. Master assigns CL "players", biased towards
   RangeServers with a local copy of the fragment
   4. Master re-assigns ranges (round robin for now)
   5. Master sends lists of ranges and new locations to players and issues
   play.
   6. Players replay CL fragments to the new range locations. Say we have
   ranges R1 .. RM and players P1 .. PN. All writes received for range Ri
   from player Pj are stored only in a CellCache, RiPj. Once the destination
   RangeServer has received all of Pj's data for Ri, it writes the entire
   contents of RiPj to a recovery log under /servers/rsY/recovery_rsX/range_i,
   merges RiPj into a consolidated CellCache for Ri, and deletes RiPj (see
   the sketch after this list).
   7. The destination RangeServer tells the Master that it has committed
   the data from Pj to its recovery logs
   8. When the Master knows that all data for a range has been committed it
   tells the destination RangeServer to flip the range live.
   9. The destination RangeServer links the recovery log for Ri into its
   CL, flips the CellCache for Ri live, schedules a major compaction for Ri,
   and sends confirmation to the Master. If the range was in the middle of a
   split, the new location reads the split log and proceeds with the split.
   10. Steps 5-9 are repeated for Root, Metadata, System, and User ranges
   (in that order) until all ranges are recovered.
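
A rough C++ sketch of the step 6 bookkeeping on a destination RangeServer,
under the assumption that the structures below (KeyValue, RecoveryLog,
PhantomRangeState) are illustrative placeholders and not existing Hypertable
classes:

    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    struct KeyValue {
      std::string key, value;
    };

    // Minimal stand-in for the per-range recovery log under
    // /servers/rsY/recovery_rsX/range_i.
    class RecoveryLog {
    public:
      explicit RecoveryLog(const std::string &path)
        : m_out(path, std::ios::app) {}
      void append_and_sync(const std::vector<KeyValue> &kvs) {
        for (const auto &kv : kvs)
          m_out << kv.key << '\t' << kv.value << '\n';
        m_out.flush();  // stands in for the single write + sync
      }
    private:
      std::ofstream m_out;
    };

    // Phantom state of one recovered range Ri: one CellCache per player
    // (the RiPj caches) plus the merged CellCache for Ri.
    class PhantomRangeState {
    public:
      void add(int player_id, const KeyValue &kv) {
        m_player_caches[player_id].push_back(kv);
      }

      // Player Pj signaled end-of-stream for Ri: persist RiPj to the
      // recovery log, merge it into the cache for Ri, then drop RiPj.
      void finalize_player(int player_id, RecoveryLog &log) {
        auto it = m_player_caches.find(player_id);
        if (it == m_player_caches.end())
          return;
        log.append_and_sync(it->second);
        m_merged.insert(m_merged.end(), it->second.begin(), it->second.end());
        m_player_caches.erase(it);
      }

      // True once every registered player's cache has been finalized,
      // i.e. the range is ready to be flipped live by the Master.
      bool all_players_done() const { return m_player_caches.empty(); }

    private:
      std::map<int, std::vector<KeyValue>> m_player_caches;  // RiPj caches
      std::vector<KeyValue> m_merged;                        // cache for Ri
    };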


*Master Changes*

Master will have a RecoverServer operation with 4 sub-operations:


   1. RecoverServerRoot (obstructions: RecoverServerRoot/Root)
   2. RecoverServerMetadata (dependencies: RecoverServerRoot; obstructions:
   RecoverServerMetadata)
   3. RecoverServerSystem (dependencies: RecoverServerRoot,
   RecoverServerMetadata; obstructions: RecoverServerSystem)
   4. RecoverServerUser (dependencies: RecoverServerRoot,
   RecoverServerMetadata, RecoverServerSystem; obstructions:
   RecoverServerUser)

The logic for the "execute" step is the same for all and can be in a base
class called RecoverServerBase. Meta operations such as create table/alter
table will be dependent on RecoverServer operations.

Steps 1-4 above are done in the RecoverServer operation. As part of step 4
the RecoverServer operation creates 4 sub-operations to recover the root,
metadata, system, and user ranges respectively, which are dependencies for
the overall RecoverServer operation, as sketched below.
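
A possible C++ shape for these operations; RecoverServerBase and the four
operation names come from this proposal, but the constructor and the
dependency/obstruction fields are assumptions for illustration only:

    #include <string>
    #include <utility>
    #include <vector>

    class RecoverServerBase {
    public:
      RecoverServerBase(std::vector<std::string> dependencies,
                        std::vector<std::string> obstructions)
        : m_dependencies(std::move(dependencies)),
          m_obstructions(std::move(obstructions)) {}
      virtual ~RecoverServerBase() = default;

      // Shared "execute" logic: assign players, issue play, wait for the
      // ranges to be committed, then flip them live (steps 5-9 above).
      virtual void execute() { /* common replay logic lives here */ }

    protected:
      std::vector<std::string> m_dependencies;
      std::vector<std::string> m_obstructions;
    };

    struct RecoverServerRoot : RecoverServerBase {
      RecoverServerRoot()
        : RecoverServerBase({}, {"RecoverServerRoot", "Root"}) {}
    };

    struct RecoverServerMetadata : RecoverServerBase {
      RecoverServerMetadata()
        : RecoverServerBase({"RecoverServerRoot"},
                            {"RecoverServerMetadata"}) {}
    };

    struct RecoverServerSystem : RecoverServerBase {
      RecoverServerSystem()
        : RecoverServerBase({"RecoverServerRoot", "RecoverServerMetadata"},
                            {"RecoverServerSystem"}) {}
    };

    struct RecoverServerUser : RecoverServerBase {
      RecoverServerUser()
        : RecoverServerBase({"RecoverServerRoot", "RecoverServerMetadata",
                             "RecoverServerSystem"},
                            {"RecoverServerUser"}) {}
    };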

*RangeServer Changes*

New commands/APIs

1. play_fragment(failed server id (X) + fragment id, mapping of ranges to
new locations): the RangeServer starts reading this fragment and plays the
updates to the destination RangeServers. [Maybe buffer 200K per call, with
cumulative as well as per-range buffer limits.] If a send fails, it stops
sending updates for that range and continues with the rest.

2. cancel_play(failed server id X + fragment id, locations): the Master
calls this method to tell the player not to send any more updates to the
given locations. This will be called in case one of the destination
RangeServers dies during recovery.

3. phantom_fragment_update(table, range, fragment, update_list, eos):
receive updates and write them to the phantom CellCache. When eos==true,
append the CellCache to the recovery log in one write + sync.

4. phantom_fragment_cancel(...): called by the Master in case a player dies
and the CellCaches built from Pj need to be tossed away. One possible shape
for these four calls is sketched below.
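
A rough C++ interface sketch of the four calls above; the parameter types
(FragmentId, RangeSpec, Location, Update) are placeholders, not the actual
Hypertable protocol definitions:

    #include <cstdint>
    #include <string>
    #include <vector>

    using FragmentId = std::uint32_t;
    using Location   = std::string;  // e.g. "rs42"

    struct RangeSpec {
      std::string table;
      std::string start_row, end_row;
    };

    struct RangeAssignment {
      RangeSpec range;
      Location new_location;
    };

    struct Update {
      std::string key, value;
    };

    class RangeServerRecoveryInterface {
    public:
      virtual ~RangeServerRecoveryInterface() = default;

      // 1. Replay one CL fragment of failed server X to the new locations.
      virtual void play_fragment(const Location &failed_server,
                                 FragmentId fragment,
                                 const std::vector<RangeAssignment> &ranges) = 0;

      // 2. Stop sending updates to the given locations (a destination died).
      virtual void cancel_play(const Location &failed_server,
                               FragmentId fragment,
                               const std::vector<Location> &locations) = 0;

      // 3. Receive updates into the phantom CellCache for (range, fragment);
      //    when eos is true, append the cache to the recovery log and sync.
      virtual void phantom_fragment_update(const std::string &table,
                                           const RangeSpec &range,
                                           FragmentId fragment,
                                           const std::vector<Update> &updates,
                                           bool eos) = 0;

      // 4. Discard the phantom CellCaches built from a dead player's fragment.
      virtual void phantom_fragment_cancel(const Location &failed_server,
                                           FragmentId fragment) = 0;
    };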

No changes are needed to the RSML, since a recovered range is either in a
phantom state or a live state (a minimal sketch follows). If it is in the
phantom state and the destination RangeServer dies, the Master reassigns the
recovery ranges to a new location and replays the CL fragments from the
beginning.
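
A minimal sketch of those two states, with placeholder names; because a
phantom range is rebuilt entirely from CL fragment replay, losing it only
means the Master re-assigns the range and replay starts over:

    // Placeholder enum, not an existing Hypertable type.
    enum class RecoveredRangeState {
      PHANTOM,  // phantom CellCache being filled from CL fragment replay
      LIVE      // recovery log linked into the CL, range serving traffic
    };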

*Recovery Failures*


   - If a destination RangeServer fails, potentially all players have to
   replay to a new destination (all play operations get serialized behind
   the root, metadata, and system replays). Players inform the Master of any
   failed range updates, and the Master will later tell the player to replay
   the fragment to either the same or another RangeServer. The Master
   maintains maps of (X, fragment id) --> players and (X, range) --> new
   location (see the sketch after this list).
   - If a player dies, the Master re-assigns a new player. R1Pj .. RMPj are
   tossed away and the new player replays the fragment.
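
A rough C++ sketch of that Master-side bookkeeping; RecoveryTracker and its
member names are assumptions for illustration, not existing Master code:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using ServerId   = std::string;    // the failed server X, e.g. "rs17"
    using FragmentId = std::uint32_t;
    using Location   = std::string;    // a live RangeServer
    using RangeId    = std::string;    // e.g. "METADATA[..foo)"

    class RecoveryTracker {
    public:
      // (X, fragment id) --> players currently replaying that fragment.
      void assign_player(const ServerId &failed, FragmentId frag,
                         const Location &player) {
        m_fragment_players[{failed, frag}].push_back(player);
      }

      // (X, range) --> new location the range was assigned to.
      void assign_range(const ServerId &failed, const RangeId &range,
                        const Location &dest) {
        m_range_locations[{failed, range}] = dest;
      }

      // If a player dies, look up its fragments' entries so a new player
      // can be assigned; if a destination dies, look up the ranges that
      // need a new location and a fresh replay.
      const std::vector<Location> &players_for(const ServerId &failed,
                                               FragmentId frag) {
        return m_fragment_players[{failed, frag}];
      }
      const Location &location_of(const ServerId &failed,
                                  const RangeId &range) {
        return m_range_locations[{failed, range}];
      }

    private:
      std::map<std::pair<ServerId, FragmentId>, std::vector<Location>>
          m_fragment_players;
      std::map<std::pair<ServerId, RangeId>, Location> m_range_locations;
    };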



Any thoughts?
-Sanjit
