Hi Sanjit,

Here's some feedback about the RangeServer changes....

1. I think you'll probably need a phantom "load" to load each range and a
"commit" to flip the phantom ranges live.  You'll have to handle a possible
race condition with the commit API.  If the Master issues commit, but the
RangeServer dies before it sends a response back, there will need to be some
way to determine whether or not the phantom ranges got flipped live.

2. I recommend dropping the word "fragment" from the API names for the
receiving RangeServer.  Conceptually, the APIs don't deal with fragments,
they just load ranges and receive updates.  For example:

phantom_load
phantom_update
phantom_cancel
phantom_commit

3. There's another race condition that you'll need to address.  To flip a
set of phantom ranges live, the RangeServer needs to 1) write the Ranges to
the RSML, and 2) link the recovery log into the Commit log.  A simple
approach might be to link the recovery log first and then write the RSML.

- Doug

On Thu, Sep 1, 2011 at 4:03 PM, Sanjit Jhala <[email protected]> wrote:

> Since all data in Hypertable is persisted in an underlying DFS (with
> replication), when a RangeServer dies its state can be recovered from the
> filesystem. Here is a design proposal for RangeServer failover:
>
> *High Level Failover Algorithm*
>
>
>    1. Master receives server left notification for RangeServer X, waits
>    for some time after which it declares the server dead and starts recovery
>    2. Master looks at RangeServer MetaLog (RSML) and Master MetaLog (MML)
>    and figures out which ranges were on the failed RS and in what state
>    3. Master looks at X's CommitLog (CL) fragments  to see which range
>    servers have local copies. Master assigns CL "players" biased towards
>    RangeServers with a local copy of the fragment
>    4. Master re-assigns ranges (round robin for now)
>    5. Master sends lists of ranges and new locations to players and issues
>    play.
>    6. Players replay CL frags to new range locations. Say we have ranges
>    R1 .. RM, players P1 .. PN. For each recovered fragment RiPj all writes are
>    stored in a CellCache only. Once the RangeServer receives all data from Pj
>    for range Ri it writes the entire contents of the CellCache RiPj to 
> recovery
>    log under /servers/rsY/recovery_rsX/range_i and merges RiPj into a 
> CellCache
>    for Ri and deletes the CellCache RiPj.
>    7. RangeServer X tells master it has committed data from Pj in its
>    recovery logs
>    8. When the Master knows that all data for a range has been committed
>    it tells the destination RangeServer to flip the range live.
>    9. RangeServer links its range recovery log for Ri into its CL, flips
>    the CellCache for Ri live and schedules a major compaction for Ri and sends
>    confirmation to Master. If the range was in the middle of a split the new
>    location reads the split log and proceeds with the split.
>    10. Steps 5-9 are repeated for Root, Metadata, System and User ranges
>    (in that order) until all ranges are recovered
>
> *
> *
>
> *Master Changes*
>
> Master will have a RecoverServer operation with 4 sub-operations:
>
>
>    - 1. RecoverServerRoot (obstructions RecoverServerRoot/Root)
>    - 2. RecoverServerMetadata (dependencies RecoverServerRoot,
>    obstructions RecoverServerMetadata)
>    - 3. RecoverServerSystem (dependencies RecoverServerRoot,
>    RecoverServerMetadata obstructions RecoverServerSystem)
>    - 4. RecoverServerUser (dependencies RecoverServerRoot,
>    RecoverServerMetadata, RecoverServerSystem obstructions RecoverServerUser)
>
> The logic for the "execute" step is the same for all and can be in a base
> class called RecoverServerBase. Meta operations such as create table/alter
> table will be dependent on RecoverServer operations.
>
> Steps 1-4 above are done in the RecoverServer operation. As part of step 4
> the RecoverServer operation creates 4 sub operations to recover root,
> metadata, system and user ranges respectively, which are dependencies for
> the overall RecoverServer operation
>
> *Range Server changes *
>
> New commands/APIs
>
> 1. play_fragment(failed server id (X) + fragment id, mapping of ranges to
> new locations). The RangeServer starts reading this fragment and plays
> updates to the destination rangeservers. [Maybe buffer 200K per call or
> cumulative as well as per range buffer limits.] If a send fails it stops
> sending updates to the failed range and continues.
>
> 2. cancel_play(failed server id X + fragment id, locations): master will
> call this method to inform the player not to send any updates to a location.
> This will be called in case one of the destination range servers dies during
> recovery.
>
> 2. phantom_fragment_update(table, range, fragment, update_list, eos):
> receive updates and write them to phantom CellCache. When eos==true append
> CellCache out to recovery log in one write + sync
>
> 3. phantom_fragment_cancel(...): called by master in case a player dies and
> the CellCaches from Pj need to be tossed away.
>
> No changes needed for the RSML since the recovered range is either in a
> phantom state or live state. If its in the phantom state and the RangeServer
> dies then the master reassigns the recovery ranges to a new location and
> replays the CL fragments from the beginning
>
> *Recovery failures:*
>
>
>    - If destination RangeServer fails, potentially all players have to
>    replay to new destination (all play operations get serialized behind
>    root, metadata, system replays). Players inform the master of any failed
>    range updates and the master will later tell the player to replay the
>    fragment either to the same or another RangeSever. Master maintains map of
>    (X, fragment id) --> players and (X, range) --> new location
>    - If player dies then the master re-assigns a new player. R1Pj .. RMPj
>    are tossed away and the new player replays the fragment.
>
>
>
> Any thoughts?
> -Sanjit
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Hypertable Development" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/hypertable-dev?hl=en.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Hypertable Development" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/hypertable-dev?hl=en.

Reply via email to