According to the roadmap, it seems this feature will be available in
version 0.9.6.0 by Nov 15. Is that right?

I think this feature would be one of the major issues in deciding whether
a big-data company would choose to use Hypertable.



On Sep 8, 12:40 am, Doug Judd <[email protected]> wrote:
> Hi Sanjit,
>
> Here's some feedback about the RangeServer changes....
>
> 1. I think you'll probably need a phantom "load" to load each range and a
> "commit" to flip the phantom ranges live.  You'll have to handle a possible
> race condition with the commit API.  If the Master issues commit, but the
> RangeServer dies before it sends a response back, there will need to be some
> way to determine whether or not the phantom ranges got flipped live.
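The commit race above can be handled by making the commit idempotent and giving the Master a way to query whether the flip happened. A minimal sketch (not Hypertable code; class and method names are illustrative):

```python
# Toy model of the phantom_commit race: if the Master's commit response
# is lost, it must be able to re-issue the commit or ask the RangeServer
# whether the phantom ranges were already flipped live.

class PhantomRangeServer:
    """Toy RangeServer holding phantom and live ranges by range id."""

    def __init__(self):
        self.phantom = set()   # ranges loaded but not yet live
        self.live = set()      # ranges flipped live

    def phantom_load(self, range_id):
        self.phantom.add(range_id)

    def phantom_commit(self, range_id):
        # Idempotent: committing an already-live range is a no-op, so the
        # Master can safely re-issue commit after a lost response.
        if range_id in self.phantom:
            self.phantom.discard(range_id)
            self.live.add(range_id)
        return range_id in self.live

    def is_live(self, range_id):
        # Lets the Master determine whether the flip happened even when
        # the original commit response never arrived.
        return range_id in self.live
```

With this shape the Master's recovery after its own restart is just "re-send commit, or check is_live".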
>
> 2. I recommend dropping the word "fragment" from the API names for the
> receiving RangeServer.  Conceptually, the APIs don't deal with fragments,
> they just load ranges and receive updates.  For example:
>
> phantom_load
> phantom_update
> phantom_cancel
> phantom_commit
>
> 3. There's another race condition that you'll need to address.  To flip a
> set of phantom ranges live, the RangeServer needs to 1) write the Ranges to
> the RSML, and 2) link the recovery log into the Commit log.  A simple
> approach might be to link the recovery log first and then write the RSML.
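Why that ordering is safe can be sketched as follows (illustrative code, not the implementation): if the server crashes between the two steps, the RSML still records the range as phantom, so the Master simply re-drives recovery, and the extra commit-log link is harmless because replay rebuilds the range state anyway.

```python
# Sketch of the crash-safe flip order: 1) link the recovery log into the
# commit log, 2) write the range as live in the RSML.

def flip_live(range_id, commit_log, rsml, crash_after_link=False):
    """Flip a phantom range live in the crash-safe order."""
    commit_log.append(("link", range_id))   # step 1: link recovery log
    if crash_after_link:
        return "crashed"                    # RSML still says phantom
    rsml[range_id] = "live"                 # step 2: persist in RSML
    return "live"
```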
>
> - Doug
>
> On Thu, Sep 1, 2011 at 4:03 PM, Sanjit Jhala <[email protected]> wrote:
> > Since all data in Hypertable is persisted in an underlying DFS (with
> > replication), when a RangeServer dies its state can be recovered from the
> > filesystem. Here is a design proposal for RangeServer failover:
>
> > *High Level Failover Algorithm*
>
> >    1. Master receives server left notification for RangeServer X, waits
> >    for some time after which it declares the server dead and starts recovery
> >    2. Master looks at RangeServer MetaLog (RSML) and Master MetaLog (MML)
> >    and figures out which ranges were on the failed RS and in what state
> >    3. Master looks at X's CommitLog (CL) fragments to see which range
> >    servers have local copies. Master assigns CL "players" biased towards
> >    RangeServers with a local copy of the fragment
> >    4. Master re-assigns ranges (round robin for now)
> >    5. Master sends lists of ranges and new locations to players and issues
> >    play.
> >    6. Players replay CL frags to new range locations. Say we have ranges
> >    R1 .. RM and players P1 .. PN. For each recovered fragment RiPj, all
> >    writes are stored in a CellCache only. Once the RangeServer receives
> >    all data from Pj for range Ri, it writes the entire contents of the
> >    CellCache RiPj to a recovery log under
> >    /servers/rsY/recovery_rsX/range_i, merges RiPj into a CellCache for
> >    Ri, and deletes the CellCache RiPj.
> >    7. RangeServer X tells master it has committed data from Pj in its
> >    recovery logs
> >    8. When the Master knows that all data for a range has been committed
> >    it tells the destination RangeServer to flip the range live.
> >    9. RangeServer links its range recovery log for Ri into its CL, flips
> >    the CellCache for Ri live and schedules a major compaction for Ri and 
> > sends
> >    confirmation to Master. If the range was in the middle of a split the new
> >    location reads the split log and proceeds with the split.
> >    10. Steps 5-9 are repeated for Root, Metadata, System and User ranges
> >    (in that order) until all ranges are recovered
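The class-by-class ordering in step 10 can be sketched as a simple driver loop (names are illustrative, not Hypertable APIs): each range class must finish recovering before the next begins.

```python
# Sketch of the recovery ordering: Root first, then Metadata, System,
# and User ranges, each class fully recovered before the next starts.

RECOVERY_ORDER = ["root", "metadata", "system", "user"]

def recover_server(ranges_by_class, replay_range):
    """ranges_by_class: dict mapping class -> list of range ids.
    replay_range(range_id) flips one range live; returns True on success."""
    recovered = []
    for cls in RECOVERY_ORDER:
        for range_id in ranges_by_class.get(cls, []):
            if not replay_range(range_id):
                raise RuntimeError("recovery failed for %s" % range_id)
            recovered.append(range_id)
    return recovered
```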
>
>
> > *Master Changes*
>
> > Master will have a RecoverServer operation with 4 sub-operations:
>
> >    - 1. RecoverServerRoot (obstructions: RecoverServerRoot/Root)
> >    - 2. RecoverServerMetadata (dependencies: RecoverServerRoot;
> >    obstructions: RecoverServerMetadata)
> >    - 3. RecoverServerSystem (dependencies: RecoverServerRoot,
> >    RecoverServerMetadata; obstructions: RecoverServerSystem)
> >    - 4. RecoverServerUser (dependencies: RecoverServerRoot,
> >    RecoverServerMetadata, RecoverServerSystem; obstructions:
> >    RecoverServerUser)
>
> > The logic for the "execute" step is the same for all and can be in a base
> > class called RecoverServerBase. Meta operations such as create table/alter
> > table will be dependent on RecoverServer operations.
>
> > Steps 1-4 above are done in the RecoverServer operation. As part of step 4
> > the RecoverServer operation creates 4 sub-operations to recover root,
> > metadata, system and user ranges respectively, which are dependencies for
> > the overall RecoverServer operation.
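The dependency structure among the four sub-operations can be sketched with a tiny scheduler (the operation names come from the proposal; the scheduler itself is illustrative):

```python
# Sketch of the sub-operation dependency graph and a topological order
# over it: Root, then Metadata, then System, then User.

DEPS = {
    "RecoverServerRoot": [],
    "RecoverServerMetadata": ["RecoverServerRoot"],
    "RecoverServerSystem": ["RecoverServerRoot", "RecoverServerMetadata"],
    "RecoverServerUser": ["RecoverServerRoot", "RecoverServerMetadata",
                          "RecoverServerSystem"],
}

def schedule(deps):
    """Return an execution order that respects the dependencies."""
    order, done = [], set()
    while len(done) < len(deps):
        ready = [op for op in deps
                 if op not in done and all(d in done for d in deps[op])]
        if not ready:
            raise RuntimeError("dependency cycle")
        for op in sorted(ready):
            order.append(op)
            done.add(op)
    return order
```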
>
> > *Range Server changes *
>
> > New commands/APIs
>
> > 1. play_fragment(failed server id (X) + fragment id, mapping of ranges to
> > new locations). The RangeServer starts reading this fragment and plays
> > updates to the destination rangeservers. [Maybe buffer 200K per call or
> > cumulative as well as per range buffer limits.] If a send fails it stops
> > sending updates to the failed range and continues.
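The "stop sending to a failed range but keep replaying the rest" behaviour can be sketched as follows (function and parameter names are illustrative, not the proposed API):

```python
# Sketch of a player replaying one fragment: updates stream to the new
# destination RangeServers, and a send failure stops further sends to
# that range only; the failed set is what gets reported to the Master.

def play_fragment(updates, destinations, send):
    """updates: list of (range_id, cells) in log order.
    destinations: dict range_id -> destination server.
    send(server, range_id, cells) returns False on failure."""
    failed = set()
    for range_id, cells in updates:
        if range_id in failed:
            continue                    # range already failed; skip it
        server = destinations[range_id]
        if not send(server, range_id, cells):
            failed.add(range_id)        # Master will re-drive this range
    return failed
```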
>
> > 2. cancel_play(failed server id X + fragment id, locations): master will
> > call this method to inform the player not to send any updates to a location.
> > This will be called in case one of the destination range servers dies during
> > recovery.
>
> > 3. phantom_fragment_update(table, range, fragment, update_list, eos):
> > receives updates and writes them to the phantom CellCache. When eos==true,
> > appends the CellCache to the recovery log in one write + sync.
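A sketch of that eos handling (the cache and log interfaces here are assumptions, not the proposed API): updates accumulate per (range, fragment), and end-of-stream triggers a single append of the whole cache.

```python
# Sketch of phantom_fragment_update: buffer updates in a per-(range,
# fragment) phantom CellCache; on eos, flush the whole cache to the
# recovery log in one write, then drop the per-fragment cache.

class PhantomCache:
    def __init__(self, recovery_log):
        self.caches = {}                  # (range_id, fragment) -> cells
        self.recovery_log = recovery_log  # list standing in for log+sync

    def phantom_update(self, range_id, fragment, cells, eos):
        key = (range_id, fragment)
        self.caches.setdefault(key, []).extend(cells)
        if eos:
            # One write + sync for the entire CellCache, then drop it.
            self.recovery_log.append((key, list(self.caches[key])))
            del self.caches[key]
```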
>
> > 4. phantom_fragment_cancel(...): called by the master in case a player dies
> > and the CellCaches from Pj need to be tossed away.
>
> > No changes are needed for the RSML since a recovered range is either in a
> > phantom state or a live state. If it's in the phantom state and the RangeServer
> > dies, then the master reassigns the recovery ranges to a new location and
> > replays the CL fragments from the beginning.
>
> > *Recovery failures:*
>
> >    - If a destination RangeServer fails, potentially all players have to
> >    replay to a new destination (all play operations get serialized behind
> >    root, metadata, system replays). Players inform the master of any failed
> >    range updates and the master will later tell the player to replay the
> >    fragment either to the same or another RangeServer. Master maintains maps
> >    of (X, fragment id) --> players and (X, range) --> new location.
> >    - If a player dies then the master re-assigns a new player. R1Pj .. RMPj
> >    are tossed away and the new player replays the fragment.
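The Master's bookkeeping for those two failure cases can be sketched as the two maps mentioned above (class and method names are illustrative): on a player death, its fragment assignments move to a new player and its partial per-range caches are discarded before replay restarts.

```python
# Sketch of the Master's recovery state: (failed server X, fragment) ->
# player, and (X, range) -> new destination. reassign_player returns the
# fragments whose replay must restart on the new player.

class RecoveryState:
    def __init__(self):
        self.fragment_player = {}   # (X, fragment id) -> player
        self.range_location = {}    # (X, range) -> destination RangeServer

    def reassign_player(self, failed_player, new_player):
        moved = []
        for key, player in self.fragment_player.items():
            if player == failed_player:
                self.fragment_player[key] = new_player
                moved.append(key)   # these fragments replay from scratch
        return moved
```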
>
> > Any thoughts?
> > -Sanjit
>
> >  --
> > You received this message because you are subscribed to the Google Groups
> > "Hypertable Development" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> >http://groups.google.com/group/hypertable-dev?hl=en.
