We're actively working on the failover implementation now.  It took us a
little longer to work out the kinks with the load balancer, so that date may
slip a little.  RangeServer failover is our #1 priority and will be
available as soon as possible.

- Doug

On Tue, Oct 25, 2011 at 6:53 AM, Yingfeng <[email protected]> wrote:

> According to the roadmap, it seems this feature will be available in
> version 0.9.6.0 by Nov 15. Is that right?
>
> I think this feature would be one of the major issues in whether a
> big-data company would choose to use Hypertable.
>
> On Sep 8, 12:40 am, Doug Judd <[email protected]> wrote:
> > Hi Sanjit,
> >
> > Here's some feedback about the RangeServer changes....
> >
> > 1. I think you'll probably need a phantom "load" to load each range and a
> > "commit" to flip the phantom ranges live.  You'll have to handle a possible
> > race condition with the commit API.  If the Master issues commit, but the
> > RangeServer dies before it sends a response back, there will need to be some
> > way to determine whether or not the phantom ranges got flipped live.
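> >
> > One way to handle it (a rough sketch with hypothetical names, not the
> > actual API): make the commit idempotent and keyed on a recovery id, so the
> > Master can simply re-issue it after a timeout; if the whole RangeServer
> > died, the Master can instead consult the RSML to see whether the ranges
> > went live.
> >
> >   #include <mutex>
> >   #include <set>
> >   #include <string>
> >
> >   class RangeServer {
> >   public:
> >     // Safe for the Master to retry: a commit that already happened
> >     // (even if the first ack was lost) is just acknowledged again.
> >     void phantom_commit(const std::string &recovery_id) {
> >       std::lock_guard<std::mutex> lock(m_mutex);
> >       if (m_committed.count(recovery_id))
> >         return;                               // duplicate; re-ack as no-op
> >       flip_phantom_ranges_live(recovery_id);  // durable flip (see point 3)
> >       m_committed.insert(recovery_id);
> >     }
> >   private:
> >     void flip_phantom_ranges_live(const std::string &recovery_id);
> >     std::mutex m_mutex;
> >     std::set<std::string> m_committed;
> >   };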
> >
> > 2. I recommend dropping the word "fragment" from the API names for the
> > receiving RangeServer.  Conceptually, the APIs don't deal with fragments,
> > they just load ranges and receive updates.  For example:
> >
> > phantom_load
> > phantom_update
> > phantom_cancel
> > phantom_commit
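> >
> > In other words, something like (hypothetical signatures, just to
> > illustrate the naming):
> >
> >   // Receiving-RangeServer API: no fragments in sight; the server only
> >   // loads phantom ranges, absorbs updates, and commits or cancels them.
> >   void phantom_load(const RangeSpec &range, const PhantomState &state);
> >   void phantom_update(const RangeSpec &range, const UpdateBatch &updates);
> >   void phantom_cancel(const RecoveryId &recovery_id);
> >   void phantom_commit(const RecoveryId &recovery_id);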
> >
> > 3. There's another race condition that you'll need to address.  To flip a
> > set of phantom ranges live, the RangeServer needs to 1) write the Ranges to
> > the RSML, and 2) link the recovery log into the Commit log.  A simple
> > approach might be to link the recovery log first and then write the RSML.
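> >
> > Linking first is crash-safe because a crash between the two steps leaves a
> > linked-but-unrecorded recovery log, which replay tolerates; the reverse
> > order could mark ranges live in the RSML before their updates are
> > reachable from the Commit log. A sketch (hypothetical names):
> >
> >   void RangeServer::flip_phantom_ranges_live(const std::string &recovery_id) {
> >     // 2) first: linking is safe to redo after a crash, since the RSML
> >     // still shows the ranges as phantom and recovery simply restarts.
> >     m_commit_log->link_log(recovery_log_path(recovery_id));
> >     // 1) second: the RSML write is the commit point; once it lands, the
> >     // ranges are live and their data is reachable from the Commit log.
> >     m_rsml_writer->record_ranges_live(recovery_id);
> >   }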
> >
> > - Doug
> >
> >
> > On Thu, Sep 1, 2011 at 4:03 PM, Sanjit Jhala <[email protected]> wrote:
> > > Since all data in Hypertable is persisted in an underlying DFS (with
> > > replication), when a RangeServer dies its state can be recovered from the
> > > filesystem. Here is a design proposal for RangeServer failover:
> >
> > > *High Level Failover Algorithm*
> >
> > >    1. Master receives a server-left notification for RangeServer X, waits
> > >    for some time, after which it declares the server dead and starts
> > >    recovery.
> > >    2. Master looks at the RangeServer MetaLog (RSML) and Master MetaLog
> > >    (MML) and figures out which ranges were on the failed RS and in what
> > >    state.
> > >    3. Master looks at X's CommitLog (CL) fragments to see which range
> > >    servers have local copies. Master assigns CL "players" biased towards
> > >    RangeServers with a local copy of the fragment.
> > >    4. Master re-assigns ranges (round robin for now).
> > >    5. Master sends lists of ranges and new locations to players and issues
> > >    play.
> > >    6. Players replay CL fragments to the new range locations. Say we have
> > >    ranges R1 .. RM and players P1 .. PN. For each recovered fragment RiPj,
> > >    all writes are stored in a CellCache only. Once the RangeServer receives
> > >    all data from Pj for range Ri, it writes the entire contents of the
> > >    CellCache RiPj to a recovery log under /servers/rsY/recovery_rsX/range_i,
> > >    merges RiPj into a CellCache for Ri, and deletes the CellCache RiPj (see
> > >    the sketch after this list).
> > >    7. The destination RangeServer tells the Master that it has committed
> > >    data from Pj in its recovery logs.
> > >    8. When the Master knows that all data for a range has been committed,
> > >    it tells the destination RangeServer to flip the range live.
> > >    9. The RangeServer links its range recovery log for Ri into its CL,
> > >    flips the CellCache for Ri live, schedules a major compaction for Ri,
> > >    and sends confirmation to the Master. If the range was in the middle of
> > >    a split, the new location reads the split log and proceeds with the
> > >    split.
> > >    10. Steps 5-9 are repeated for the Root, Metadata, System and User
> > >    ranges (in that order) until all ranges are recovered.
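> >
> > > A rough sketch of the per-fragment buffering in step 6 (stand-in types;
> > > the real CellCache and log classes differ):
> > >
> > >   #include <map>
> > >   #include <string>
> > >   #include <utility>
> > >   #include <vector>
> > >
> > >   using RangeId = int;
> > >   using PlayerId = int;
> > >   struct Cell { std::string key, value; };
> > >   using CellCache = std::vector<Cell>;  // stand-in for the real CellCache
> > >
> > >   // Writes the whole RiPj cache to /servers/rsY/recovery_rsX/range_i
> > >   // in one append + sync (implementation elided).
> > >   void append_and_sync_recovery_log(RangeId ri, const CellCache &cache);
> > >
> > >   std::map<std::pair<RangeId, PlayerId>, CellCache> phantom;  // RiPj
> > >   std::map<RangeId, CellCache> merged;  // per-range CellCache for Ri
> > >
> > >   // Called for each batch replayed by player Pj for range Ri; eos marks
> > >   // the end of Pj's data for Ri.
> > >   void phantom_update(RangeId ri, PlayerId pj, const CellCache &batch,
> > >                       bool eos) {
> > >     auto &cache = phantom[{ri, pj}];
> > >     cache.insert(cache.end(), batch.begin(), batch.end());
> > >     if (eos) {
> > >       append_and_sync_recovery_log(ri, cache);
> > >       merged[ri].insert(merged[ri].end(), cache.begin(), cache.end());
> > >       phantom.erase(std::make_pair(ri, pj));
> > >     }
> > >   }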
> >
> > > *Master Changes*
> >
> > > Master will have a RecoverServer operation with 4 sub-operations:
> >
> > >    1. RecoverServerRoot (obstructions: RecoverServerRoot/Root)
> > >    2. RecoverServerMetadata (dependencies: RecoverServerRoot;
> > >    obstructions: RecoverServerMetadata)
> > >    3. RecoverServerSystem (dependencies: RecoverServerRoot,
> > >    RecoverServerMetadata; obstructions: RecoverServerSystem)
> > >    4. RecoverServerUser (dependencies: RecoverServerRoot,
> > >    RecoverServerMetadata, RecoverServerSystem; obstructions:
> > >    RecoverServerUser)
> >
> > > The logic for the "execute" step is the same for all and can be in a base
> > > class called RecoverServerBase. Meta operations such as create table/alter
> > > table will be dependent on RecoverServer operations.
> >
> > > Steps 1-4 above are done in the RecoverServer operation. As part of step 4,
> > > the RecoverServer operation creates 4 sub-operations to recover the root,
> > > metadata, system and user ranges respectively, which are dependencies for
> > > the overall RecoverServer operation.
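> >
> > > A sketch of how the dependencies and obstructions might be wired up when
> > > the sub-operations are created (illustrative calls; the real Operation
> > > interface may differ):
> > >
> > >   // Each sub-operation declares what it waits on (dependencies) and
> > >   // what it blocks (obstructions), so e.g. RecoverServerUser cannot
> > >   // run until the root, metadata and system ranges are recovered.
> > >   op_root->add_obstruction("RecoverServerRoot/Root");
> > >   op_metadata->add_dependency("RecoverServerRoot");
> > >   op_metadata->add_obstruction("RecoverServerMetadata");
> > >   op_system->add_dependency("RecoverServerRoot");
> > >   op_system->add_dependency("RecoverServerMetadata");
> > >   op_system->add_obstruction("RecoverServerSystem");
> > >   op_user->add_dependency("RecoverServerRoot");
> > >   op_user->add_dependency("RecoverServerMetadata");
> > >   op_user->add_dependency("RecoverServerSystem");
> > >   op_user->add_obstruction("RecoverServerUser");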
> >
> > > *RangeServer Changes*
> >
> > > New commands/APIs
> >
> > > 1. play_fragment(failed server id (X) + fragment id, mapping of ranges to
> > > new locations): the RangeServer starts reading this fragment and plays
> > > updates to the destination RangeServers. [Maybe buffer 200K per call, with
> > > cumulative as well as per-range buffer limits.] If a send fails, it stops
> > > sending updates to the failed range and continues.
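> >
> > > A rough sketch of the player side (hypothetical helper types and RPCs;
> > > buffering details elided):
> > >
> > >   // Read the CL fragment sequentially and route each update to the new
> > >   // location of its range; a range whose destination fails is dropped
> > >   // from the replay, and the failures are reported to the Master.
> > >   void play_fragment(const ServerId &x, FragmentId frag,
> > >                      const std::map<RangeId, Location> &new_locations) {
> > >     std::set<RangeId> failed;
> > >     FragmentReader reader(x, frag);
> > >     Update update;
> > >     while (reader.next(&update)) {
> > >       if (failed.count(update.range))
> > >         continue;                        // stop sending to failed range
> > >       if (!send_phantom_update(new_locations.at(update.range), update))
> > >         failed.insert(update.range);
> > >     }
> > >     report_failed_ranges(x, frag, failed);  // Master decides on replays
> > >   }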
> >
> > > 2. cancel_play(failed server id X + fragment id, locations): the Master
> > > will call this method to inform the player not to send any updates to a
> > > location. This will be called in case one of the destination RangeServers
> > > dies during recovery.
> >
> > > 3. phantom_fragment_update(table, range, fragment, update_list, eos):
> > > receive updates and write them to the phantom CellCache. When eos==true,
> > > append the CellCache out to the recovery log in one write + sync.
> >
> > > 4. phantom_fragment_cancel(...): called by the Master in case a player
> > > dies and the CellCaches from Pj need to be tossed away.
> >
> > > No changes are needed for the RSML, since a recovered range is either in
> > > a phantom state or a live state. If it's in the phantom state and the
> > > RangeServer dies, then the Master reassigns the recovery ranges to a new
> > > location and replays the CL fragments from the beginning.
> >
> > > *Recovery failures:*
> >
> > >    - If a destination RangeServer fails, potentially all players have to
> > >    replay to a new destination (all play operations get serialized behind
> > >    the root, metadata and system replays). Players inform the Master of
> > >    any failed range updates, and the Master will later tell the player to
> > >    replay the fragment either to the same or another RangeServer. The
> > >    Master maintains maps of (X, fragment id) --> players and (X, range)
> > >    --> new location (sketched after this list).
> > >    - If a player dies, then the Master re-assigns a new player. R1Pj ..
> > >    RMPj are tossed away and the new player replays the fragment.
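> >
> > > A sketch of the bookkeeping the Master would keep for this (illustrative
> > > types only):
> > >
> > >   #include <map>
> > >   #include <string>
> > >   #include <utility>
> > >
> > >   using ServerId = std::string;   // the failed server, X
> > >   using FragmentId = int;
> > >   using RangeId = int;
> > >   using PlayerId = std::string;
> > >   using Location = std::string;
> > >
> > >   // (X, fragment id) --> player currently replaying that fragment
> > >   std::map<std::pair<ServerId, FragmentId>, PlayerId> fragment_players;
> > >   // (X, range) --> new location the range is being recovered to
> > >   std::map<std::pair<ServerId, RangeId>, Location> range_locations;
> > >
> > >   // On a destination failure: update the affected entries in
> > >   // range_locations and re-issue play_fragment to each player in
> > >   // fragment_players with the new location. On a player failure:
> > >   // replace the entry in fragment_players; the destinations toss
> > >   // R1Pj .. RMPj and the new player replays the fragment.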
> >
> > > Any thoughts?
> > > -Sanjit
> >