I think Luke is right about going for a simple design, but under the
circumstances I think we need some kind of framework to reason about
recovery; otherwise it could end up as a huge, unworkable pile of
recursion and if-thens.

As a DAG, I think this solution is at most a 4-partite graph: the first
level is all RootRange recoveries, the next is MetaRange recoveries, the
next is UserRange recoveries, and the last is DropTable/AlterTable
operations, etc. Except at the last level, a node depends on all items in
the previous level being completed (e.g. you can't recover a MetaRange
until all RootRange recoveries are done). At the last two levels it might
make sense at some point to add some kind of prioritization (e.g. user
range X is not being used, so recover it last), but that is probably too
much at this point.
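
To make the level rule concrete, here is a rough C++ sketch (all class
and function names are hypothetical, not actual Hypertable code):

  // Sketch of the 4-partite recovery DAG: each operation at level N
  // runs only after every operation at level N-1 has completed.
  // Simplified: here even the last level waits on the whole previous
  // level; per the note above, meta ops could instead depend only on
  // the specific ranges they touch.
  #include <cstdio>
  #include <functional>
  #include <string>
  #include <vector>

  enum class Level { RootRange, MetaRange, UserRange, MetaOp };

  struct Op {
    Level level;
    std::string name;
    std::function<void()> run;
  };

  // Execute level by level; within a level, ops are independent and
  // could be dispatched to a thread pool, joining before advancing.
  void execute_leveled(std::vector<Op> &ops) {
    for (Level lvl : {Level::RootRange, Level::MetaRange,
                      Level::UserRange, Level::MetaOp})
      for (Op &op : ops)
        if (op.level == lvl)
          op.run();
  }

  int main() {
    std::vector<Op> ops{
        {Level::RootRange, "recover ROOT range",
         [] { std::puts("recover ROOT range"); }},
        {Level::MetaRange, "recover META range",
         [] { std::puts("recover META range"); }},
        {Level::UserRange, "recover user range",
         [] { std::puts("recover user range"); }},
        {Level::MetaOp, "ALTER TABLE foo",
         [] { std::puts("ALTER TABLE foo"); }}};
    execute_leveled(ops);
  }

The prioritization idea would then just be a sort of the ops within the
last two levels.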

-Sanjit

On Fri, Jul 31, 2009 at 3:31 PM, Luke <[email protected]> wrote:

>
> On Fri, Jul 31, 2009 at 3:20 PM, Sanjit Jhala <[email protected]> wrote:
> > For alter table, it's not merely an atomic update to Hyperspace. The
> > master updates the schema on Hyperspace and then sends "update schema"
> > commands to all the RangeServers and waits for them to ack before
> > returning. This avoids unnecessary per-RangeServer traffic to
> > Hyperspace. Since alter table is expected to be a fairly infrequent
> > operation, I don't think it's unreasonable for users to have to wait
> > if execution is blocked by RangeServer recovery. This is pretty much
> > the same as drop table.
>
> I meant you'd still do atomic update-schema commands on range servers.
> It's just that new ranges would get the schema from the same range
> server if there are other ranges of the table on that range server. If
> the new ranges are from a recovered range server and happen to belong
> to a new table, you do incur some additional Hyperspace traffic.
>
> My reasoning is that we should do the simplest thing that works,
> especially for infrequent operations.
>
> __Luke
>
> >
> > -Sanjit
> >
> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
> >>
> >> The Master is getting more and more like a workqueue and jobtracker
> >> :) It seems advantageous to actually create a separate general
> >> server to manage all the tasks, which could be used to schedule
> >> map/reduce tasks in the future as well.
> >>
> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd <[email protected]> wrote:
> >> > The Master is responsible for orchestrating recovery from
> >> > RangeServer failures as well as carrying out meta operations in
> >> > response to commands such as CREATE TABLE, ALTER TABLE, and DROP
> >> > TABLE.  These meta operations are relatively straightforward except
> >> > in the face of RangeServer failure.  When this happens, any
> >> > in-progress meta operation that is dependent on the failed
> >> > RangeServer needs to block until the RangeServer has been
> >> > recovered.  If another RangeServer that is involved in the recovery
> >> > goes down, there is now another recovery operation that needs to
> >> > get carried out.  The Master can quickly start building up a fairly
> >> > complex set of operation dependencies.
> >> >
> >> > The Master is also responsible for moving ranges from one
> >> > RangeServer to another when load across the RangeServers gets out
> >> > of balance.  If a MOVE RANGE operation is in progress when, say, an
> >> > ALTER TABLE request arrives, and the range being moved is part of
> >> > the table specified in the ALTER TABLE request, then the ALTER
> >> > TABLE operation needs to wait until the MOVE RANGE operation is
> >> > complete before it can continue.  Also, if two ALTER TABLE requests
> >> > arrive at the Master at the same time, then they should get carried
> >> > out in sequential order, with one of the ALTER TABLE operations
> >> > depending on the completion of the other operation.
> >>
> >> I'm not sure about this particular case. For alter table while
> >> ranges are split/moved, it seems to me that it works as long as you
> >> update the schema in Hyperspace/range servers atomically. The
> >> split/moved ranges on the new destination server will get the right
> >> schema. Also, two alter-table operations can overlap in many cases,
> >> as long as the schema updates on Hyperspace/range servers are
> >> atomic. For cases where alter tables on the same table need to be
> >> sequenced, it's actually not too much to ask the application to do
> >> the sequencing, as alter table is not really a frequent operation
> >> (otherwise, they should go with a generic column family and go nuts
> >> on qualifiers).
> >>
> >> > To handle these dependencies, I propose designing the Master as an
> >> > execution engine for a directed acyclic graph of operations, or
> >> > operation dependency graph (ODG).  Each node in the graph would
> >> > represent an operation (e.g. ALTER TABLE, RECOVER RangeServer) and
> >> > would contain dynamic state.  Execution threads would carry out the
> >> > operations by picking up nodes from the graph in topological sort
> >> > order.  When a RangeServer dies, the ODG execution engine would
> >> > pause, a new "RECOVER RangeServer" node would get created, and the
> >> > ODG would get modified to include this new node.  All of the
> >> > existing nodes that were dependent on that RangeServer would become
> >> > dependent on this new RECOVER RangeServer node.  At this point the
> >> > ODG execution engine would be restarted.
> >>
> >> The same alter-table arguments can apply here as well. You can let
> >> the alter table proceed on Hyperspace and the remaining range
> >> servers. The recovered ranges would get the right schema. Otherwise,
> >> an alter table command can take a long time (up to a few minutes)
> >> while one of the range servers is being recovered.
> >>
> >> > The Master Meta Log (MML) would essentially persist any changes to
> >> > the ODG, both node state as well as structural graph changes.  When
> >> > the Master fails and a new one comes up, it would replay the MML to
> >> > reconstruct the ODG, after which it could continue execution.
> >> >
> >> > Thoughts?
> >>
> >> It seems to me that an ODG is not absolutely required for normal
> >> Hypertable operations. I'd like to avoid over-engineering (if
> >> possible) for the first release.
> >>
> >> __Luke