I'm ambivalent about this as well, especially considering that an ODG library would be useful for a future map/reduce framework. But I think rewrite/redo is underrated, especially a rewrite after gaining real-world experience with the code in the wild.
If you decide to go down the ODG route, we should avoid accidental dependencies simply because we have the ODG hammer, e.g., having alter table/drop table depend on recovery, as both alter table and drop table can be implemented easily in a nonblocking fashion.

__Luke

On Mon, Aug 3, 2009 at 2:17 PM, Doug Judd <[email protected]> wrote:
> Because the approach you suggested will have to get thrown away and re-done
> when we want to optimize it for availability. It's better to figure out the
> correct long-term abstraction that models the problem in a clean way and
> then do a simpler implementation using that abstraction. Otherwise you end
> up writing a bunch of code and spend many releases debugging it, only to
> throw the whole thing away and start over when it's time to do it right.
>
> - Doug
>
> On Mon, Aug 3, 2009 at 1:56 PM, Luke <[email protected]> wrote:
>>
>> On Mon, Aug 3, 2009 at 1:48 PM, Doug Judd <[email protected]> wrote:
>> > You haven't convinced me that this approach is simpler. For example,
>> > reconstruct_graph() is just assembling vertices and edges together to
>> > form a graph; what exactly does reconstruct_states() do?
>>
>> Scan the MML and initialize the state of the FSM. Anyway, I'm not
>> advocating for a specific FSM approach now. I'm just wondering why we
>> can't finish implementing recovery in a straightforward fashion as I
>> mentioned earlier and worry about generic tasks later. Your ODG should
>> work fine for generic task management, but it doesn't really make
>> implementing recovery faster, simpler or more robust. We really need
>> recovery like yesterday.
>>
>> __Luke
>>
>> >
>> > On Mon, Aug 3, 2009 at 12:23 PM, Luke <[email protected]> wrote:
>> >>
>> >> On Mon, Aug 3, 2009 at 12:08 PM, Doug Judd <[email protected]> wrote:
>> >> > The finite state machine implementation does not seem simpler to me.
>> >> > Creating a separate class for each task seems like the right thing
>> >> > to do because it encapsulates all of the task logic and associated
>> >> > state in one place.
>> >>
>> >> You can create a separate class for each task in any approach. The ODG
>> >> approach would force you to do so even for trivial cases.
>> >>
>> >> > Also you wouldn't traverse the graph every time for each interrupt;
>> >> > there would be a secondary index on the graph that would allow you
>> >> > to look up tasks (nodes) by the tables they depend on. Plus all of
>> >> > the graph manipulation logic will be encapsulated into its own
>> >> > library, completely decoupled from the task logic.
>> >>
>> >> The secondary index doesn't buy you much if the majority of the tasks
>> >> depend upon some event. You still need to call the library explicitly
>> >> to set up the dependencies for each task (either in the task
>> >> constructor or by the master.)
>> >>
>> >> > When the Master comes up it needs to go through a recovery process
>> >> > to reconstruct its state in order to pick up where it left off when
>> >> > it went down. With the ODG approach here's the logic:
>> >> >
>> >> > replay_log_and_reconstruct_graph();
>> >> > start_execution_engine();
>> >> >
>> >> > What does the pseudo code for the finite state machine approach look
>> >> > like?
>> >> >
>> >> > - Doug
>> >>
>> >> replay_log_and_reconstruct_states();
>> >> start_fsm();
>> >>
>> >> I still haven't seen a specific example to show why the ODG approach
>> >> actually helps 'recovery' etc.
>> >>
>> >> __Luke
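For concreteness, here is a minimal sketch of what replay_log_and_reconstruct_states() could do. All names here (MmlEntry, RecoveryState, the record types) are illustrative assumptions, not the actual Hypertable MML format:

    #include <set>
    #include <string>
    #include <vector>

    // Illustrative MML entry; the real log format is more involved.
    struct MmlEntry {
      enum Type { START_RECOVERY, RESTART_RECOVERY, DONE_RECOVER_RANGE } type;
      int rid;                             // recovery id
      std::set<std::string> down_servers;  // for START/RESTART
      std::string range;                   // for DONE_RECOVER_RANGE
    };

    // Hypothetical FSM state rebuilt from the log: the latest recovery id,
    // the servers still down, and the ranges already recovered.
    struct RecoveryState {
      int current_rid = -1;
      std::set<std::string> down_servers;
      std::set<std::string> recovered_ranges;
    };

    RecoveryState reconstruct_states(const std::vector<MmlEntry>& log) {
      RecoveryState s;
      for (const MmlEntry& e : log) {
        switch (e.type) {
        case MmlEntry::START_RECOVERY:
        case MmlEntry::RESTART_RECOVERY:
          s.current_rid = e.rid;           // later entries supersede earlier
          s.down_servers = e.down_servers;
          break;
        case MmlEntry::DONE_RECOVER_RANGE:
          // Ranges already replayed carry over across restarts.
          s.recovered_ranges.insert(e.range);
          break;
        }
      }
      return s;
    }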
>> >>
>> >> > On Mon, Aug 3, 2009 at 10:42 AM, Luke <[email protected]> wrote:
>> >> >>
>> >> >> I still don't see why the ODG design simplifies things. The argument
>> >> >> for ODG would be more convincing if you can walk me through a simple
>> >> >> example that demonstrates the necessity and/or simplicity. For
>> >> >> example, the recovery handling method can be straightforwardly
>> >> >> implemented with this rough pseudo code:
>> >> >>
>> >> >> recover(range_server) {
>> >> >>   if (down_range_servers.size() == 0)
>> >> >>     start_recovery(range_server);
>> >> >>   else
>> >> >>     restart_recovery(range_server);
>> >> >> }
>> >> >>
>> >> >> start_recovery(range_server) {
>> >> >>   down_range_servers.add(range_server);
>> >> >>   rid = get_recovery_id();
>> >> >>   mml->log_start_recovery(rid, down_range_servers);
>> >> >>   do_recovery(rid);
>> >> >> }
>> >> >>
>> >> >> restart_recovery(range_server) {
>> >> >>   down_range_servers.add(range_server);
>> >> >>   old_rid = current_recovery_id;
>> >> >>   rid = get_recovery_id();
>> >> >>   mml->log_restart_recovery(old_rid, rid, down_range_servers);
>> >> >>   do_recovery(rid);
>> >> >> }
>> >> >>
>> >> >> do_recovery(rid) {
>> >> >>   [root_range, meta_ranges, user_ranges] = get_down_ranges(rid);
>> >> >>
>> >> >>   if (root_range)
>> >> >>     recover_range(rid, root_range);
>> >> >>
>> >> >>   parallel foreach(range : meta_ranges)
>> >> >>     recover_range(rid, range);
>> >> >>
>> >> >>   parallel foreach(range : user_ranges)
>> >> >>     recover_range(rid, range);
>> >> >> }
>> >> >>
>> >> >> recover_range(rid, range) {
>> >> >>   check_recovery_state(rid);
>> >> >>   dest = range_scheduler->get_range_server(rid, range);
>> >> >>   replay_load_range(dest, range);
>> >> >>   replay_commit_log(dest, range);
>> >> >>   replay_commit(dest);
>> >> >>   mml->log_done_recover_range(rid, range);
>> >> >> }
>> >> >>
>> >> >> check_recovery_state(rid) {
>> >> >>   if (rid != current_recovery_id)
>> >> >>     throw RecoveryCancelled("recovery restarted");
>> >> >> }
>> >> >>
>> >> >> I just don't see a DAG approach getting any simpler: you'd have to
>> >> >> create a separate class and set up dependencies explicitly for each
>> >> >> task; you'll need to traverse the entire graph for every
>> >> >> event/interrupt; you'll probably need to cancel a subtree explicitly
>> >> >> for any event/interrupt. In the simple approach above, the
>> >> >> dependencies are implicitly expressed in the running stack, and an
>> >> >> exception can abort the entire stack as needed when a restart is
>> >> >> detected.
>> >> >>
>> >> >> For tasks like move and split, if the destination range server is
>> >> >> not available or can't finish due to related metadata being
>> >> >> unavailable, simply cancel the task and the originating range server
>> >> >> can retry later (the RSML will ensure durability of the transaction.)
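The "parallel foreach" and the recovery-id cancellation check in the quoted pseudo code could be realized roughly as below. This is a sketch under assumed names (recover_range, current_recovery_id), not actual Hypertable code:

    #include <atomic>
    #include <future>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::atomic<int> current_recovery_id{0};  // bumped on every restart

    struct RecoveryCancelled : std::runtime_error {
      using std::runtime_error::runtime_error;
    };

    void check_recovery_state(int rid) {
      if (rid != current_recovery_id.load())
        throw RecoveryCancelled("recovery restarted");
    }

    void recover_range(int rid, const std::string& range) {
      check_recovery_state(rid);  // abort this pass if a restart happened
      // ... replay_load_range / replay_commit_log / replay_commit ...
    }

    // "parallel foreach": recover all ranges concurrently; if any worker
    // was cancelled, rethrow once so the whole recovery pass unwinds.
    void recover_all(int rid, const std::vector<std::string>& ranges) {
      std::vector<std::future<void>> tasks;
      for (const auto& range : ranges)
        tasks.push_back(
            std::async(std::launch::async, recover_range, rid, range));
      bool cancelled = false;
      for (auto& t : tasks) {
        try { t.get(); }                       // waits; surfaces exceptions
        catch (const RecoveryCancelled&) { cancelled = true; }
      }
      if (cancelled)
        throw RecoveryCancelled("recovery restarted");
    }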
>> >> >>
>> >> >> On Sun, Aug 2, 2009 at 12:36 AM, Doug Judd <[email protected]> wrote:
>> >> >> > The way I arrived at this design was via an attempt to simplify
>> >> >> > things. The Master is essentially an interrupt-driven system. It
>> >> >> > sits idle until it gets asynchronously interrupted by a user
>> >> >> > command, RangeServer failure, or split report. Each one of these
>> >> >> > interrupts creates some operation that needs to be performed.
>> >> >> > Operations that are in-progress or future operations may need to
>> >> >> > be halted and made dependent on this new operation (e.g.
>> >> >> > RangeServer recovery). This is what prompted the dependency graph
>> >> >> > idea, which leads to the following server design:
>> >> >> >
>> >> >> > while (true) {
>> >> >> >   wait_for_interrupt();
>> >> >> >   pause_execution_engine();
>> >> >> >   modify_dependency_graph();
>> >> >> >   restart_execution_engine();
>> >> >> > }
>> >> >> >
>> >> >> > It doesn't get cleaner and more elegant than that. All prior
>> >> >> > design attempts led to a big pile of if statements and while
>> >> >> > loops. I can't think of a simpler design than this.
>> >> >>
>> >> >> Even the simplest task would require a separate class and explicit
>> >> >> dependency setup in the ODG design. If a task is not immutable like
>> >> >> the typical map-reduce jobs, i.e., if it depends on master state
>> >> >> (e.g., recovery or any nonblocking version of commands, etc.), the
>> >> >> application logic is much more obfuscated.
>> >> >>
>> >> >> IMHO, a potentially better abstraction is a finite state machine.
>> >> >> You can specify/declare a state machine in a simple language, e.g.:
>> >> >>
>> >> >> event: range_server_down(range_server):
>> >> >>   state DEFAULT: { start_recovery(); } -> RECOVERING1
>> >> >>   state RECOVERING1: { restart_recovery(); }
>> >> >>   ...
>> >> >>
>> >> >> and a compiler can validate and generate an FSM that you can
>> >> >> instantiate and basically do:
>> >> >>
>> >> >> while ((event = wait_for_event())) {
>> >> >>   fsm->handle(event);
>> >> >> }
>> >> >>
>> >> >> But I think that the hand-coded simple approach is easy to implement
>> >> >> and understand, and probably more appropriate for the first release.
>> >> >>
>> >> >> __Luke
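A hand-coded version of that FSM, before any compiler/DSL exists, could be as simple as a dispatch table keyed by (state, event). A minimal sketch; the state and event names come from the example above, everything else is assumed:

    #include <functional>
    #include <map>
    #include <utility>

    enum class State { DEFAULT, RECOVERING1 };
    enum class Event { RANGE_SERVER_DOWN };

    struct Fsm {
      State state = State::DEFAULT;
      // (state, event) -> action; the action returns the next state.
      std::map<std::pair<State, Event>, std::function<State()>> table;

      void handle(Event ev) {
        auto it = table.find({state, ev});
        if (it != table.end())
          state = it->second();  // run the action, then transition
      }
    };

    int main() {
      Fsm fsm;
      fsm.table[{State::DEFAULT, Event::RANGE_SERVER_DOWN}] = [] {
        // start_recovery();     // assumed action
        return State::RECOVERING1;
      };
      fsm.table[{State::RECOVERING1, Event::RANGE_SERVER_DOWN}] = [] {
        // restart_recovery();   // assumed action
        return State::RECOVERING1;  // stay in RECOVERING1
      };
      fsm.handle(Event::RANGE_SERVER_DOWN);  // DEFAULT -> RECOVERING1
      fsm.handle(Event::RANGE_SERVER_DOWN);  // restart, stays RECOVERING1
    }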
>> >> >>
>> >> >> > - Doug
>> >> >> >
>> >> >> > On Sat, Aug 1, 2009 at 2:07 PM, Luke <[email protected]> wrote:
>> >> >> >>
>> >> >> >> A DAG is only required when you want to do transactions with
>> >> >> >> dynamic dependencies (only known at runtime) in parallel. Let's
>> >> >> >> examine the examples:
>> >> >> >>
>> >> >> >> On Sat, Aug 1, 2009 at 10:29 AM, Doug Judd <[email protected]> wrote:
>> >> >> >> > I agree that for the current version of the ALTER TABLE
>> >> >> >> > command, you could probably implement it in such a way that you
>> >> >> >> > wouldn't need this mechanism. However, there are several other
>> >> >> >> > situations that come to mind where this isn't the case:
>> >> >> >> >
>> >> >> >> > 1. We've talked about adding a 'force' option which would cause
>> >> >> >> > a major compaction to occur on each range after its schema was
>> >> >> >> > altered so that the space taken up by dropped columns could get
>> >> >> >> > immediately freed up.
>> >> >> >> > 2. Tyler has been asking for the ability to delete certain
>> >> >> >> > portions of a table (e.g. drop all rows less than a certain
>> >> >> >> > value or all cells before some timestamp). To do this we're
>> >> >> >> > going to add a COMPACT command (see issue 307).
>> >> >> >> > 3. At some point we'll be adding a SNAPSHOT command. Since
>> >> >> >> > CellStores can be shared, cheap table snapshots can be made by
>> >> >> >> > first performing a compaction and then copying the METADATA
>> >> >> >> > entries for the table being snapshotted.
>> >> >> >> >
>> >> >> >> > In all three of these situations, if a RangeServer goes down in
>> >> >> >> > the middle of the operation, when one of the participating
>> >> >> >> > ranges in the operation is recovered on a new RangeServer, it
>> >> >> >> > won't have enough state to finish the operation.
>> >> >> >>
>> >> >> >> So far all these commands can be implemented using a single
>> >> >> >> transaction. Let's use compact as an example:
>> >> >> >>
>> >> >> >> 1. The master writes begin compact(list of ranges) -> id in the
>> >> >> >> MML.
>> >> >> >> 2. For each finished range the master will get an ack and write
>> >> >> >> done_compact(range) in the MML.
>> >> >> >> 3. If any of the range servers dies, the master can restart the
>> >> >> >> transaction: restart compact(remaining ranges) -> id in the MML.
>> >> >> >> 4. The above steps (2-3) can happen multiple times, when multiple
>> >> >> >> range servers die.
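The begin/done/restart records of that compact transaction might play out as below. A sketch with invented record names that just prints what would be persisted; it is not the real MML interface:

    #include <cstdio>
    #include <set>
    #include <string>

    // Hypothetical MML records for the compact transaction (names invented
    // for illustration); here they print what would be written to the log.
    struct Mml {
      void begin_compact(int id, const std::set<std::string>& rs) {
        std::printf("MML: begin_compact id=%d nranges=%zu\n", id, rs.size());
      }
      void done_compact(int id, const std::string& r) {
        std::printf("MML: done_compact id=%d range=%s\n", id, r.c_str());
      }
      void restart_compact(int id, const std::set<std::string>& rs) {
        std::printf("MML: restart_compact id=%d remaining=%zu\n", id, rs.size());
      }
    };

    int main() {
      Mml mml;
      std::set<std::string> remaining = {"r1", "r2", "r3"};
      mml.begin_compact(42, remaining);    // step 1
      mml.done_compact(42, "r1");          // step 2: ack for r1
      remaining.erase("r1");
      mml.restart_compact(42, remaining);  // step 3: a range server died
      for (const auto& r : remaining)      // step 2 again for the rest
        mml.done_compact(42, r);
      // Replaying these records after a master crash reproduces the state.
    }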
>> >> >> >>
>> >> >> >> > The place where maintaining a complex dependency graph is
>> >> >> >> > probably the most important is in the RangeServer recovery
>> >> >> >> > process itself. As Sanjit pointed out, the recovery process
>> >> >> >> > needs to happen in roughly three stages: ROOT range recovery,
>> >> >> >> > METADATA range recovery, and lastly USER range recovery. If you
>> >> >> >> > drill down into the RangeServer recovery process, it can be
>> >> >> >> > broken down into the following four dependent steps:
>> >> >> >> >
>> >> >> >> > 1. Replay the RSML and assign all the ranges to new RangeServers
>> >> >> >> > 2. Invoke RangeServer::replay_load_range() for each range
>> >> >> >> > assigned to a RangeServer
>> >> >> >> > 3. Replay commit log fragments
>> >> >> >> > 4. Invoke RangeServer::replay_commit() on each participating
>> >> >> >> > RangeServer
>> >> >> >> >
>> >> >> >> > At any point in this process a RangeServer could go down, and
>> >> >> >> > if it is one of the RangeServers participating in the recovery,
>> >> >> >> > then things get complicated. As a general rule, you can't carry
>> >> >> >> > out step #4 for USER ranges unless the METADATA ranges are
>> >> >> >> > available, and you can't carry out step #4 for METADATA ranges
>> >> >> >> > until the ROOT range is available.
>> >> >> >>
>> >> >> >> Since we only handle one recovery transaction at a time, with the
>> >> >> >> 4 steps being required sub-transactions already known:
>> >> >> >>
>> >> >> >> 1. When a range server dies, the master writes begin
>> >> >> >> recover(dead_range_server1) -> id in the MML.
>> >> >> >> 2. For each range in dead_range_server1's metalog, the master
>> >> >> >> invokes replay_load_range and writes done_replay_load_range(range)
>> >> >> >> in the MML.
>> >> >> >> 3. If another range server dies, the master writes restart
>> >> >> >> recovery(dead_range_server1, dead_range_server2) -> id in the MML.
>> >> >> >> The master can then compute the union of the ranges that need to
>> >> >> >> be recovered by examining the done_replay_load_range entries so
>> >> >> >> far and the ranges in both dead range servers' metalogs, and
>> >> >> >> basically does step 2 again.
>> >> >> >> 4. If replay_load_range finishes, the master can proceed to the
>> >> >> >> next steps. Any range server death during the following steps
>> >> >> >> would cause a restart (step 3.)
>> >> >> >> 5. The above steps can happen multiple times, hopefully with
>> >> >> >> fewer ranges to recover each time (otherwise, it's a cascading
>> >> >> >> failure that would cause the whole cluster to die. We will
>> >> >> >> probably need to send out alerts when the rate of death
>> >> >> >> accelerates.)
>> >> >> >>
>> >> >> >> > Ideally the recovery of each USER range should only be
>> >> >> >> > dependent on the recovery of the METADATA range that holds its
>> >> >> >> > corresponding entry. Getting the recovery process implemented
>> >> >> >> > optimally will be important for maintaining good 99th
>> >> >> >> > percentile latency in large (thousands of nodes) deployments
>> >> >> >> > running on a cloud provider like EC2.
>> >> >> >>
>> >> >> >> This is the only place a DAG is useful (but not required) so far.
>> >> >> >> And you don't need to persist it in the MML; you can reconstruct
>> >> >> >> the DAG by scanning the MML if the master dies, because you know
>> >> >> >> the dependencies beforehand (meta before user.)
>> >> >> >>
>> >> >> >> > Given all these dependencies, a directed acyclic graph seems
>> >> >> >> > like a clean and natural representation.
>> >> >> >>
>> >> >> >> A DAG is a natural representation for parallel transactions with
>> >> >> >> dependencies, but it really doesn't buy you much at this stage,
>> >> >> >> mostly because the dependencies are not created dynamically by
>> >> >> >> the user at runtime (like in general map/reduce jobs.) The bulk
>> >> >> >> of the work is to get the recovery transaction right. I'd be very
>> >> >> >> happy if we can do the parallel replay_load_range and
>> >> >> >> replay_commit_log like the Baidu guys did. Fine-grained parallel
>> >> >> >> range recovery (multiple (meta, user) transactions) is nice to
>> >> >> >> have and a little premature for the first release, IMHO.
>> >> >> >>
>> >> >> >> __Luke
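Reconstructing the recovery dependencies by scanning the MML, as suggested above, could look roughly like this. Because the ROOT -> METADATA -> USER ordering is known beforehand, no edges ever need to be persisted; all names below are assumptions:

    #include <string>
    #include <vector>

    // A down range and its stage; the fixed ordering ROOT -> METADATA ->
    // USER is known ahead of time, so no explicit edges are stored.
    enum class Stage { ROOT, METADATA, USER };

    struct RangeTask {
      std::string range;
      Stage stage;
    };

    // Rebuild the work list from the MML: ranges not yet marked done,
    // grouped by stage. Executing the buckets in enum order reproduces
    // the dependencies.
    std::vector<std::vector<RangeTask>>
    reconstruct_recovery_plan(const std::vector<RangeTask>& down_ranges,
                              const std::vector<std::string>& done_ranges) {
      std::vector<std::vector<RangeTask>> plan(3);  // one bucket per stage
      for (const RangeTask& t : down_ranges) {
        bool done = false;
        for (const std::string& d : done_ranges)
          if (d == t.range) { done = true; break; }
        if (!done)
          plan[static_cast<int>(t.stage)].push_back(t);
      }
      return plan;  // recover plan[0], then plan[1], then plan[2]
    }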
>> >> >> >>
>> >> >> >> >
>> >> >> >> > - Doug
>> >> >> >> >
>> >> >> >> > On Fri, Jul 31, 2009 at 3:49 PM, Luke <[email protected]> wrote:
>> >> >> >> >>
>> >> >> >> >> On Fri, Jul 31, 2009 at 3:30 PM, Doug Judd <[email protected]> wrote:
>> >> >> >> >> > The two examples that I gave are just two of many possible
>> >> >> >> >> > scenarios:
>> >> >> >> >> >
>> >> >> >> >> > - What if you are in the middle of DROP TABLE and a range
>> >> >> >> >> > server goes down and then in the middle of that recovery
>> >> >> >> >> > another range server goes down, etc.?
>> >> >> >> >>
>> >> >> >> >> It would still work if the schema is updated in Hyperspace and
>> >> >> >> >> the remaining range servers, as long as the destination range
>> >> >> >> >> servers are doing the right thing: ignoring ranges that belong
>> >> >> >> >> to a nonexistent table. The range server can cache negative
>> >> >> >> >> results of schema queries for a while to avoid hitting
>> >> >> >> >> Hyperspace too much on unseen tables (range servers should do
>> >> >> >> >> that already to avoid an accidental DDoS on Hyperspace.)
>> >> >> >> >>
>> >> >> >> >> > - What if someone does an ALTER TABLE and then before it is
>> >> >> >> >> > complete, one of the participating range servers goes down,
>> >> >> >> >> > and then before the range server recovers, one of the range
>> >> >> >> >> > servers participating in the recovery goes down, and then
>> >> >> >> >> > someone does a DROP TABLE ...
>> >> >> >> >>
>> >> >> >> >> It really doesn't matter; Hyperspace keeps the schema and the
>> >> >> >> >> remaining range servers get the memo: alter table can finish
>> >> >> >> >> on the remaining range servers. Same thing for the drop table.
>> >> >> >> >> The key is to populate the correct schema lazily when the
>> >> >> >> >> ranges are recovered.
>> >> >> >> >>
>> >> >> >> >> > - What if the Master is in the middle of a MOVE operation
>> >> >> >> >> > and then the range server that the range is being moved to
>> >> >> >> >> > goes down before it reports back, then recovery starts and
>> >> >> >> >> > then someone does an ALTER TABLE?
>> >> >> >> >> >
>> >> >> >> >> > - What if there are 16 overlapping range server failures?
>> >> >> >> >>
>> >> >> >> >> As long as the operations are idempotent, all you need to do
>> >> >> >> >> is update the schema on Hyperspace and the remaining range
>> >> >> >> >> servers atomically.
>> >> >> >> >>
>> >> >> >> >> > Trying to program for all of the scenarios would be a total
>> >> >> >> >> > nightmare without some sort of dependency abstraction. Just
>> >> >> >> >> > because these scenarios are rare doesn't mean we can ignore
>> >> >> >> >> > them. We have to come up with a clean design that covers all
>> >> >> >> >> > the possible scenarios.
>> >> >> >> >>
>> >> >> >> >> The cleanest design doesn't have to handle these dependencies
>> >> >> >> >> if possible. The ODG and related stuff would be a debugging
>> >> >> >> >> nightmare.
>> >> >> >> >>
>> >> >> >> >> > This isn't over-engineering. It is clean engineering. There
>> >> >> >> >> > is no way to cleanly design the Master without some sort of
>> >> >> >> >> > dependency abstraction.
>> >> >> >> >>
>> >> >> >> >> I still haven't seen a compelling example yet. I'll change my
>> >> >> >> >> mind when I see it.
>> >> >> >> >>
>> >> >> >> >> __Luke
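The negative-result schema cache mentioned above (to keep range servers from hammering Hyperspace for nonexistent tables) could be a simple TTL map. A sketch, not the existing RangeServer code:

    #include <chrono>
    #include <map>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Remembers "table does not exist" answers for ttl seconds so repeated
    // requests for unseen tables don't all hit Hyperspace.
    class NegativeSchemaCache {
     public:
      explicit NegativeSchemaCache(std::chrono::seconds ttl) : ttl_(ttl) {}

      bool known_missing(const std::string& table) {
        auto it = misses_.find(table);
        if (it == misses_.end())
          return false;
        if (Clock::now() - it->second > ttl_) {  // entry expired
          misses_.erase(it);
          return false;
        }
        return true;  // still fresh: skip the Hyperspace lookup
      }

      void record_miss(const std::string& table) {
        misses_[table] = Clock::now();
      }

     private:
      std::chrono::seconds ttl_;
      std::map<std::string, Clock::time_point> misses_;
    };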
>> >> >> >> >>
>> >> >> >> >> > - Doug
>> >> >> >> >> >
>> >> >> >> >> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Master is getting more and more like a workqueue and
>> >> >> >> >> >> jobtracker :) It seems to be advantageous to actually
>> >> >> >> >> >> create a separate general server to manage all the tasks,
>> >> >> >> >> >> which can be used to schedule map/reduce tasks in the
>> >> >> >> >> >> future as well.
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd <[email protected]> wrote:
>> >> >> >> >> >> > The Master is responsible for orchestrating recovery from
>> >> >> >> >> >> > RangeServer failures as well as carrying out meta
>> >> >> >> >> >> > operations in response to commands such as CREATE TABLE,
>> >> >> >> >> >> > ALTER TABLE, and DROP TABLE. These meta operations are
>> >> >> >> >> >> > relatively straightforward except in the face of
>> >> >> >> >> >> > RangeServer failure. When this happens, any in-progress
>> >> >> >> >> >> > meta operation that is dependent on the failed
>> >> >> >> >> >> > RangeServer needs to block until the RangeServer has been
>> >> >> >> >> >> > recovered. If another RangeServer that is involved in the
>> >> >> >> >> >> > recovery goes down, there is now another recovery
>> >> >> >> >> >> > operation that needs to get carried out. The Master can
>> >> >> >> >> >> > quickly start building up a fairly complex set of
>> >> >> >> >> >> > operation dependencies.
>> >> >> >> >> >> >
>> >> >> >> >> >> > The Master is also responsible for moving ranges from one
>> >> >> >> >> >> > RangeServer to another when load across the RangeServers
>> >> >> >> >> >> > gets out of balance. If a MOVE RANGE operation is in
>> >> >> >> >> >> > progress when, say, an ALTER TABLE request arrives, and
>> >> >> >> >> >> > the range being moved is part of the table specified in
>> >> >> >> >> >> > the ALTER TABLE request, then the ALTER TABLE operation
>> >> >> >> >> >> > needs to wait until the MOVE RANGE operation is complete
>> >> >> >> >> >> > before it can continue. Also, if two ALTER TABLE requests
>> >> >> >> >> >> > arrive at the Master at the same time, then they should
>> >> >> >> >> >> > get carried out in sequential order, with one of the
>> >> >> >> >> >> > ALTER TABLE operations depending on the completion of the
>> >> >> >> >> >> > other operation.
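Sequencing two ALTER TABLEs on the same table, as described in the quoted paragraph, amounts to a per-table FIFO of operations. A minimal sketch with assumed types; a real master would run the operations on worker threads rather than inline:

    #include <deque>
    #include <functional>
    #include <map>
    #include <string>

    // Serializes operations per table: an operation runs only when it
    // reaches the front of its table's queue, so two ALTER TABLEs on the
    // same table execute in arrival order while other tables proceed.
    class PerTableQueue {
     public:
      void submit(const std::string& table, std::function<void()> op) {
        auto& q = queues_[table];
        q.push_back(std::move(op));
        if (q.size() == 1)   // nothing ahead of us: run immediately
          run_front(table);
      }

      // Called when the front operation completes.
      void done(const std::string& table) {
        auto& q = queues_[table];
        q.pop_front();
        if (!q.empty())
          run_front(table);  // start the next queued operation
      }

     private:
      void run_front(const std::string& table) { queues_[table].front()(); }
      std::map<std::string, std::deque<std::function<void()>>> queues_;
    };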
>> >> >> >> >> >>
>> >> >> >> >> >> I'm not sure about this particular case. For alter table
>> >> >> >> >> >> while ranges are split/moved, it seems to me that it works
>> >> >> >> >> >> as long as you update the schema in Hyperspace and on the
>> >> >> >> >> >> range servers atomically. The split/moved ranges on the new
>> >> >> >> >> >> destination server will get the right schema. Also, two
>> >> >> >> >> >> alter tables can overlap in many cases, as long as the
>> >> >> >> >> >> schema updates on Hyperspace and the range servers are
>> >> >> >> >> >> atomic. For cases where alter table on the same table needs
>> >> >> >> >> >> to be sequenced, it's actually not too much to ask the
>> >> >> >> >> >> application to do the sequencing, as alter table is not
>> >> >> >> >> >> really a frequent operation (otherwise, they should go with
>> >> >> >> >> >> a generic column family and go nuts on qualifiers.)
>> >> >> >> >> >>
>> >> >> >> >> >> > To handle these dependencies, I propose designing the
>> >> >> >> >> >> > Master as an execution engine for a directed acyclic
>> >> >> >> >> >> > graph of operations, or operation dependency graph (ODG).
>> >> >> >> >> >> > Each node in the graph would represent an operation (e.g.
>> >> >> >> >> >> > ALTER TABLE, RECOVER RangeServer) and would contain
>> >> >> >> >> >> > dynamic state. Execution threads would carry out the
>> >> >> >> >> >> > operations by picking up nodes from the graph in
>> >> >> >> >> >> > topological sort order. When a RangeServer dies, the ODG
>> >> >> >> >> >> > execution engine would pause, a new "RECOVER RangeServer"
>> >> >> >> >> >> > node will get created, and the ODG will get modified to
>> >> >> >> >> >> > include this new node. All of the existing nodes that
>> >> >> >> >> >> > were dependent on that RangeServer would become dependent
>> >> >> >> >> >> > on this new RECOVER RangeServer node. At this point the
>> >> >> >> >> >> > ODG execution engine would be restarted.
>> >> >> >> >> >>
>> >> >> >> >> >> The same alter table arguments can apply here as well. You
>> >> >> >> >> >> can let the alter table proceed on Hyperspace and the
>> >> >> >> >> >> remaining range servers. The recovered ranges would get the
>> >> >> >> >> >> right schema. Otherwise, an alter table command can take a
>> >> >> >> >> >> long time (up to a few minutes) while one of the range
>> >> >> >> >> >> servers is being recovered.
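The ODG execution engine quoted above (threads picking up nodes in topological-sort order) could be sketched as follows; the node type and the single-threaded loop are illustrative assumptions, not a proposed implementation:

    #include <map>
    #include <set>
    #include <string>

    // One node per operation (e.g. "ALTER TABLE t", "RECOVER rs1").
    struct OpNode {
      std::string name;
      std::set<std::string> deps;  // names of nodes this one waits on
      bool done = false;
    };

    // Single-threaded skeleton of the execution engine: repeatedly run
    // any node whose dependencies are all done (i.e. topological order).
    // Assumes every name in deps is present in the graph.
    void run_engine(std::map<std::string, OpNode>& graph) {
      bool progress = true;
      while (progress) {
        progress = false;
        for (auto& kv : graph) {
          OpNode& node = kv.second;
          if (node.done) continue;
          bool ready = true;
          for (const auto& d : node.deps)
            if (!graph.at(d).done) { ready = false; break; }
          if (ready) {
            // execute_operation(node);  // assumed; would run the op
            node.done = true;
            progress = true;
          }
        }
      }
    }

    // Pausing this engine on a RangeServer death, adding a RECOVER node,
    // pointing affected operations at it, and resuming corresponds to the
    // while(true)/pause/modify/restart loop quoted earlier in the thread.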
>> >> >> >> >> >>
>> >> >> >> >> >> > The Master Meta Log (MML) would essentially persist any
>> >> >> >> >> >> > changes to the ODG, both node state as well as structural
>> >> >> >> >> >> > graph changes. When the Master fails and a new one comes
>> >> >> >> >> >> > up, it would replay the MML to reconstruct the ODG, after
>> >> >> >> >> >> > which it could continue execution.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Thoughts?
>> >> >> >> >> >>
>> >> >> >> >> >> It seems to me that an ODG is not absolutely required for
>> >> >> >> >> >> normal Hypertable operations. I'd like to avoid
>> >> >> >> >> >> over-engineering (if possible) for the first release.
>> >> >> >> >> >>
>> >> >> >> >> >> __Luke
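If the MML did persist ODG changes as Doug describes above, the records might reduce to two kinds: node-state updates and edge (dependency) updates. A sketch with invented record layouts and a stand-in Graph type, not the actual MML or ODG design:

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Two invented MML record kinds for persisting the ODG: a node's state
    // change, or an edge (dependency) being added/removed.
    struct OdgRecord {
      enum Kind { NODE_STATE, EDGE_ADD, EDGE_REMOVE } kind;
      std::string node;   // node affected (NODE_STATE) or edge source
      std::string state;  // new state, e.g. "BLOCKED", "DONE" (NODE_STATE)
      std::string dep;    // edge target (EDGE_ADD / EDGE_REMOVE)
    };

    // Stand-in for whatever structure the execution engine actually runs.
    struct Graph {
      std::map<std::string, std::string> state;           // node -> state
      std::map<std::string, std::set<std::string>> deps;  // node -> deps
      void set_state(const std::string& n, const std::string& s) { state[n] = s; }
      void add_edge(const std::string& n, const std::string& d) { deps[n].insert(d); }
      void remove_edge(const std::string& n, const std::string& d) { deps[n].erase(d); }
    };

    // Replaying the records in order rebuilds the graph after a Master crash.
    void replay_log_and_reconstruct_graph(Graph& g,
                                          const std::vector<OdgRecord>& log) {
      for (const OdgRecord& r : log) {
        switch (r.kind) {
        case OdgRecord::NODE_STATE:  g.set_state(r.node, r.state); break;
        case OdgRecord::EDGE_ADD:    g.add_edge(r.node, r.dep);    break;
        case OdgRecord::EDGE_REMOVE: g.remove_edge(r.node, r.dep); break;
        }
      }
    }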
