I think Luke is right about going for a simple design, but under the circumstances, I think we need some kind of framework to reason about recovery; otherwise it could end up as a huge, unworkable pile of recursion and if-thens.
As a DAG, I think this solution is at most a 4-partite graph where the first level is all RootRange recoveries, the next is MetaRange recoveries, the next is UserRange recoveries, and finally DropTable/AlterTable etc. Except for the last level, a node depends on all items in the previous level being completed (e.g. you can't recover a MetaRange unless all RootRange recoveries are done). At the last two levels, it might make sense at some point to add some kind of prioritization (e.g. user Range x is not being used, so recover it last), but that is probably too much at this point.

-Sanjit

On Fri, Jul 31, 2009 at 3:31 PM, Luke <[email protected]> wrote:
>
> On Fri, Jul 31, 2009 at 3:20 PM, Sanjit Jhala <[email protected]> wrote:
> > For alter table, it's not merely an atomic update to Hyperspace. The master
> > updates the schema on Hyperspace and then sends "update schema" commands to
> > all the RangeServers and waits for them to ack before returning. This avoids
> > unnecessary per-RangeServer traffic to Hyperspace. Since alter table is
> > expected to be a fairly infrequent operation, I don't think it's unreasonable
> > for users to have to wait if execution is blocked by RangeServer recovery.
> > This is pretty much the same for drop table.
>
> I meant you still do atomic update schema commands on range servers.
> It's just that new ranges would get the schema from the same range
> server if there are other ranges of the table on that range server. If
> the new ranges are from a recovered range server and happen to be a
> new table, you do incur some additional hyperspace traffic.
>
> My reasoning is that we should do the simplest thing that works,
> especially for infrequent operations.
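For concreteness, the level-by-level execution described above could be sketched like this. This is just a toy illustration, not Hypertable code; the level names and the `run_recovery` helper are invented, and real execution would presumably dispatch each level's operations to a thread pool rather than run them serially:

```python
# Hypothetical sketch of the 4-partite recovery DAG: each level must
# fully complete (a barrier) before any node in the next level runs.
LEVELS = [
    "RECOVER_ROOT_RANGE",
    "RECOVER_META_RANGE",
    "RECOVER_USER_RANGE",
    "META_OP",  # DROP TABLE / ALTER TABLE etc.
]

def run_recovery(ops_by_level):
    """ops_by_level maps a level name to a list of zero-arg callables.

    Because every node depends on all items in the previous level,
    no per-node dependency tracking is needed -- the level ordering
    is the whole dependency structure.
    """
    for level in LEVELS:
        for op in ops_by_level.get(level, []):
            op()  # in practice: submit to a pool, then join per level
```

One consequence of this shape is that prioritization within a level (e.g. recovering a rarely used UserRange last) only changes ordering inside a level, never across levels.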
>
> __Luke
>
> > -Sanjit
> >
> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
> >>
> >> Master is getting more and more like a workqueue and jobtracker :) It
> >> seems advantageous to actually create a separate general server
> >> to manage all the tasks, which could be used for scheduling map/reduce
> >> tasks in the future as well.
> >>
> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd <[email protected]> wrote:
> >> > The Master is responsible for orchestrating recovery from RangeServer
> >> > failures as well as carrying out meta operations in response to commands
> >> > such as CREATE TABLE, ALTER TABLE, and DROP TABLE. These meta operations
> >> > are relatively straightforward except in the face of RangeServer failure.
> >> > When this happens, any in-progress meta operation that is dependent on the
> >> > failed RangeServer needs to block until the RangeServer has been recovered.
> >> > If another RangeServer that is involved in the recovery goes down, there is
> >> > now another recovery operation that needs to be carried out. The Master can
> >> > quickly start building up a fairly complex set of operation dependencies.
> >> >
> >> > The Master is also responsible for moving ranges from one RangeServer to
> >> > another when load across the RangeServers gets out of balance. If a MOVE
> >> > RANGE operation is in progress when, say, an ALTER TABLE request arrives,
> >> > and the range being moved is part of the table specified in the ALTER TABLE
> >> > request, then the ALTER TABLE operation needs to wait until the MOVE RANGE
> >> > operation is complete before it can continue. Also, if two ALTER TABLE
> >> > requests arrive at the Master at the same time, then they should get carried
> >> > out in sequential order, with one of the ALTER TABLE operations depending
> >> > on the completion of the other operation.
> >>
> >> I'm not sure about this particular case. For alter table while ranges
> >> are split/moved, it seems to me that as long as you update the schema
> >> in hyperspace/range servers atomically, the split/moved ranges on the
> >> new destination server will get the right schema. Also, two alter tables
> >> can overlap in many cases, as long as the schema updates on
> >> hyperspace/range servers are atomic. For cases where alter table on
> >> the same table needs to be sequenced, it's actually not too much to
> >> ask the application to do the sequencing, as alter table is not really a
> >> frequent operation (otherwise, they should go with a generic column
> >> family and go nuts on qualifiers).
> >>
> >> > To handle these dependencies, I propose designing the Master as an execution
> >> > engine for a directed acyclic graph of operations, or operation dependency
> >> > graph (ODG). Each node in the graph would represent an operation (e.g.
> >> > ALTER TABLE, RECOVER RangeServer) and would contain dynamic state.
> >> > Execution threads would carry out the operations by picking up nodes from
> >> > the graph in topological sort order. When a RangeServer dies, the ODG
> >> > execution engine would pause, a new "RECOVER RangeServer" node would get
> >> > created, and the ODG would get modified to include this new node. All of
> >> > the existing nodes that were dependent on that RangeServer would become
> >> > dependent on this new RECOVER RangeServer node. At this point, the ODG
> >> > execution engine would be restarted.
> >>
> >> The same alter table arguments can apply here as well. You can let the
> >> alter table proceed on hyperspace and the remaining range servers.
> >> The recovered ranges would get the right schema. Otherwise, an alter
> >> table command can take a long time (up to a few minutes) while one of
> >> the range servers is being recovered.
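The ODG proposal quoted above (nodes execute once all their dependencies are done; a RangeServer failure splices a RECOVER node in front of every pending node that touches that server) could be sketched roughly as follows. All class and method names here are invented for illustration; the real Master would execute ready nodes on worker threads and pause/restart the engine around the graph mutation:

```python
# Toy sketch of the operation dependency graph (ODG) idea -- not
# Hypertable code. A node is runnable when every dependency is done.
class Node:
    def __init__(self, name, servers=()):
        self.name = name
        self.servers = set(servers)  # RangeServers this op touches
        self.deps = set()            # nodes that must finish first
        self.done = False

class ODG:
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)
        return node

    def ready(self):
        """Nodes whose dependencies are all complete (topological front)."""
        return [n for n in self.nodes
                if not n.done and all(d.done for d in n.deps)]

    def run_step(self, executed):
        """Execute one wave of ready nodes (stand-in for worker threads)."""
        for n in self.ready():
            n.done = True
            executed.append(n.name)

    def on_server_failure(self, server):
        """Pause point: insert a RECOVER node and rewire pending dependents."""
        recover = Node("RECOVER " + server)
        for n in self.nodes:
            if not n.done and server in n.servers:
                n.deps.add(recover)
        self.add(recover)
```

Note how this captures both cases from the thread: an ALTER TABLE whose table has ranges on a failed server automatically blocks behind the RECOVER node, and two concurrent ALTER TABLEs can be sequenced simply by adding a dependency edge between them.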
> >>
> >> > The Master Meta Log (MML) would essentially persist any changes to the
> >> > ODG, both node state as well as structural graph changes. When the Master
> >> > fails and a new one comes up, it would replay the MML to reconstruct the
> >> > ODG, after which it could continue execution.
> >> >
> >> > Thoughts?
> >>
> >> It seems to me that an ODG is not absolutely required for normal
> >> Hypertable operations. I'd like to avoid over-engineering (if
> >> possible) for the first release.
> >>
> >> __Luke

--
You received this message because you are subscribed to the Google Groups "Hypertable Development" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/hypertable-dev?hl=en
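The MML idea in Doug's proposal (persist every ODG mutation as an append-only log record; a new Master folds the records in order to rebuild the graph) is the standard write-ahead-log pattern. A minimal sketch, with an invented record format and an in-memory list standing in for the durable log file:

```python
# Rough sketch of Master Meta Log (MML) replay -- not Hypertable code.
# Both structural changes ("add") and node-state changes ("state") are
# logged; replay folds the records in order to reconstruct the ODG.
import json

class MetaLog:
    def __init__(self):
        self.records = []  # stand-in for an append-only file on disk

    def append(self, record):
        # In a real implementation this would be fsync'd before the
        # corresponding in-memory ODG mutation is considered durable.
        self.records.append(json.dumps(record))

    def replay(self):
        """Rebuild operation state by applying every record in order."""
        ops = {}
        for raw in self.records:
            rec = json.loads(raw)
            if rec["type"] == "add":
                ops[rec["id"]] = {"state": "pending", "deps": rec["deps"]}
            elif rec["type"] == "state":
                ops[rec["id"]]["state"] = rec["state"]
        return ops
```

Because replay is a pure fold over the records, a restarted Master recovers exactly the graph (nodes, edges, and per-node progress) that the failed Master had persisted, and can resume execution from the topological front.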
