Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hbase/MasterRewrite" page has been changed by stack.
http://wiki.apache.org/hadoop/Hbase/MasterRewrite?action=diff&rev1=10&rev2=11

--------------------------------------------------

  = Design Notes for Master Rewrite =
- 
  The initial Master Rewrite design came out of conversations at the hbase hackathon held at StumbleUpon, August 5-7, 2009 ([[https://issues.apache.org/jira/secure/attachment/12418561/HBase+Hackathon+Notes+-+Sunday.pdf|Jon Gray kept notes]]).  The umbrella issue for the master rewrite is [[https://issues.apache.org/jira/browse/HBASE-1816|HBASE-1816]].  The timeline is hbase 0.21.0.
  
  == Table of Contents ==
@@ -9, +8 @@

   * [[#problems|Problems with current Master]]
   * [[#scope|Design Scope]]
   * [[#design|Design]]
-   * [[#all|Move all region state transitions to zookeeper]]
+   * [[#moveall|Move all state, state transitions, and schema to zookeeper]]
+    * [[#zklayout|Zookeeper layout]]
    * [[#clean|Region State changes are clean, minimal, and comprehensive]]
-   * [[#state|Cluster/Table State Changes via ZK]]
-   * [[#schema|Schema]]
-   * [[#distinct|In Zookeeper, a State and a Schema section]]
    * [[#balancer|Region Assignment/Balancer]]
    * [[#root|Remove -ROOT-]]
    * [[#heartbeat|Remove Heartbeat]]
@@ -21, +18 @@

    * [[#intermediary|Further remove Master as necessary intermediary]]
  
  <<Anchor(now)>>
+ 
  == What does the Master do now? ==
  Here's a bit of a refresher on what Master currently does:
+ 
   * Region Assignment
   * On startup, and when balancing as regions are created and deleted and as regions grow and split.
   * Scan root/meta
@@ -33, +32 @@

   * Distributes administrative close, flush, and compact messages
   * Watches ZK for its own lease and for regionservers so it knows when to run recovery
  
+ <<Anchor(problems)>>
  
- <<Anchor(problems)>>
  == Problems with current Master ==
- There is a list in the 
[[https://issues.apache.org/jira/secure/ManageLinks.jspa?id=12434794|Issue 
Links]] section of HBASE-1816.
+ There is a list in the [[https://issues.apache.org/jira/secure/ManageLinks.jspa?id=12434794|Issue Links]] section of HBASE-1816.
+ 
+ <<Anchor(scope)>>
  
- <<Anchor(scope)>>
  == Design Scope ==
   1. Rewrite of Master is for HBase 0.21
   1. Design for:
-    1. A cluster of 200 regionservers... (TODO: These numbers don't make sense 
-- jgray do you remember what they were about?  See misc section below)
-         
+   1. A cluster of 1k regionservers.
+   1. Each regionserver carries 100 regions of 1G each (100k regions =~ 100TB)
+ 
  <<Anchor(design)>>
+ 
  == Design ==
+ <<Anchor(moveall)>>
  
- <<Anchor(all)>>
- === Move all region state transitions to zookeeper ===
+ === Move all state, state transitions, and schema to zookeeper ===
- Run state transitions by changing state in zookeeper rather than in Master.
+ Tables are offlined, onlined, made read-only, and dropped (add a freeze-flushes-and-compactions state to facilitate snapshotting).  Currently the HBase Master does this by messaging regionservers.  Instead, move the state to zookeeper.  Let regionservers watch for changes and react.  Allow that a cluster may have up to 100 tables.  Tables are made of regions.  There may be thousands of regions per table.  A regionserver could be carrying a region from each of the 100 tables.  TODO: Should a regionserver have a watcher per table or a watcher per region?
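
A minimal sketch of the regionserver side, assuming table state lives in a single per-table znode (/hbase/tables/TABLENAME/state, one reading of the zk layout below) and a hypothetical applyTableState() reaction hook:

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: a regionserver-side watcher that re-reads a table's state znode
// whenever it changes and reacts, e.g. flips the table's regions read-only.
public class TableStateWatcher implements Watcher {
  private final ZooKeeper zk;
  private final String stateZnode;  // e.g. /hbase/tables/TABLENAME/state

  public TableStateWatcher(ZooKeeper zk, String tableName) {
    this.zk = zk;
    this.stateZnode = "/hbase/tables/" + tableName + "/state";
  }

  // Read the current state and leave a watch behind.
  public void start() throws KeeperException, InterruptedException {
    applyTableState(zk.getData(stateZnode, this, null));
  }

  public void process(WatchedEvent event) {
    if (event.getType() == Watcher.Event.EventType.NodeDataChanged
        && stateZnode.equals(event.getPath())) {
      try {
        // Watches are one-shot, so re-read with the watch set again.
        applyTableState(zk.getData(stateZnode, this, null));
      } catch (Exception e) {
        // Real code would handle retry and session expiry here.
      }
    }
  }

  // Hypothetical reaction hook: toggle read-only, block flushes/compactions, etc.
  private void applyTableState(byte[] serializedState) {
  }
}
}}}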
+ 
+ Tables have schema.  Tables are made of column families.  Column families have schema/attributes.  Column families can be added and removed.  Currently the schema is written into a column in the .META. catalog table.  Move all schema to zookeeper.  Regionservers would have watchers on schema and would react to changes.  TODO: A watcher per column family, a watcher per table, or a watcher on the parent directory for schema?
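
A sketch of the write side for schema, assuming the /hbase/tables/TABLENAME/schema/families/FAMILYNAME/attributes znode from the layout below; the JSON is hand-rolled here purely for illustration:

{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: publish (or update) a column family's schema attributes in zk.
// Regionservers watching the table's schema subtree pick up the change.
public class SchemaPublisher {
  public static void putFamilyAttributes(ZooKeeper zk, String table,
      String family, int maxVersions, String compression)
      throws KeeperException, InterruptedException {
    // Assumes the parent .../schema/families/FAMILYNAME znodes already exist.
    String path = "/hbase/tables/" + table + "/schema/families/" + family
        + "/attributes";
    // Hand-rolled JSON purely for illustration.
    byte[] json = ("{\"maxVersions\":" + maxVersions
        + ",\"compression\":\"" + compression + "\"}").getBytes();
    if (zk.exists(path, false) == null) {
      zk.create(path, json, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, json, -1);  // -1 = any version
    }
  }
}
}}}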
+ 
+ Run region state transitions -- i.e. opening, closing -- by changing state in 
zookeeper rather than in Master maps as is currently done.
  
  Keep up a region transition trail; regions move through states from 
''unassigned'' to ''opening'' to ''open'', etc.  A region can't jump states as 
in going from ''unassigned'' to ''open''.
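
A sketch of how the no-jumping rule could be enforced, under the assumption that a region's current state is the data of a single znode (an encoding not settled above), using zookeeper's version-checked setData as a compare-and-set:

{{{
import java.util.EnumSet;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: a region state transition that refuses to jump states and uses
// the znode version as a compare-and-set so two writers can't both win.
public class RegionTransition {
  public enum State {
    UNASSIGNED, OPENING, OPEN, CLOSING, CLOSED;

    // Legal next states; e.g. UNASSIGNED may only go to OPENING, never OPEN.
    EnumSet<State> legalNext() {
      switch (this) {
        case UNASSIGNED: return EnumSet.of(OPENING);
        case OPENING:    return EnumSet.of(OPEN, CLOSED);
        case OPEN:       return EnumSet.of(CLOSING);
        case CLOSING:    return EnumSet.of(CLOSED);
        default:         return EnumSet.of(OPENING);  // CLOSED -> reassign
      }
    }
  }

  // Returns false if the transition would be a jump, or if somebody else
  // changed the znode underneath us; the caller re-reads and retries.
  public static boolean transition(ZooKeeper zk, String regionZnode, State to)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    State from = State.valueOf(new String(zk.getData(regionZnode, false, stat)));
    if (!from.legalNext().contains(to)) {
      return false;  // would be a jump, e.g. UNASSIGNED -> OPEN
    }
    try {
      zk.setData(regionZnode, to.name().getBytes(), stat.getVersion());
      return true;
    } catch (KeeperException.BadVersionException e) {
      return false;  // lost the race; state moved since we read it
    }
  }
}
}}}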
  
- Master (or client) moves regions between states.  Watchers on RegionServers 
notice additions and act on it.  Master (or client) can do transitions in bulk; 
e.g. assign a regionserver 50 regions to open on startup.  Effect is that 
Master "pushes" work out to regionservers rather than wait on them to heartbeat.
+ Master (or client) moves regions between states.  Watchers on RegionServers notice changes and act on them.  Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions to open on startup.  The effect is that Master "pushes" work out to regionservers rather than waiting on them to heartbeat.
  
- A problem we have in current master is that states do not form a circle.  
Once a region is open, master stops keeping state; region state is moved to 
.META. table once assigned with its condition checked periodically by .META. 
table scan.  Makes for confusion and evil such as region double assignment 
because there are race condition potholes as we move from one system -- 
internal state maps in master -- to the other during update to state in .META.  
Current thinking is to keep region lifecycle all up in zookeeper but that won't 
scale.  Postulate 100k regions -- 100TB at 1G regions -- each with two or three 
possible states each with watchers for state change is too much to put in a zk 
cluster.  TODO: how to manage transition from zk to .META.?
+ A problem we have in the current master is that states do not form a circle.  Once a region is open, master stops keeping account of the region's state; region state is now kept out in the .META. catalog table with its condition checked periodically by .META. table scan.  State spanning two systems makes for confusion and evil such as region double assignment, because there are race condition potholes as we move from one system -- internal state maps in master -- to the other during the update to state in .META.  Current thinking is to keep the region lifecycle all up in zookeeper, but that may not scale.  Postulate 100k regions -- 100TB at 1G regions -- each with two or three possible states, each with watchers for state change.  My guess is that this is too much to put in zk.  TODO: how to manage the transition from zk to .META.?
+ 
+ State and Schema are distinct in zk.  No interactions.
+ 
+ <<Anchor(zklayout)>>
+ 
+ ==== zk layout ====
+ {{{
+ /hbase/master
+ /hbase/shutdown
+ /hbase/root-region-server
+ 
+ # Is STARTCODE a timestamp or a random id?
+ /hbase/rs/STARTCODE/load/
+ /hbase/rs/STARTCODE/regions/opening/
+ /hbase/tables/TABLENAME/schema/attributes serialized as JSON # These are 
table attributes.  Distinct from state flags such as read-only.
+ /hbase/tables/TABLENAME/schema/families/FAMILYNAME/attributes serialized as 
JSON
+ /hbase/tables/TABLENAME/state/attribute # Can have only one attribute at a 
time?  E.g. Read-only implies online and no flush/compaction.  Allow support 
for multiple.
+ }}}
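
A sketch of the znodes a create-table path might lay down under the layout above; parents are created explicitly since zookeeper has no recursive create, and NodeExists is ignored so the routine can be re-run:

{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: lay down the per-table znodes from the layout above when a
// table is created.
public class TableZNodes {
  public static void createTableNodes(ZooKeeper zk, String table,
      byte[] tableAttributesJson) throws KeeperException, InterruptedException {
    String base = "/hbase/tables/" + table;
    createIfAbsent(zk, "/hbase", new byte[0]);
    createIfAbsent(zk, "/hbase/tables", new byte[0]);
    createIfAbsent(zk, base, new byte[0]);
    createIfAbsent(zk, base + "/schema", new byte[0]);
    createIfAbsent(zk, base + "/schema/attributes", tableAttributesJson);
    createIfAbsent(zk, base + "/schema/families", new byte[0]);
    createIfAbsent(zk, base + "/state", new byte[0]);  // state attribute(s) set later
  }

  private static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Already there; fine.
    }
  }
}
}}}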
  
  <<Anchor(clean)>>
+ 
  === Region State changes are clean, minimal, and comprehensive ===
- Currently, moving a region from opening to open may involve a region 
compaction -- i.e. a change to content in filesystem.  Better if modification 
of filesystem content was done when no question of ownership involved.
+ Currently, moving a region from ''opening'' to ''open'' may involve a region compaction -- i.e. a change to content in the filesystem.  Better if modification of filesystem content were done when no question of ownership is involved.
  
  In the current o.a.h.h.master.RegionManager.RegionState inner class, here are the possible states:
+ {{{
- {{{    private volatile boolean unassigned = false;
+ private volatile boolean unassigned = false;
-     private volatile boolean pendingOpen = false;
+ private volatile boolean pendingOpen = false;
-     private volatile boolean open = false;
+ private volatile boolean open = false;
-     private volatile boolean closing = false;
+ private volatile boolean closing = false;
-     private volatile boolean pendingClose = false;
+ private volatile boolean pendingClose = false;
-     private volatile boolean closed = false;
+ private volatile boolean closed = false;
-     private volatile boolean offlined = false;}}}
+ private volatile boolean offlined = false;
- Its incomplete.
+ }}}
  
+ TODO: It's incomplete.
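
One way to make the state set complete and impossible to combine inconsistently would be a single enum in place of the seven independent booleans; a sketch (names follow the booleans above, with ''opening'' added per the transition trail):

{{{
// Sketch only: a single enum in place of the seven independent booleans above,
// so a region is always in exactly one state and contradictory combinations
// (e.g. open and closed at once) cannot be expressed.
public enum RegionState {
  UNASSIGNED,
  PENDING_OPEN,
  OPENING,        // not among the booleans above; part of the transition trail
  OPEN,
  CLOSING,
  PENDING_CLOSE,
  CLOSED,
  OFFLINED;

  // A serialized form that could be written as the region's znode data.
  public byte[] toBytes() {
    return name().getBytes();
  }

  public static RegionState fromBytes(byte[] b) {
    return RegionState.valueOf(new String(b));
  }
}
}}}
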
- <<Anchor(state)>>
- === Cluster/Table State ===
- Move cluster and table state to zookeeper.
- 
- Do shutdown via ZK.
- 
- Remove table state change mechanism from master and instead update state in 
zk.  Clients can set table state.  Watchers in regionservers react to table 
state changes.  This will make it so table state changes are near 
instantaneous; e.g. setting table read-only, disabling compactions/flushes and 
all without having to take table offline.
- 
- 
- <<Anchor(distinct)>>
- === In Zookeeper, a State and a Schema section ===
- Two locations in zk; one for schema and then one for state.  No connection.  
For example, could have hierarchy in zk as follows:
- 
- {{{/hbase/tables/name/schema/{family1, family2}
- /hbase/tables/name/state/{attributes [read-only, enabled, nocompact, noflush]}
- /hbase/regionservers/openregions/{list of regions...}
- /hbase/regionserver/to_open/{list of regions....}
- /hbase/regionservers/to_close/{list of regions...}
- 
- 
- <<Anchor(schema)>>
- === Schema Edits ===
- Move Table Schema from .META.  Rather than storing complete Schema in each 
region, stored once in ZK.
- 
- Edit Schema with tables online by making change in ZK.  Watchers inform 
watching RegionServers of changes.  Make it so can add families, amend table 
and family configurations, etc.
  
  <<Anchor(balancer)>>
+ 
  === Region Assignment/Balancer ===
  Make it so we don't need to put up a cluster to test the balancing algorithm (see the sketch at the end of this section).
  
  Assignment / balancing
+ 
   * RS publishes load into ZK (see the sketch below, after this list)
-   * /hbase/rsload/startcode({‘json’:’load’})
+   * /hbase/rsload/STARTCODE/load({‘json’:’load’})
    * Configure period it is refreshed
   * Assignment inputs
    * Load
@@ -120, +122 @@

   * Regionservers watch their own to-open queues /hbase/rsopen/region(extra_info, which hlogs to replay, whether it's a split, etc)
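
A sketch of the publish side, assuming the /hbase/rsload/STARTCODE/load path from the list above, a hand-rolled JSON payload, and an ephemeral znode so the entry disappears with the regionserver's session:

{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: a regionserver publishes its current load as JSON under
// /hbase/rsload/STARTCODE/load.  Ephemeral, so the entry vanishes with the
// regionserver's zk session; the caller refreshes it on a configurable period.
public class LoadPublisher {
  public static void publishLoad(ZooKeeper zk, String startcode,
      int regionCount, int requestsPerSecond)
      throws KeeperException, InterruptedException {
    // Assumes /hbase/rsload/STARTCODE already exists.
    String path = "/hbase/rsload/" + startcode + "/load";
    // Hand-rolled JSON purely for illustration.
    byte[] json = ("{\"regions\":" + regionCount
        + ",\"requests\":" + requestsPerSecond + "}").getBytes();
    if (zk.exists(path, false) == null) {
      zk.create(path, json, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } else {
      zk.setData(path, json, -1);
    }
  }
}
}}}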
  
  Safe-mode assignment
+ 
   * Collect all regions to assign
   * Randomize and assign out in bulk, one msg per RS
-  * NO MORE SAFE-MODE
   * Region assignment is always
    * Look at all regions to be assigned
    * Make a single decision for the assignment of all of these
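
As noted at the top of this section, the balancing algorithm should be testable without putting up a cluster.  A sketch of one way to get there: keep the assignment decision a pure function of (current loads, regions to place), with all zk reads and writes outside it; the names here are illustrative:

{{{
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the assignment decision as a pure function so it can be unit
// tested with fabricated loads; no cluster, no zk.  The caller reads loads
// from zk and writes the resulting plan into the per-RS opening queues.
public class Balancer {
  /** loadByServer: startcode -> number of regions currently carried. */
  public static Map<String, List<String>> assign(
      Map<String, Integer> loadByServer, List<String> regionsToAssign) {
    Map<String, List<String>> plan = new HashMap<String, List<String>>();
    if (loadByServer.isEmpty()) {
      return plan;  // nowhere to assign
    }
    Map<String, Integer> load = new HashMap<String, Integer>(loadByServer);
    List<String> regions = new ArrayList<String>(regionsToAssign);
    Collections.shuffle(regions);  // randomize, as described above
    for (String region : regions) {
      // Greedy: hand each region to the currently least loaded server.
      String lightest = null;
      for (Map.Entry<String, Integer> e : load.entrySet()) {
        if (lightest == null || e.getValue() < load.get(lightest)) {
          lightest = e.getKey();
        }
      }
      load.put(lightest, load.get(lightest) + 1);
      List<String> list = plan.get(lightest);
      if (list == null) {
        list = new ArrayList<String>();
        plan.put(lightest, list);
      }
      list.add(region);
    }
    return plan;
  }
}
}}}

A unit test then just fabricates the load map and the region list and asserts on the returned plan; no regionservers and no zk required.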
  
  <<Anchor(root)>>
+ 
  === Remove -ROOT- ===
  Remove -ROOT- from the filesystem; have it only live up in zk (possible now that the Region Historian feature has been removed).
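
A sketch of the client side once -ROOT- lives only in zk, reading the /hbase/root-region-server znode from the layout above; the host:port payload format is an assumption:

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: with -ROOT- gone from the filesystem, a client finds the server
// carrying the root catalog region by reading a single znode.
public class RootLocator {
  public static String getRootRegionServer(ZooKeeper zk)
      throws KeeperException, InterruptedException {
    // Payload assumed to be "host:port" of the hosting regionserver.
    byte[] data = zk.getData("/hbase/root-region-server", false, null);
    return data == null ? null : new String(data);
  }
}
}}}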
  
  <<Anchor(heartbeat)>>
+ 
  === Remove Heartbeat ===
  We don't need RegionServers pinging the master every 3 seconds if zookeeper is the intermediary.
  
  <<Anchor(safemode)>>
+ 
  === Remove Safe Mode ===
  Safe mode is broken.  It doesn't do anything.  Remove it.
  
  <<Anchor(intermediary)>>
+ 
  === Further remove Master as necessary intermediary ===
  Clients do not need to go via the master for administering tables, changing schema, or sending flush/compaction commands, etc.  Clients should be able to write directly to the regionserver or to zk.
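
For example, a client could flip a table read-only by writing the table state znode in zk itself (the same single-znode state encoding assumed earlier), with no master round-trip; a sketch:

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: an admin client writes table state straight into zk.  Every
// regionserver carrying a region of the table reacts through its state
// watcher; the master is not in the path at all.
public class DirectAdmin {
  public static void setTableState(ZooKeeper zk, String table, String state)
      throws KeeperException, InterruptedException {
    // state is e.g. "read-only", "nocompact", or "noflush"; throws NoNode if
    // the table's state znode has not been created yet.
    zk.setData("/hbase/tables/" + table + "/state", state.getBytes(), -1);
  }
}
}}}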
  
+ <<Anchor(misc)>>
  
- 
- <<Anchor(misc)>>
  == Miscellaneous ==
- 
   * At the meetup we talked of moving .META. to zk and adding a getClosest to the zk code base.  That's been punted on for now.
   * At the meetup we had design numbers but they don't make sense now that we've lost the context:
-      1. 200 regionservers
+   1. 200 regionservers
-      1. 32 outstanding wal logs per regionserver
+   1. 32 outstanding wal logs per regionserver
-      1. 200 regions per regionserver being written to
+   1. 200 regions per regionserver being written to
-      1. 2GB or 30 hour full log roll
+   1. 2GB or 30 hour full log roll
-      1. 10MB/sec write speed
+   1. 10MB/sec write speed
-      1. 1.2M edits per 2G
+   1. 1.2M edits per 2G
-      1. 7k writes/second across cluster (?) -- whats this?  Wrong.
+   1. 7k writes/second across cluster (?) -- what's this?  Wrong.
-      1. 1.2M edits per 30 hours?
+   1. 1.2M edits per 30 hours?
-      1. 100 writes/sec across cluster (?) -- Whats this?  Wrong?
+   1. 100 writes/sec across cluster (?) -- What's this?  Wrong?
  
