Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/MasterRewrite" page has been changed by stack.
http://wiki.apache.org/hadoop/Hbase/MasterRewrite?action=diff&rev1=13&rev2=14

--------------------------------------------------

   1. Rewrite of Master is for HBase 0.21
   1. Design for:
    1. A cluster of 1k regionservers.
-   1. Each regionserver carries 100 regions of 1G each (100k regions =~ 100TB)
+   1. Each regionserver carries 100 regions of 1G each (100k regions =~ 1-200TB)
  
  <<Anchor(design)>>
  == Design ==
@@ -54, +54 @@

  === Move all state, state transitions, and schema to go via zookeeper ===
  Currently state transitions are done inside the Master by shuffling regions between internal Maps, triggered by messages carried on the back of regionserver heartbeats.  Move all of this to zookeeper.
  
+ (Patrick Hunt and Mahadev have been helping with the below via [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases|HBase Zookeeper Use Cases]])
+ 
  <<Anchor(tablestate)>>
  ==== Table State ====
  Tables are offlined, onlined, made read-only, and dropped (add a freeze-flushes-and-compactions state to facilitate snapshotting).  Currently the HBase Master does this by messaging regionservers.  Instead, move the state to zookeeper.  Let regionservers watch for changes and react.  Allow that a cluster may have up to 100 tables.  Tables are made of regions.  There may be thousands of regions per table.  A regionserver could be carrying a region from each of the 100 tables.
  
- Tables have schema.  Tables are made of column families.  Column families have schema/attributes.  Column families can be added and removed.  Currently the schema is written into a column in the .META. catalog family.  Move all schema to zookeeper.  Regionservers would have watchers on schema and would react to changes.
+ Tables have schema.  Tables are made of column families.  Column families have schema/attributes.  Column families can be added and removed.  Currently the schema is written into a column in the .META. catalog family.  Move all schema to zookeeper.  Regionservers could have schema watchers and react to schema changes.
  
- In a tables znode up in zk, have a file that per table on the cluster, it lists current state attributes -- read-only, no-flush -- and that tables' schema all in JSON.  Only the differences from default are up in zk.  All regionservers keep watch on this znode reacting if changed spinning through their list of regions making reconciliation with current state of tables znode content.
+ ===== Design =====
+ 
+ In a tables directory up in zk, have a znode per table as per [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases#case1|phunt's suggestion]].  The znode will be named for the table.  In each table's znode keep state attributes -- read-only, no-flush -- and the table's schema (all in JSON).  Only carry the differences from the defaults up in zk, to save on the amount of data that needs to be passed.  Let all regionservers watch all table znodes, reacting to changes by spinning through their list of regions and reconciling each with the current state of the table's znode content.
+ 
+ 
+ <<Anchor(zklayout)>>
+ 
+ ====== zk layout ======
+ {{{
+ /hbase/tables/table1 {JSON object would have state and schema objects, etc.  State is read-only, offline, etc.  Schema has differences from default only}
+ /hbase/tables/table2
+ }}}
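+ 
+ A rough sketch of the regionserver side against this layout, using the plain zookeeper client API (the TableStateWatcher class and the reconcile() hook are just illustrative names; JSON parsing is left out):
+ 
+ {{{
+ import java.util.List;
+ import org.apache.zookeeper.KeeperException;
+ import org.apache.zookeeper.WatchedEvent;
+ import org.apache.zookeeper.Watcher;
+ import org.apache.zookeeper.ZooKeeper;
+ 
+ // Each regionserver watches every table znode under /hbase/tables and
+ // re-reads the JSON state+schema blob whenever anything changes.
+ public class TableStateWatcher implements Watcher {
+   private final ZooKeeper zk;
+ 
+   public TableStateWatcher(ZooKeeper zk) {
+     this.zk = zk;
+   }
+ 
+   public void watchAllTables() throws KeeperException, InterruptedException {
+     // Watch the directory itself so we notice tables being created or dropped.
+     List<String> tables = zk.getChildren("/hbase/tables", this);
+     for (String table : tables) {
+       // Watch each table znode; its data is the JSON {state, schema-diff} blob.
+       byte[] json = zk.getData("/hbase/tables/" + table, this, null);
+       reconcile(table, json);
+     }
+   }
+ 
+   public void process(WatchedEvent event) {
+     // zk watches are one-shot: re-register and reconcile on every notification.
+     try {
+       watchAllTables();
+     } catch (Exception e) {
+       // Real code would handle session expiry, retries, etc.
+     }
+   }
+ 
+   private void reconcile(String table, byte[] json) {
+     // Illustrative hook: parse the JSON, then spin through this server's
+     // regions of 'table' applying read-only/no-flush/schema changes.
+   }
+ }
+ }}}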
+ 
+ 
+ ====== Other considerations? ======
+ 
+ A state per region?
+ 
+ Should we rather have just one file with all table schemas and states in it?  Easier to deal with?  Patrick warns that it could bump the 1MB zk znode content limit and that it could slow zk, having to shuttle near-1MB of table schema+state on every little edit.
+ 
+ Patrick suggests that we have an alltables znode adjacent to the tables directory of znodes and in here we'd keep state for all tables.  This is state in two places, so leave it aside unless really needed.
  
  <<Anchor(regionstate)>>
  ==== Region State ====
@@ -69, +91 @@

  
  Keep up a region transition trail; regions move through states from ''unassigned'' to ''opening'' to ''open'', etc.  A region can't jump states, as in going from ''unassigned'' to ''open''.
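+ 
+ A toy illustration of the trail idea (state names indicative only, not what 0.21 will actually carry):
+ 
+ {{{
+ // Toy model of the transition trail: a region may only step to the next
+ // state, so jumping straight from unassigned to open is rejected.
+ enum RegionState {
+   UNASSIGNED, OPENING, OPEN;   // closing, splitting, etc. elided
+ 
+   boolean canStepTo(RegionState next) {
+     return next.ordinal() == this.ordinal() + 1;
+   }
+ }
+ }}}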
  
- Master (or client) moves regions between states.  Watchers on RegionServers notice changes and act on it.  Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions to open on startup.  Effect is that Master "pushes" work out to regionservers rather than wait on them to heartbeat.
+ Part of a transition involves moving a region under a regionserver.
  
- A problem we have in current master is that states do not make a circle.  Once a region is open, master stops keeping account of a regions' state; region state is now kept out in the .META. catalog table with its condition checked periodically by .META. table scan.  State spanning two systems currently makes for confusion and evil such as region double assignment because there are race condition potholes as we move from one system -- internal state maps in master -- to the other during update to state in .META.  Current thinking is to keep region lifecycle all up in zookeeper but that won't scale.  Postulate 100k regions -- 100TB at 1G regions -- each with two or three possible states each with watchers for state change.  My guess is that this is too much to put in zk (Mahadev+Patrick say no if data is small).  TODO: how to manage transition from zk to .META.?  Also, can't do getClosest up in zk, only in .META.
+ Master (or client) moves regions between states.  Watchers on RegionServers notice changes and act.  Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions to open on startup.  The effect is that the Master "pushes" work out to regionservers rather than waiting on them to heartbeat, which is how we currently assign regions.
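+ 
+ For illustration, the bulk "push" could look something like the below (the exact path layout and the region-znode-under-regionserver naming are assumptions, not settled):
+ 
+ {{{
+ import java.util.List;
+ import org.apache.zookeeper.CreateMode;
+ import org.apache.zookeeper.ZooDefs;
+ import org.apache.zookeeper.ZooKeeper;
+ 
+ // Master parks a znode per region under the target regionserver's znode;
+ // the regionserver's child watch fires and it opens everything it finds
+ // there -- no waiting on a heartbeat round trip.
+ public class BulkAssign {
+   public static void assign(ZooKeeper zk, String startcode, List<String> regions)
+       throws Exception {
+     for (String region : regions) {
+       // Znode data carries the state the regionserver should drive the region to.
+       zk.create("/hbase/rs/" + startcode + "/" + region,
+           "opening".getBytes("UTF-8"),
+           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+     }
+   }
+ }
+ }}}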
  
- TODO: qs in zk?
+ A problem we have in the current system (<= hbase 0.20.x) is that states do not make a circle.  Once a region is open, the master stops keeping account of a region's state; region state is now kept out in the .META. catalog table with its condition checked periodically by a .META. table scan.  State spanning two systems currently makes for confusion and evil such as region double assignment, because there are race condition potholes as we move from one system -- internal state maps in the master -- to the other during an update to state in .META.
  
- <<Anchor(zklayout)>>
+ Current thinking is to keep the region lifecycle all up in zookeeper, but that won't scale.  Postulate 100k regions -- 100TB at 1G regions -- each with two or three possible states, each with watchers for state change.  My guess is that this is too much to put in zk (Mahadev+Patrick say it is not, provided the data is small).  TODO: how to manage the transition from zk to .META.?  Also, can't do getClosest up in zk, only in .META.
  
+ ===== Design =====
+ Here is [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases#case2|Patrick's suggestion]].  We already keep a znode per regionserver, though it is named for the regionserver's startcode.  On evaporation of the regionserver's ephemeral node, the master would run a reconciliation (or, on assumption of the master role, a new master would check state in zk, making sure there is a regionserver per region), adding unassigned regions back to the unassigned pool.
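+ 
+ Sketch of the master side of this (the /hbase/rs path is the existing per-regionserver layout; the reconcile() helper and the dead-server bookkeeping are only illustrative):
+ 
+ {{{
+ import java.util.HashSet;
+ import java.util.List;
+ import java.util.Set;
+ import org.apache.zookeeper.WatchedEvent;
+ import org.apache.zookeeper.Watcher;
+ import org.apache.zookeeper.ZooKeeper;
+ 
+ // Master keeps a child watch on /hbase/rs; when a regionserver's ephemeral
+ // znode evaporates, its regions go back to the unassigned pool.
+ public class RegionServerTracker implements Watcher {
+   private final ZooKeeper zk;
+   private Set<String> live = new HashSet<String>();
+ 
+   public RegionServerTracker(ZooKeeper zk) {
+     this.zk = zk;
+   }
+ 
+   public void track() throws Exception {
+     List<String> current = zk.getChildren("/hbase/rs", this);
+     Set<String> dead = new HashSet<String>(live);
+     dead.removeAll(current);
+     for (String startcode : dead) {
+       reconcile(startcode);
+     }
+     live = new HashSet<String>(current);
+   }
+ 
+   public void process(WatchedEvent event) {
+     try {
+       track();
+     } catch (Exception e) {
+       // Session expiry/retry handling elided.
+     }
+   }
+ 
+   private void reconcile(String deadStartcode) {
+     // Illustrative: walk the dead server's regions and add them back to the
+     // unassigned pool so they get reassigned elsewhere.
+   }
+ }
+ }}}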
- ==== zk layout ====
- {{{
- /hbase/master
- /hbase/shutdown
- /hbase/root-region-server
  
+ All regions would be listed in the .META. table always.  Whether they are online, splitting or closing, etc., would be up in zk.
- # Is STARTCODE a timestamp or a random id?
- /hbase/rs/STARTCODE
- 
- /hbase/tables {JSON array of table objects.  Each table object would have state and schema objects, etc.  State is read-only, offline, etc.  Schema has differences from default only}
- }}}
  
  <<Anchor(clean)>>
  
