Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "ZooKeeper/HBaseUseCases" page has been changed by PatrickHunt.
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases?action=diff&rev1=11&rev2=12

--------------------------------------------------

  === Case 1 ===
  Summary: HBase Table State and Schema Changes
  
+ A table has a schema and state (online, read-only, etc.).  When we say 
thousands of RegionServers, we're trying to give a sense of how many watchers 
we'll have on the znode that holds table schemas and state.  When we say 
hundreds of tables, we're trying to give some sense of how big the znode 
content will be... say 256 bytes of schema -- we'll only record difference from 
default to minimize whats up in zk -- and then state I see as being something 
like zk's four-letter words only they can be compounded in this case.  So, 100s 
of tables X 1024 schema X (2 four-letter words each on average) at the outside 
makes for about a MB of data that thousands of regionservers are watching.  
That OK?
+ 
  Expected scale: Thousands of RegionServers watching ready to react to changes 
with about 100 tables each of which can have 1 or 2 states and an involved 
schema
  
+ [MS] I was thinking one znode of state and schema.  RegionServers would all 
have a watch on it.  100s of tables means that a schema change on any table 
would trigger watches on 1000s of RegionServers.  That might be OK though 
because any RegionServer could be carrying a Region from the edited table.
- [PDH] the link is very useful, let's do the math here so that it's more 
clear, correct where I get it wrong. So if I get this right:
-  * 100 tables
-  * each table may have 1000+ regions (let's say 1000 for this calculation)
-  * 1 region server will carry a region from each table (per the link)
-  * if I understand correctly, region servers don't "own" the region znode in 
the table, just watch tables for regions it carries
  
- [MS] Number of regions is not pertinent here.  Whats relevant is that a table 
has a schema and state (online, read-only, etc.).  When I say thousands of 
RegionServers, I'm trying to give a sense of how many watchers we'll have on 
the znode that holds table schemas and state.  When I say hundreds of tables, 
I'm trying to give some sense of how big the znode content will be... say 256 
bytes of schema -- we'll only record difference from default to minimize whats 
up in zk -- and then state I see as being something like zk's four-letter words 
only they can be compounded in this case.  So, 100s of tables X 1024 schema X 
(2 four-letter words each on average) at the outside makes for about a MB of 
data that thousands of regionservers are watching.  That OK?
+ [PDH] My original assumption was that each table has it's own znode (and 
would still be my advice). In general you don't want to store very much data 
per znode - the reason being that writes will slow (think of this -- client 
copies data to ZK server, which copies data to ZK leader, which broadcasts data 
to all servers in the cluster, which then commit allowing the original server 
to respond to the client). If the znode is changing infrequently, then no big 
deal, but in general you don't want to do this. Also, the ZK server has a 1mb 
max data size by default, so if you increase the number of tables (etc...) you 
will bump this at some point.
  
+ [PDH Hence my original assumption, and suggestion. Consider having a znode 
per table, rather than a single znode. It's more scalable and should be better 
in general. That's up to you though - 1 znode will work too.
- [PDH] So this means a region server watches each of 100 tables. 
-  * 100 * 1000 = 100k watches, as each region server watching 100 table nodes
-  * watches typically fire as a group/table (ie on/off/ro/drop each table)
-   * 1000 watches would fire notifying 1000 region servers each time a table 
changes
  
- [MS] I was thinking one znode of state and schema.  RegionServers would all 
have a watch on it.  100s of tables means that a schema change on any table 
would trigger watches on 1000s of RegionServers.  That might be OK though 
because any RegionServer could be carrying a Region from the edited table.
+ [PDH] A single table can change right? Not all the tables necessarily change 
state at the same time? Then splitting into multiple znodes makes even more 
sense - you would only be changing (writing) for things that change, even 
better is that the watchers will know the exact table that changed rather than 
determining by reading the data and diffing.... I'm no expert on hbase but from 
a typical ZK use case this is better. Obv this is a bit more complex than a 
single znode, also there are more (separate) notifications that will fire 
instead of a single one.... so you'd have to think through your use case (you 
could have a toplevel "state" znode that brings down all the tables in the case 
where all the tables need to go down... then you wouldn't have to change each 
table individually for this case (all tables down for whatever reason).
  
  General recipe implemented: A better description of problem and sketch of the 
solution can be found at 
[[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#tablestate|Master Rewrite: 
Table State]]
  
  [PDH] this is essentially "dynamic configuration" usecase - we are telling 
each region server the state of the table containing a region it manages, when 
the master changes the state the watchers are notified
  
  [MS] Is "dynamic configuration' usecase a zk usecase type described somewhere?
+ 
+ [PDH] What we have is 
[[http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_outOfTheBox|here]].
 Not much to it really - both for name service and dynamic config you are 
creating znodes that store relevant data, ZK clients can read/write/watch those 
nodes.
  
  === Case 2 ===
  Summary: HBase Region Transitions from unassigned to open and from open to 
unassigned with some intermediate states
@@ -87, +84 @@

  
  [MS]  ZK will do the increment for us?  This looks good too.
  
+ [PDH] Right, the "increment" is using the SEQUENTIAL flag on create
+ 
  
  Any metadata stored for a region znode (ie to identify)? As long as size is 
small no problem. (if a bit larger consider a /regions/<regionX>  znodes which 
has a list of all regions and their identity (otw r/o data fine too)
  
@@ -105, +104 @@

  
  [MS] Really?  This sounds great Patrick.  Let me take a closer look.....  
Excellent.
  
+ [PDH] Obv you need the hw (jvm heap esp, io bandwidth) to support it and the 
GC needs to be tuned properly to reduce pausing (which cause 
timeout/expiration) but 100k is not that much.
+ 
  
  See [[http://bit.ly/4ekN8G|this perf doc]] for some ideas, 20 clients doing 
50k watches each - 1 million watches on a single core standalone server and 
still << 5ms avg response time (async ops, keep that in mind re implementation 
time) YMMV of course but your numbers are well below this. 
  
@@ -121, +122 @@

  
  [MS] Excellent.
  
+ [PDH] think about potential other worst case scenarios, this is key to proper 
operation of the system. Esp around "herd" effects and trying to minimize those.
+ 
  [PDH end]
  
  General recipe implemented: None yet.  Need help.  Was thinking of keeping 
queues up in zk -- queues per regionserver for it to open/close etc.  But the 
list of all regions is kept elsewhere currently and probably for the 
foreseeable future out in our .META. catalog table.  Some further description 
can be found here 
[[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#regionstate|Master Rewrite: 
Region State]]

Reply via email to