Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "ZooKeeper/HBaseUseCases" page has been changed by PatrickHunt. http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases?action=diff&rev1=11&rev2=12 -------------------------------------------------- === Case 1 === Summary: HBase Table State and Schema Changes + A table has a schema and state (online, read-only, etc.). When we say thousands of RegionServers, we're trying to give a sense of how many watchers we'll have on the znode that holds table schemas and state. When we say hundreds of tables, we're trying to give some sense of how big the znode content will be... say 256 bytes of schema -- we'll only record difference from default to minimize whats up in zk -- and then state I see as being something like zk's four-letter words only they can be compounded in this case. So, 100s of tables X 1024 schema X (2 four-letter words each on average) at the outside makes for about a MB of data that thousands of regionservers are watching. That OK? + Expected scale: Thousands of RegionServers watching ready to react to changes with about 100 tables each of which can have 1 or 2 states and an involved schema + [MS] I was thinking one znode of state and schema. RegionServers would all have a watch on it. 100s of tables means that a schema change on any table would trigger watches on 1000s of RegionServers. That might be OK though because any RegionServer could be carrying a Region from the edited table. - [PDH] the link is very useful, let's do the math here so that it's more clear, correct where I get it wrong. So if I get this right: - * 100 tables - * each table may have 1000+ regions (let's say 1000 for this calculation) - * 1 region server will carry a region from each table (per the link) - * if I understand correctly, region servers don't "own" the region znode in the table, just watch tables for regions it carries - [MS] Number of regions is not pertinent here. Whats relevant is that a table has a schema and state (online, read-only, etc.). When I say thousands of RegionServers, I'm trying to give a sense of how many watchers we'll have on the znode that holds table schemas and state. When I say hundreds of tables, I'm trying to give some sense of how big the znode content will be... say 256 bytes of schema -- we'll only record difference from default to minimize whats up in zk -- and then state I see as being something like zk's four-letter words only they can be compounded in this case. So, 100s of tables X 1024 schema X (2 four-letter words each on average) at the outside makes for about a MB of data that thousands of regionservers are watching. That OK? + [PDH] My original assumption was that each table has it's own znode (and would still be my advice). In general you don't want to store very much data per znode - the reason being that writes will slow (think of this -- client copies data to ZK server, which copies data to ZK leader, which broadcasts data to all servers in the cluster, which then commit allowing the original server to respond to the client). If the znode is changing infrequently, then no big deal, but in general you don't want to do this. Also, the ZK server has a 1mb max data size by default, so if you increase the number of tables (etc...) you will bump this at some point. + [PDH Hence my original assumption, and suggestion. Consider having a znode per table, rather than a single znode. It's more scalable and should be better in general. That's up to you though - 1 znode will work too. - [PDH] So this means a region server watches each of 100 tables. - * 100 * 1000 = 100k watches, as each region server watching 100 table nodes - * watches typically fire as a group/table (ie on/off/ro/drop each table) - * 1000 watches would fire notifying 1000 region servers each time a table changes - [MS] I was thinking one znode of state and schema. RegionServers would all have a watch on it. 100s of tables means that a schema change on any table would trigger watches on 1000s of RegionServers. That might be OK though because any RegionServer could be carrying a Region from the edited table. + [PDH] A single table can change right? Not all the tables necessarily change state at the same time? Then splitting into multiple znodes makes even more sense - you would only be changing (writing) for things that change, even better is that the watchers will know the exact table that changed rather than determining by reading the data and diffing.... I'm no expert on hbase but from a typical ZK use case this is better. Obv this is a bit more complex than a single znode, also there are more (separate) notifications that will fire instead of a single one.... so you'd have to think through your use case (you could have a toplevel "state" znode that brings down all the tables in the case where all the tables need to go down... then you wouldn't have to change each table individually for this case (all tables down for whatever reason). General recipe implemented: A better description of problem and sketch of the solution can be found at [[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#tablestate|Master Rewrite: Table State]] [PDH] this is essentially "dynamic configuration" usecase - we are telling each region server the state of the table containing a region it manages, when the master changes the state the watchers are notified [MS] Is "dynamic configuration' usecase a zk usecase type described somewhere? + + [PDH] What we have is [[http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_outOfTheBox|here]]. Not much to it really - both for name service and dynamic config you are creating znodes that store relevant data, ZK clients can read/write/watch those nodes. === Case 2 === Summary: HBase Region Transitions from unassigned to open and from open to unassigned with some intermediate states @@ -87, +84 @@ [MS] ZK will do the increment for us? This looks good too. + [PDH] Right, the "increment" is using the SEQUENTIAL flag on create + Any metadata stored for a region znode (ie to identify)? As long as size is small no problem. (if a bit larger consider a /regions/<regionX> znodes which has a list of all regions and their identity (otw r/o data fine too) @@ -105, +104 @@ [MS] Really? This sounds great Patrick. Let me take a closer look..... Excellent. + [PDH] Obv you need the hw (jvm heap esp, io bandwidth) to support it and the GC needs to be tuned properly to reduce pausing (which cause timeout/expiration) but 100k is not that much. + See [[http://bit.ly/4ekN8G|this perf doc]] for some ideas, 20 clients doing 50k watches each - 1 million watches on a single core standalone server and still << 5ms avg response time (async ops, keep that in mind re implementation time) YMMV of course but your numbers are well below this. @@ -121, +122 @@ [MS] Excellent. + [PDH] think about potential other worst case scenarios, this is key to proper operation of the system. Esp around "herd" effects and trying to minimize those. + [PDH end] General recipe implemented: None yet. Need help. Was thinking of keeping queues up in zk -- queues per regionserver for it to open/close etc. But the list of all regions is kept elsewhere currently and probably for the foreseeable future out in our .META. catalog table. Some further description can be found here [[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#regionstate|Master Rewrite: Region State]]
