[ https://issues.apache.org/jira/browse/HBASE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590886#comment-13590886 ]
Devaraj Das commented on HBASE-4755:
------------------------------------
Posting the notes as a comment on the jira (for easy access :) )
At a high level, this has a few parts to it:
Creation of table flow (assuming pre-split table)
---------------------------------------------------------------------
0. The HBase layer has policies that determine where to place the region
files. An AssignmentDomain is defined when a new table is created (today it is
all the nodes in the cluster).
1. The HMaster chooses the primary RS on a round-robin basis; the
secondary/tertiary RSs are chosen on a rack different from the primary's (best
effort, with a further best effort to place the secondary and tertiary on the
same rack as each other). A sketch of this selection follows the list.
2. The meta table is updated with the information about the location mapping
for the added regions (also sketched below).
3. The RegionServers are then asked to open the regions, and they get the
favored-nodes information as well. The mapping from regions to favored nodes
is cached in the regionservers.
4. This information is then passed to the filesystem (HDFS-2576) when the
regionservers create new files. For now, a create API has been added in HDFS
that accepts a favored-nodes list as an additional argument (see the last
sketch below).
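To make step 1 concrete, here is a rough sketch of the favored-node selection.
This is an illustration only, not the code in the patch: the class and helper
names are made up, and servers/racks are plain strings for brevity.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch only: picks a primary round-robin, then a secondary/tertiary off the
// primary's rack (best effort), with secondary and tertiary sharing a rack.
// Assumes at least two servers in the cluster.
public class FavoredNodePicker {
  private final Random random = new Random();

  public List<String> pickFavoredNodes(List<String> servers, int regionIdx,
                                       Map<String, String> rackOf) {
    String primary = servers.get(regionIdx % servers.size()); // round-robin
    String primaryRack = rackOf.get(primary);

    // Candidates on a different rack from the primary (best effort).
    List<String> pool = new ArrayList<String>();
    for (String s : servers) {
      if (!s.equals(primary) && !rackOf.get(s).equals(primaryRack)) {
        pool.add(s);
      }
    }
    if (pool.isEmpty()) {                 // single-rack cluster: fall back
      pool = allBut(servers, primary);
    }
    String secondary = pool.get(random.nextInt(pool.size()));

    // Best effort: tertiary on the same rack as the secondary.
    List<String> tertiaryPool = new ArrayList<String>();
    for (String s : pool) {
      if (!s.equals(secondary) && rackOf.get(s).equals(rackOf.get(secondary))) {
        tertiaryPool.add(s);
      }
    }
    if (tertiaryPool.isEmpty()) {
      tertiaryPool = allBut(pool, secondary);
    }
    // Degenerate tiny cluster: reuse the secondary rather than fail.
    String tertiary = tertiaryPool.isEmpty()
        ? secondary : tertiaryPool.get(random.nextInt(tertiaryPool.size()));

    List<String> favored = new ArrayList<String>();
    favored.add(primary);
    favored.add(secondary);
    favored.add(tertiary);
    return favored;
  }

  private static List<String> allBut(List<String> servers, String excluded) {
    List<String> copy = new ArrayList<String>(servers);
    copy.remove(excluded);
    return copy;
  }
}
{code}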
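For step 2, the meta update amounts to a Put against the catalog table,
roughly like this. The info:fn column qualifier and the comma-joined encoding
are assumptions for illustration, not necessarily what the patch uses.
{code}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: records the favored-nodes list for one region in meta.
public class MetaFavoredNodesWriter {
  public void updateMeta(HTable metaTable, byte[] regionName,
                         List<String> favoredNodes) throws IOException {
    StringBuilder joined = new StringBuilder();
    for (String fn : favoredNodes) {
      if (joined.length() > 0) joined.append(',');
      joined.append(fn);                  // e.g. "host1:60020,host2:60020,..."
    }
    Put put = new Put(regionName);
    // Hypothetical column: family "info", qualifier "fn".
    put.add(Bytes.toBytes("info"), Bytes.toBytes("fn"),
            Bytes.toBytes(joined.toString()));
    metaTable.put(put);
  }
}
{code}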
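And for step 4, assuming a DistributedFileSystem#create overload that takes an
InetSocketAddress[] of favored nodes along the lines of HDFS-2576 (the exact
signature may differ from what finally lands), the regionserver-side call
would look something like:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch only: passes the region's favored nodes down to HDFS on file create,
// so block replicas land on the primary/secondary/tertiary datanodes.
public class FavoredNodeFileCreator {
  public FSDataOutputStream create(DistributedFileSystem fs, Path file,
      InetSocketAddress[] favoredNodes) throws IOException {
    return fs.create(file,
        FsPermission.getDefault(),
        true,                        // overwrite
        4096,                        // buffer size
        (short) 3,                   // replication
        fs.getDefaultBlockSize(),
        null,                        // no progress callback
        favoredNodes);               // the additional hint argument
  }
}
{code}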
Failure recovery
---------------------------------------------------------------------
When the primary RS dies, ideally we should assign the regions on that RS to
their respective secondaries or tertiaries (say, whichever of the remaining
two has less load or fewer primaries, as in the sketch below). At some point
the maintenance tool should run and set the mapping in meta right (three RS
locations, etc.).
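A minimal sketch of that failover choice, with region count standing in for
whatever load metric we settle on (the map parameter here is hypothetical):
{code}
import java.util.Map;

// Sketch only: on primary death, pick whichever of the remaining two favored
// regionservers currently carries less load.
public class FailoverChooser {
  public String chooseTarget(String secondary, String tertiary,
                             Map<String, Integer> regionCountOf) {
    int secondaryLoad = regionCountOf.get(secondary);
    int tertiaryLoad = regionCountOf.get(tertiary);
    return secondaryLoad <= tertiaryLoad ? secondary : tertiary;
  }
}
{code}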
Maintenance of the metadata & region locations
---------------------------------------------------------------------
Over a period of time, nodes may fail and/or the hdfs-balancer may run, either
of which can degrade the locality set up in the above steps. Periodically, a
tool would run that inspects the meta table and checks whether the mapping is
still optimal. The tool (as per the code in Facebook's branch) takes a couple
of options to optimize for - maximum locality or minimum region reassignment -
and uses the Munkres (Hungarian) algorithm to assign secondary/tertiary RSs to
regions. There is also a chore that periodically checks meta (based on
timestamps) for updates to the region locations and refreshes the
assignment-plans, roughly as sketched below.
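Roughly, that chore could be shaped like this; Chore is the periodic-task base
class HBase already has, while the scan/recompute helpers below are
placeholders for the real meta-scanning and plan-generation logic:
{code}
import org.apache.hadoop.hbase.Chore;
import org.apache.hadoop.hbase.Stoppable;

// Sketch only: rescans meta periodically and refreshes the assignment-plans
// when the region-location columns have newer timestamps than the last plan.
public class AssignmentPlanRefreshChore extends Chore {
  private long lastPlanTimestamp = 0;

  public AssignmentPlanRefreshChore(int periodMillis, Stoppable stopper) {
    super("AssignmentPlanRefreshChore", periodMillis, stopper);
  }

  @Override
  protected void chore() {
    long newestMetaUpdate = newestLocationTimestampInMeta(); // placeholder
    if (newestMetaUpdate > lastPlanTimestamp) {
      recomputeAssignmentPlans();                            // placeholder
      lastPlanTimestamp = newestMetaUpdate;
    }
  }

  private long newestLocationTimestampInMeta() { return 0; /* scan meta */ }
  private void recomputeAssignmentPlans() { /* rebuild the plans */ }
}
{code}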
In 0.89-fb and in prior versions of HBase, the hbase balancer runs when
regionservers report heartbeats, and it basically ensures that the precomputed
assignment-plans are met (regions might get unassigned from their current
regionservers, etc.).
I think it makes sense to have the above tool be part of the locality-aware
loadbalancer itself, since the loadbalancer today runs asynchronously and
could do a lot more work without impacting heartbeat latencies, etc. It would
also address the conflict issue that [~jmhsieh] raised in his previous
comment. I'll look at this aspect some more.
TODO:
Handle non-pre-split tables
> HBase based block placement in DFS
> ----------------------------------
>
> Key: HBASE-4755
> URL: https://issues.apache.org/jira/browse/HBASE-4755
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 0.94.0
> Reporter: Karthik Ranganathan
> Assignee: Christopher Gist
> Priority: Critical
> Attachments: 4755-wip-1.patch, hbase-4755-notes.txt
>
>
> As is, the feature is only useful for HBase clusters that care about data
> locality on regionservers, but it can also enable a lot of nice features
> down the road.
> The basic idea is as follows: instead of letting HDFS determine where to
> replicate data (r=3) by placing blocks on various nodes, it is better to let
> HBase do so by providing hints to HDFS through the DFS client. That way,
> instead of replicating data at the block level, we can replicate data at a
> per-region level (each region owned by a primary, a secondary and a tertiary
> regionserver). This is better for two things:
> - Can make region failover faster on clusters which benefit from data affinity
> - On large clusters with random block placement policy, this helps reduce the
> probability of data loss
> The algo is as follows:
> - Each region in META will have 3 columns which are the preferred
> regionservers for that region (primary, secondary and tertiary)
> - Preferred assignment can be controlled by a config knob
> - Upon cluster start, HMaster will enter a mapping from each region to 3
> regionservers (random hash, could use current locality, etc.)
> - The load balancer would assign out regions preferring region assignments to
> primary over secondary over tertiary over any other node
> - Periodically (say weekly, configurable) the HMaster would run a locality
> check and make sure the map it has from regions to regionservers is optimal.
> Down the road, this can be enhanced to control region placement in the
> following cases:
> - Mixed hardware SKUs where some regionservers can hold fewer regions
> - Load balancing across tables where we don't want multiple regions of a
> table to get assigned to the same regionservers
> - Multi-tenancy, where we can restrict the assignment of the regions of some
> table to a subset of regionservers, so an abusive app cannot take down the
> whole HBase cluster.