[ 
https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116774#comment-13116774
 ] 

Jonathan Hsieh commented on HBASE-4489:
---------------------------------------

@Dave

My main suggestion is to fix the code according to the original author's 
intent in one patch (fixing the ASCII-encoded hex 7F problem) and then 
potentially change the semantics/default in a separate patch.  I believe we 
agree that the original author's code appears intended for ASCII hex ranges 
and that the 0x7f max is broken.
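
As a rough illustration of that intent (a Python sketch, not the actual 
RegionSplitter code; the function names and the 8-digit key width are 
assumptions made for this example), splitting the ASCII hex key space evenly 
might look like this.  With the maximum capped at "7FFFFFFF", every key that 
starts with 8 through F sorts after the last split point and so lands in the 
final region.

```python
import bisect

def hex_splits(num_regions, max_key=0xFFFFFFFF, width=8):
    """Return num_regions - 1 evenly spaced split keys as uppercase ASCII hex."""
    span = max_key + 1
    return [format(span * i // num_regions, "0%dX" % width)
            for i in range(1, num_regions)]

def region_of(key, splits):
    """Index of the region a key falls into, given sorted split keys."""
    return bisect.bisect_right(splits, key)

full = hex_splits(4)                 # whole 8-digit hex range
capped = hex_splits(4, 0x7FFFFFFF)   # range capped at 7FFFFFFF

# Keys starting with 8..F all pile into the last region when capped.
print(region_of("9ABCDEF0", full), region_of("9ABCDEF0", capped))
print(region_of("F0000000", full), region_of("F0000000", capped))
```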

In the tables I've encountered, more folks seem to use ASCII row keys than 
binary row keys.  Using uniform byte-range split keys for ASCII character 
ranges would make the new alternate default just as "wrong" for many users.  
The shell now provides a generic mechanism for generating splits for new 
tables (HBASE-4000), so that completely generic approach seems more useful 
when you have knowledge about your particular row keys.

From a code skim, it seems that rollingSplits is "smarter": it takes existing 
row key boundaries and splits them at region midpoints.  This is still 
vulnerable to skewed row key distributions, but at least it takes into 
account the existing row key ranges!
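
A midpoint split of this kind can be sketched as follows.  This is a 
hypothetical Python illustration, not the RegionSplitter implementation: the 
`midpoint` helper is an assumption for this example, and it ignores 
open-ended first/last regions.  It pads both boundaries to the same width and 
averages them as big-endian integers, so the result depends only on the key 
range, not on how rows are actually distributed inside it.

```python
def midpoint(start: bytes, end: bytes) -> bytes:
    """Return the key halfway between two region boundaries.

    Both keys are right-padded with zero bytes to a common width and
    treated as big-endian unsigned integers.
    """
    width = max(len(start), len(end))
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(end.ljust(width, b"\x00"), "big")
    return ((lo + hi) // 2).to_bytes(width, "big")

print(midpoint(b"\x00\x00", b"\x80\x00"))  # halfway through a binary range
print(midpoint(b"a", b"c"))                # halfway between two ASCII keys
```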



                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the 
> command line or do a rolling split on an existing table. It supports 
> pluggable split algorithms that implement the SplitAlgorithm interface. The 
> only/default SplitAlgorithm is one that assumes keys fall in the range from 
> ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane 
> default, and seems useless to most users. Users are likely to be surprised by 
> the fact that all the region splits occur in the byte range of ASCII 
> characters.
> A better default split algorithm would be one that evenly divides the space 
> of all bytes, which is what this patch does. Making a table with five regions 
> would split at \x33\x33..., \x66\x66..., \x99\x99..., \xCC\xCC..., and 
> \xFF\xFF.
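
The boundaries listed in the quoted description can be reproduced with a 
short Python sketch.  It is illustrative only and is not taken from the 
attached patches; the function name and the two-byte key width are 
assumptions.  It divides the largest key of the given width into equal steps, 
so the final value, \xFF\xFF, is the top of the key space rather than an 
interior split point.

```python
def uniform_byte_boundaries(num_regions, width=2):
    """Evenly spaced upper boundaries over all byte strings of `width` bytes."""
    max_key = 256 ** width - 1  # e.g. 0xFFFF for width=2
    return [(max_key * i // num_regions).to_bytes(width, "big")
            for i in range(1, num_regions + 1)]

for boundary in uniform_byte_boundaries(5):
    print(boundary.hex().upper())
```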

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
