[ 
https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123577#comment-13123577
 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Nicolas, responding to your points one-by-one:

1 and 2. This is a perfectly valid key format (choosing readability over space 
efficiency). Hashed keys are good. There is no disagreement here, but as a 
default it is inefficient for people who don't need to print their HBase keys, 
or are willing to parse their keys before printing. The claim that ASCII keys 
compress as well as raw bytes needs evidence.
3. Why should Java integers or longs have anything to do with an MD5 hash, 
which is a sequence of 16 bytes? Do we expect clients to truncate their MD5 
hashes and make sure the high-order bit is 0 (as required to fall in the range 
"00000000" to "7FFFFFFF")? This is a bizarre default and puts a strange burden 
on clients, whose MD5 generator is giving them an arbitrary 16-byte array.
4. What unevenness are you referring to? If you're referring to the unevenness 
that results from using arbitrary keys in a partitioning scheme designed for 
ASCII, it can be quite bad. The lowest character in these ASCII keys, '0', has 
the ordinal 48. So the first region would cover the range [(empty), 48), which 
is 48/256 = 18.75% of the key space, regardless of how many regions there are.
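
To make the 48/256 figure in point 4 concrete, here is a minimal Java sketch. It is not code from RegionSplitter or from the patch; the class name, the method names, and the "row-" + i inputs are made up for illustration. It assumes row keys are raw MD5 digests and that the first split point begins with the ASCII character '0', then counts how many digests sort before that byte.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not part of HBase.
class FirstRegionShare {
    /** Fraction of the one-byte prefix space that sorts before the given unsigned byte value. */
    static double shareBelow(int firstSplitByte) {
        return firstSplitByte / 256.0;
    }

    /** Counts how many of n MD5 digests start with an unsigned byte below firstSplitByte. */
    static int countBelow(int n, int firstSplitByte) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        int count = 0;
        for (int i = 0; i < n; i++) {
            // Made-up input rows; any reasonably varied input behaves the same way.
            byte[] digest = md5.digest(("row-" + i).getBytes(StandardCharsets.UTF_8));
            // Java bytes are signed, so mask to compare as unsigned (HBase sorts keys unsigned).
            if ((digest[0] & 0xFF) < firstSplitByte) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int zero = '0'; // ordinal 48
        int n = 100_000;
        System.out.println("analytic share: " + shareBelow(zero));
        System.out.println("observed share: " + countBelow(n, zero) / (double) n);
    }
}
```

The observed share should land close to the analytic one, since the first byte of an MD5 digest is roughly uniform over 0..255. The number of regions does not appear anywhere in the calculation, which is the point being made above.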

Maybe we should chat on freenode or Monday? I think it would be fast and easy 
to figure out where we disagree if we were chatting in realtime. Also, thanks 
for the input.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the 
> command line or do a rolling split on an existing table. It supports 
> pluggable split algorithms that implement the SplitAlgorithm interface. The 
> only/default SplitAlgorithm is one that assumes keys fall in the range from 
> ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane 
> default, and seems useless to most users. Users are likely to be surprised by 
> the fact that all the region splits occur in the byte range of ASCII 
> characters.
> A better default split algorithm would be one that evenly divides the space 
> of all bytes, which is what this patch does. Making a table with five regions 
> would split at \x33\x33..., \x66\x66..., \x99\x99..., and \xCC\xCC..., with 
> the last region ending at \xFF\xFF....
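
The even division described in the issue can be sketched as follows. This is an illustrative reconstruction, not the code in the attached patches: the class and method names are hypothetical, and it assumes fixed-width keys treated as unsigned big-endian integers, with split point i computed as i * max / numRegions, where max is the all-0xFF key.

```java
import java.math.BigInteger;

// Hypothetical illustration only; not the SplitAlgorithm from the patch.
class UniformByteSplit {
    /** Returns the numRegions - 1 interior split keys, each keyWidth bytes wide. */
    static byte[][] splitPoints(int numRegions, int keyWidth) {
        // Largest key of this width: every byte is 0xFF.
        BigInteger max = BigInteger.ONE.shiftLeft(8 * keyWidth).subtract(BigInteger.ONE);
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            BigInteger point = max.multiply(BigInteger.valueOf(i))
                                  .divide(BigInteger.valueOf(numRegions));
            splits[i - 1] = toFixedWidth(point, keyWidth);
        }
        return splits;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian, zero-padded. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        // toByteArray may carry a leading 0x00 sign byte or be shorter than width.
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    /** Renders a key in the \xNN notation used in the issue description. */
    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (byte[] split : splitPoints(5, 2)) {
            System.out.println(toHex(split));
        }
    }
}
```

For five regions and two-byte keys this prints \x33\x33, \x66\x66, \x99\x99, and \xCC\xCC. The sketch emits only the interior boundaries; \xFF\xFF is the upper end of the key space rather than a split point.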


        
