[jira] Commented: (HBASE-2531) 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes

Kannan Muthukkaruppan (JIRA) Tue, 18 May 2010 17:35:18 -0700

    [ https://issues.apache.org/jira/browse/HBASE-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868935#action_12868935 ]


Kannan Muthukkaruppan commented on HBASE-2531:
----------------------------------------------

Writing out the details of one possible solution. Comments welcome.

Old region names keep the format <tablename>,<startkey>,<timestamp>. For 
region names in this format, the encoded name will continue to be computed 
with the existing JenkinsHash implementation.

New region names have the format <tablename>,<startkey>,<timestamp>,<dirname>, 
where <dirname> is the MD5 hash of <tablename>,<startkey>,<timestamp> and 
serves as both the region's encoded name and its directory name in the FS.
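
To make the naming scheme concrete, here is a minimal sketch of how a new-format name could be built. The class and method names (RegionNameSketch, md5Hex, newRegionName) are hypothetical and not HBase API, and rendering the hash as lowercase hex is an assumption; the proposal only says that <dirname> is the md5 hash of the old-format name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; none of these names exist in HBase.
class RegionNameSketch {

  /** Returns the MD5 digest of the given string as 32 lowercase hex characters. */
  static String md5Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        // Mask to 0..255 so each byte becomes exactly two hex digits.
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Builds <tablename>,<startkey>,<timestamp>,<dirname> (hex encoding assumed). */
  static String newRegionName(String table, String startKey, long timestamp) {
    String oldFormat = table + "," + startKey + "," + timestamp;
    return oldFormat + "," + md5Hex(oldFormat);
  }

  public static void main(String[] args) {
    // The two names from the issue description that clash under the 32-bit hash.
    System.out.println(newRegionName("test1", "6838000000", 1273541236167L));
    System.out.println(newRegionName("test1", "0520100000", 1273541610201L));
  }
}
```

A 128-bit digest makes an accidental clash between two region names far less likely than with the 32-bit encoding, which is the point of the proposal.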

This preserves the property that child regions (splits) will have a region name 
that sorts higher than the parent.

Today, the region that serves a particular key is found by building a search 
key of the form:
  <tablename>,<searchkey>,99999999999999

That's 14 9's. On our test cluster I noticed region names of the form 
test1,0013440000,1273816773769, where the timestamp part is 13 digits. Going 
forward, we'll have region names of the form 
test1,<key>,<13digit-ts>,<md5hash>. The search key built from 14 9's would 
continue to work with the new-format region names because '9' > ',': even if 
all 13 timestamp digits matched, the search key's 14th '9' is compared against 
the ',' that precedes the hash, so the search key still sorts higher.
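
The ordering argument can be checked with a small sketch. It assumes a plain bytewise lexicographic comparison of the whole name, which may differ from the comparator HBase actually uses for its catalog tables; the class name, the helper methods, and the placeholder hash value are all hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; assumes plain bytewise ordering of names.
class SearchKeySortSketch {

  /** Unsigned bytewise lexicographic comparison; a shorter prefix sorts first. */
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;
  }

  static int compare(String a, String b) {
    return compareBytes(a.getBytes(StandardCharsets.US_ASCII),
                        b.getBytes(StandardCharsets.US_ASCII));
  }

  public static void main(String[] args) {
    String searchKey = "test1,0013440000,99999999999999"; // 14 9's
    String oldName = "test1,0013440000,1273816773769";    // 13-digit timestamp
    // Placeholder hash, not a real MD5 of the name above.
    String newName = oldName + ",0123456789abcdef0123456789abcdef";
    // Worst case: a 13-digit timestamp made entirely of 9's, followed by a hash.
    String worstCase = "test1,0013440000,9999999999999,0123456789abcdef0123456789abcdef";

    System.out.println(compare(searchKey, oldName) > 0);
    System.out.println(compare(searchKey, newName) > 0);
    System.out.println(compare(searchKey, worstCase) > 0);
  }
}
```

In the worst case the first 13 timestamp characters are equal, so the comparison is decided by '9' (0x39) in the search key against ',' (0x2C) in the region name.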



> 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-2531
>                 URL: https://issues.apache.org/jira/browse/HBASE-2531
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> Kannan tripped over two regionnames that hashed the same:
> Here is code demo'ing that his two names hash the same:
> {code}
> package org;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.JenkinsHash;
> public class Testing {
>   public static void main(final String [] args) {
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,6838000000,1273541236167")));
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,0520100000,1273541610201")));
>   }
>   /**
>    * @param regionName
>    * @return the encodedName
>    */
>   public static int encodeRegionName(final byte [] regionName) {
>     return Math.abs(JenkinsHash.getInstance().hash(regionName, regionName.length, 0));
>   }
> }
> {code}
> Need new encoding mechanism.  Will need to migrate old regions to new schema.

