[ 
https://issues.apache.org/jira/browse/HBASE-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868903#action_12868903
 ] 

stack commented on HBASE-2531:
------------------------------

... continuing

As to what needs to be done: at a minimum, we need to change the way we name 
dirs in the filesystem.  Currently it's done as follows:

{code}
  /**
   * @param regionName
   * @return the encodedName
   */
  public static int encodeRegionName(final byte [] regionName) {
    return Math.abs(JenkinsHash.getInstance().hash(regionName,
      regionName.length, 0));
  }
{code}

The minimally intrusive thing would be to change the above hashing to return 
a byte array or a String instead of an int.  The function could md5 or sha-1 
the regionName, so there is some relation between the regionname and the hash, 
or it could just return a UUID, a product that cannot be related to the 
regionname.  We'd then need to go through the code base and make sure that 
everywhere we deal with the encoded name of a region, we can handle BOTH the 
new-style byte [] or String format and the old-format int.
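
For illustration only, here is a rough sketch of how a caller might tell the 
two formats apart.  It assumes the new encoded name is a 32-character lowercase 
hex string (e.g. an md5) and the old one is the decimal form of a non-negative 
int.  The class and method names are made up for this example; they are not 
existing HBase API.

```java
// Hypothetical helper, not part of HBase: classifies an encoded region name.
final class EncodedNameFormat {
  // Assumption: new-style names are md5 digests rendered as 32 hex characters.
  static final int NEW_STYLE_LENGTH = 32;

  /** True if the name looks like the old format: a non-negative decimal int. */
  static boolean isLegacy(String encoded) {
    // Integer.MAX_VALUE has 10 digits, so anything longer cannot be an int.
    if (encoded.isEmpty() || encoded.length() > 10) {
      return false;
    }
    for (int i = 0; i < encoded.length(); i++) {
      char c = encoded.charAt(i);
      if (c < '0' || c > '9') {
        return false;
      }
    }
    try {
      Integer.parseInt(encoded); // rejects 10-digit values that overflow int
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  /** True if the name looks like the assumed new format: 32 lowercase hex chars. */
  static boolean isNewStyle(String encoded) {
    if (encoded.length() != NEW_STYLE_LENGTH) {
      return false;
    }
    for (int i = 0; i < encoded.length(); i++) {
      char c = encoded.charAt(i);
      boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
      if (!hex) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isLegacy("1028785192"));
    System.out.println(isNewStyle("d41d8cd98f00b204e9800998ecf8427e"));
  }
}
```

Because an int never needs more than 10 decimal digits, the length check alone 
keeps the two formats from being confused, even when a hex digest happens to 
contain only digits.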

Since we cannot derive the regionname from the UUID, we must be careful we do 
not misplace the UUID.  We'd have to save it into the region's HRegionInfo 
object.

md5/sha-1 would be superior because we can always go from regionname to the 
encoded name.
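
A minimal sketch of the md5 option, using only the JDK's MessageDigest.  
Md5RegionName and encode are hypothetical names, and rendering the digest as 
lowercase hex is an assumption of this example, not something settled here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase: derives an encoded name from a region name.
final class Md5RegionName {
  /** Returns the md5 of the region name as 32 lowercase hex characters. */
  static String encode(byte[] regionName) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(regionName);
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        // Mask to 0..255 so negative bytes still print as two hex digits.
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every compliant JDK ships MD5, so this is not expected to happen.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    byte[] name = "test1,6838000000,1273541236167".getBytes(StandardCharsets.UTF_8);
    // Same input always gives the same encoded name, so it can be recomputed.
    System.out.println(encode(name));
  }
}
```

The point of the sketch is only that the encoded name is a pure function of 
the regionname, so it never has to be stored to be found again.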

I was thinking (and I think Kannan the same) that, rather than a timestamp 
alone as the 3rd component of the regionname, we'd make the 3rd component 
serve two functions: its current one as differentiator between child and 
parent (see previous comment), and also as the name we use for the region 
directory in the filesystem.  Timestamp alone would not be enough.  After this 
afternoon's IRC discussions, UUID isn't suitable.  We'd have to tag on 
something extra.  It could be an md5 of the startkey, or it could just be a 
jenkins hash of the startkey, since a clash of hash-of-startkey+timestamp is 
unlikely.
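
A rough sketch of that combined 3rd component, under stated assumptions: CRC32 
from the JDK stands in for the jenkins hash so the example is self-contained, 
and the timestamp.hash layout, the '.' separator, and all names below are 
hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch, not HBase code: 3rd component doubles as directory name.
final class RegionNameSketch {
  /** Builds "<timestamp>.<hash of startkey>"; CRC32 is a stand-in hash. */
  static String thirdComponent(String startKey, long timestamp) {
    CRC32 crc = new CRC32();
    crc.update(startKey.getBytes(StandardCharsets.UTF_8));
    // getValue() is in 0..2^32-1, so it always fits in 8 hex digits.
    return timestamp + "." + String.format("%08x", crc.getValue());
  }

  /** Builds "<table>,<startkey>,<3rd component>". */
  static String regionName(String table, String startKey, long timestamp) {
    return table + "," + startKey + "," + thirdComponent(startKey, timestamp);
  }

  /** The directory name is whatever follows the last comma of the region name. */
  static String dirFromRegionName(String regionName) {
    return regionName.substring(regionName.lastIndexOf(',') + 1);
  }

  public static void main(String[] args) {
    String name = regionName("test1", "6838000000", 1273541236167L);
    System.out.println(name);
    System.out.println(dirFromRegionName(name));
  }
}
```

Taking the text after the last comma works here because the assumed 3rd 
component never contains a comma, even if the startkey does.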

I liked this latter option because you'd read the regionname and would then be 
able to easily find the region's dir in the filesystem.

This would be a more intrusive change than the one above, where we just change 
the hash function.



> 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-2531
>                 URL: https://issues.apache.org/jira/browse/HBASE-2531
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> Kannan tripped over two regionnames that hashed the same.
> Here is code demo'ing that his two names hash the same:
> {code}
> package org;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.JenkinsHash;
> public class Testing {
>   public static void main(final String [] args) {
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,6838000000,1273541236167")));
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,0520100000,1273541610201")));
>   }
>   /**
>    * @param regionName
>    * @return the encodedName
>    */
>   public static int encodeRegionName(final byte [] regionName) {
>     return Math.abs(JenkinsHash.getInstance().hash(regionName,
>       regionName.length, 0));
>   }
> }
> {code}
> Need new encoding mechanism.  Will need to migrate old regions to new schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.