[
https://issues.apache.org/jira/browse/HBASE-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868903#action_12868903
]
stack commented on HBASE-2531:
------------------------------
... continuing
As to what needs to be done: at a minimum, we need to change the way we name
region dirs in the filesystem. Currently it's done as follows:
{code}
/**
 * @param regionName
 * @return the encodedName
 */
public static int encodeRegionName(final byte [] regionName) {
  return Math.abs(JenkinsHash.getInstance().hash(regionName,
    regionName.length, 0));
}
{code}
The minimally intrusive change would be to have the function above return a
byte array or a String instead of an int. It could md5 or sha-1 the regionName,
so there is still some relation between the regionname and the hash, or it
could just return a UUID, which cannot be related to the regionname. We'd then
need to go through the code base and make sure that everywhere we deal with the
encoded name of a region we can handle BOTH the new-style byte [] or String
format and the old-style int format.
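To illustrate what handling both formats might involve, here is a minimal
sketch that classifies an encoded name as old-style or new-style. The class
and method names are hypothetical, and it assumes the new-style name would be
a 32-character lowercase hex digest (as an md5 would produce); nothing here is
existing HBase code.

```java
final class EncodedNameFormat {
    enum Format { LEGACY_INT, HEX_DIGEST, UNKNOWN }

    /** Hypothetical helper: decide which naming scheme an encoded name uses. */
    static Format classify(String encodedName) {
        try {
            // Old style: the decimal form of a Java int. Note that
            // Math.abs(Integer.MIN_VALUE) stays negative, so a leading
            // minus sign is possible in principle.
            Integer.parseInt(encodedName);
            return Format.LEGACY_INT;
        } catch (NumberFormatException e) {
            // Not an int; fall through and test for the assumed new format.
        }
        // Assumption: new style is 32 lowercase hex characters.
        if (encodedName.matches("[0-9a-f]{32}")) {
            return Format.HEX_DIGEST;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("1028785192"));
        System.out.println(classify("d41d8cd98f00b204e9800998ecf8427e"));
    }
}
```

Under that assumption the two formats cannot be confused, because an int
prints as at most 11 characters while the digest is always 32.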
Since we cannot derive the regionname from a UUID, we must be careful not to
misplace the UUID. We'd have to save it into the region's HRegionInfo
object.
md5/sha-1 would be superior because we can always go from regionname to the
encoded name.
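As a rough sketch of the md5 option, the function could look something like
the following. It uses only the JDK's MessageDigest; the class name and the
choice of a lowercase hex String as the return type are assumptions for
illustration, not a settled design.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5RegionName {
    /** Hypothetical variant of encodeRegionName: hex md5 of the full name. */
    static String encodeRegionName(final byte[] regionName) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(regionName);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The two regionnames from the issue description.
        System.out.println(encodeRegionName(
            "test1,6838000000,1273541236167".getBytes(StandardCharsets.UTF_8)));
        System.out.println(encodeRegionName(
            "test1,0520100000,1273541610201".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The output is always 32 hex characters, and the same regionname always yields
the same encoded name, which is the property a UUID lacks.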
I was thinking (and I think Kannan the same) that rather than a timestamp alone
as the 3rd component of the regionname, we'd make the 3rd component serve two
functions: its current one as differentiator between child and parent (see
previous comment), and also as the name we use for the region directory in the
filesystem. Timestamp alone would not be enough. After this afternoon's IRC
discussions, UUID isn't suitable. We'd have to tag on something extra. It
could be an md5 of the startkey, or it could just be a jenkins hash of the
startkey, since a clash of hash-of-startkey+timestamp is unlikely.
I liked this latter option because you'd read the regionname and would then be
able to easily find the region's dir in the filesystem.
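A sketch of what such a 3rd component could look like follows. To keep it
runnable without HBase on the classpath it swaps in the JDK's CRC32 for the
jenkins hash; the "timestamp.hash" layout and all names are hypothetical
and only meant to show the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class RegionNameSuffix {
    /**
     * Hypothetical 3rd component: timestamp plus a 32-bit hash of the
     * startkey. CRC32 is a stand-in here for the jenkins hash.
     */
    static String thirdComponent(byte[] startKey, long timestamp) {
        CRC32 crc = new CRC32();
        crc.update(startKey, 0, startKey.length);
        // Zero-pad to 8 hex digits so the component has a fixed shape.
        return timestamp + "." + String.format("%08x", crc.getValue());
    }

    /** Assumed regionname layout: table,startkey,thirdComponent. */
    static String regionName(String table, String startKey, long timestamp) {
        byte[] keyBytes = startKey.getBytes(StandardCharsets.UTF_8);
        return table + "," + startKey + "," + thirdComponent(keyBytes, timestamp);
    }

    public static void main(String[] args) {
        String name = regionName("test1", "6838000000", 1273541236167L);
        // The region dir would simply be everything after the last comma.
        String dir = name.substring(name.lastIndexOf(',') + 1);
        System.out.println(name);
        System.out.println(dir);
    }
}
```

With this shape, the directory name can be read straight off the regionname
with no extra lookup.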
This would be a more intrusive change than the one above, where we just change
the hash function.
> 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes
> ----------------------------------------------------------------------------
>
> Key: HBASE-2531
> URL: https://issues.apache.org/jira/browse/HBASE-2531
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Blocker
> Fix For: 0.21.0
>
>
> Kannan tripped over two regionnames that hashed the same:
> Here is code demo'ing that his two names hash the same:
> {code}
> package org;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.JenkinsHash;
> public class Testing {
>   public static void main(final String [] args) {
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,6838000000,1273541236167")));
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,0520100000,1273541610201")));
>   }
>   /**
>    * @param regionName
>    * @return the encodedName
>    */
>   public static int encodeRegionName(final byte [] regionName) {
>     return Math.abs(JenkinsHash.getInstance().hash(regionName,
>       regionName.length, 0));
>   }
> }
> {code}
> Need new encoding mechanism. Will need to migrate old regions to new schema.