[ https://issues.apache.org/jira/browse/HBASE-82?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-82:
-----------------------

    Attachment: Perf.java

I need to be able to use byte arrays as keys in Maps.  Byte arrays alone don't 
work as Map keys since byte [] compares using object identity rather than byte 
content.  I need this functionality because rows, region names, etc., are now 
byte arrays where before they were Comparable Text.  I could wrap the byte 
array in an ImmutableBytesWritable (IBW) once the byte array arrives 
server-side and use this as the key, since IBW is Comparable.  That'd work.
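
To illustrate the identity problem and the wrapping fix, here is a minimal 
sketch.  BytesKey is a hypothetical stand-in for a content-comparing wrapper 
such as ImmutableBytesWritable, not the real class, and the unsigned 
byte-by-byte ordering in compareTo is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical wrapper (not the real IBW) that compares by byte content. */
final class BytesKey implements Comparable<BytesKey> {
  private final byte[] bytes;

  BytesKey(byte[] bytes) {
    // Defensive copy so the key cannot change while it sits in a Map.
    this.bytes = bytes.clone();
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }

  @Override
  public int compareTo(BytesKey other) {
    // Assumption: unsigned lexicographic order, shorter prefix sorts first.
    int n = Math.min(bytes.length, other.bytes.length);
    for (int i = 0; i < n; i++) {
      int a = bytes[i] & 0xff;
      int b = other.bytes[i] & 0xff;
      if (a != b) {
        return a - b;
      }
    }
    return bytes.length - other.bytes.length;
  }
}

final class ByteKeyDemo {
  /** Looks up an equal-content but distinct byte [] in a Map keyed by raw arrays. */
  static boolean rawArrayLookupWorks() {
    Map<byte[], String> map = new HashMap<>();
    map.put("row1".getBytes(StandardCharsets.UTF_8), "value");
    return map.containsKey("row1".getBytes(StandardCharsets.UTF_8));
  }

  /** Same lookup, but with the arrays wrapped in BytesKey. */
  static boolean wrappedLookupWorks() {
    Map<BytesKey, String> map = new HashMap<>();
    map.put(new BytesKey("row1".getBytes(StandardCharsets.UTF_8)), "value");
    return map.containsKey(new BytesKey("row1".getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) {
    System.out.println("raw byte [] lookup: " + rawArrayLookupWorks());
    System.out.println("wrapped lookup:     " + wrappedLookupWorks());
  }
}
```

The raw-array lookup misses because HashMap falls back on Object.equals and 
Object.hashCode for arrays; the wrapped lookup hits because the wrapper 
delegates to Arrays.equals and Arrays.hashCode.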

But I took a look at using the hash of the byte array, as an Integer, as the 
Map key.  For sure, if I use a simple hash of the byte array, the same one we 
would get with IBW -- see WritableComparator.hashBytes, which IBW (and Text) 
uses -- it's faster, especially if invocations are < 100k: it's 3 to 4 times 
as fast.  At about 1M iterations the difference is less: using the byte array 
hash Integer instead of IBW is only about 20% faster.  I guess HotSpot is what 
makes for the improvement but, for sure, it's taking its time warming up.  
Since I can make other savings -- e.g. get rid of the rowsToLocks Map -- I'm 
going to go with using a hash code Integer as keys in the locksToRows Map.
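
A rough sketch of the hash-as-key idea follows.  The simpleHash method is my 
assumption of what a multiply-by-31 byte hash in the style of 
WritableComparator.hashBytes looks like; the class, method, and map names are 
hypothetical.  Note that two different rows with the same hash code would 
land on the same key, and this sketch does nothing about that.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class HashKeyDemo {
  /**
   * Simple multiply-by-31 hash over the bytes. Assumed to mirror the style of
   * WritableComparator.hashBytes; it yields the same value as
   * java.util.Arrays.hashCode(byte[]).
   */
  static int simpleHash(byte[] bytes) {
    int hash = 1;
    for (byte b : bytes) {
      hash = 31 * hash + b;
    }
    return hash;
  }

  /** Stores a value under the row's hash code, then reads it back via an equal-content row. */
  static String putAndGet(String row, String value) {
    // Hypothetical map keyed by hash code Integer instead of a wrapper object.
    Map<Integer, String> byHash = new HashMap<>();
    byHash.put(simpleHash(row.getBytes(StandardCharsets.UTF_8)), value);
    // A fresh byte [] with the same content hashes to the same Integer key.
    return byHash.get(simpleHash(row.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) {
    System.out.println(putAndGet("row1", "lock-1"));
  }
}
```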

A Jenkins hash is more robust than the simple hash, it's better suited to the 
types of keys we'll be seeing, and it's better than CRCs, etc. -- see 
http://www.ddj.com/184410284 -- but it's more expensive to compute.  In my 
testing, it took about the same time as IBW at 100k or less, but at 1M it took 
~twice as long.
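
For reference, here is a sketch of Jenkins's one-at-a-time hash, the simplest 
of his functions.  It is only an illustration of the extra per-byte mixing 
work compared with a multiply-by-31 hash; it is not necessarily the Jenkins 
variant used in the tests above, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class JenkinsDemo {
  /** Bob Jenkins's one-at-a-time hash over a byte array. */
  static int oneAtATime(byte[] key) {
    int h = 0;
    for (byte b : key) {
      h += (b & 0xff);   // treat each byte as unsigned
      h += (h << 10);
      h ^= (h >>> 6);    // unsigned shift, as in the C original
    }
    // Final avalanche so the last bytes affect all output bits.
    h += (h << 3);
    h ^= (h >>> 11);
    h += (h << 15);
    return h;
  }

  public static void main(String[] args) {
    byte[] key = "a".getBytes(StandardCharsets.US_ASCII);
    System.out.println(Integer.toHexString(oneAtATime(key)));
  }
}
```

Each input byte costs three mixing operations plus a final avalanche step, 
versus one multiply and one add for the simple hash.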

I did various tests.  I'll attach the last code that I was using.  It was 
reading a file of 750k unique-ish URLs and hashing these.  The code does 
HRegionServer.batchUpdate-like things inserting into a Map in case the 
hashCode-making is lazy (the put will force the hash code calculation).

I also tried wrapping the byte array in a ByteBuffer.  This was about 20% or 
more slower than IBW.  I'm guessing its hashing code is more involved than 
that of WritableComparator.
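
The ByteBuffer approach can be sketched as below.  ByteBuffer.wrap, equals, 
and hashCode are standard java.nio calls that compare and hash the remaining 
bytes by content; the class and method names here are hypothetical, and this 
is not the code from Perf.java.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ByteBufferKeyDemo {
  /** Builds a small Map keyed by ByteBuffer-wrapped rows. */
  static Map<ByteBuffer, String> sample() {
    Map<ByteBuffer, String> map = new HashMap<>();
    map.put(ByteBuffer.wrap("row1".getBytes(StandardCharsets.UTF_8)), "value");
    return map;
  }

  /** Wraps the row (no copy is made) and looks it up by content. */
  static String lookup(Map<ByteBuffer, String> map, byte[] row) {
    return map.get(ByteBuffer.wrap(row));
  }

  public static void main(String[] args) {
    byte[] row = "row1".getBytes(StandardCharsets.UTF_8);
    System.out.println(lookup(sample(), row));
  }
}
```

One caveat: a ByteBuffer's hash code depends on its remaining bytes, so 
changing the position or the backing array of a buffer that is already a Map 
key would break lookups.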

> row keys should be array of bytes with a specified comparator
> -------------------------------------------------------------
>
>                 Key: HBASE-82
>                 URL: https://issues.apache.org/jira/browse/HBASE-82
>             Project: Hadoop HBase
>          Issue Type: Wish
>            Reporter: Jim Kellerman
>            Assignee: stack
>             Fix For: 0.2.0
>
>         Attachments: 82-v2.patch, 82-v3.patch, 82-v4.patch, 82.patch, 
> Perf.java
>
>
> I have heard from several people that row keys in HBase should be less 
> restricted than hadoop.io.Text.
> What do you think?
> At the very least, a row key has to be a WritableComparable. This would lead 
> to the most general case being either hadoop.io.BytesWritable or 
> hbase.io.ImmutableBytesWritable. The primary difference between these two 
> classes is that hadoop.io.BytesWritable by default allocates 100 bytes and if 
> you do not pay attention to the length, (BytesWritable.getSize()), converting 
> a String to a BytesWritable and vice versa can become problematic. 
> hbase.io.ImmutableBytesWritable, in contrast only allocates as many bytes as 
> you pass in and then does not allow the size to be changed.
> If we were to change from Text to a non-text key, my preference would be for 
> ImmutableBytesWritable, because it has a fixed size once set, and operations 
> like get, etc. do not have to do something like System.arraycopy where you 
> specify the number of bytes to copy.
> Your comments, questions are welcome on this issue. If we receive enough 
> feedback that Text is too restrictive, we are willing to change it, but we 
> need to hear what would be the most useful thing to change it to as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
