Thanks, Alex, for your answer.

I am not yet at a stage where I can measure performance (I am still at the db design stage, doing the initial population), but my understanding was that randomizing the keys was a way of avoiding key hotspots. To simplify, let's assume that I have documents attached to users and that I need to search them by date. I have two tables: one, "Random", optimized for random access, and one, "Indexes", optimized for sequential access by scanners.

'Random' stores document details:
Random:
doc_1-> Title:"some title1",Text:"some longer text1",user:1,CreateDate:2010-01-01
doc_2-> Title:"some title2",Text:"some longer text2",user:1,CreateDate:2010-01-02
....

'Indexes' stores document indexes (for instance here is an index on date and date+user):
date_20100101:id:1
date_20100102:id:2
...
date_user1_20100101:id:1
date_user1_20100102:id:2
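
The index layout above can be sketched as follows. This is only an illustration in plain Python, not HBase client code: the helper names index_key and prefix_scan are hypothetical, dates are assumed to be encoded as yyyymmdd, and a sorted list stands in for the lexicographically sorted rows that a scanner would walk.

```python
from bisect import bisect_left

def index_key(date, doc_id, user=None):
    """Build an index row key such as 'date_20100101:id:1' (hypothetical helper)."""
    prefix = f"date_user{user}_" if user is not None else "date_"
    return f"{prefix}{date}:id:{doc_id}"

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, walking the sorted keys in order
    from the first possible match, much as a scanner walks a key range."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        hits.append(key)
    return hits

# The four index rows from the example, kept in sorted order.
keys = sorted([
    index_key("20100101", 1),
    index_key("20100102", 2),
    index_key("20100101", 1, user=1),
    index_key("20100102", 2, user=1),
])

# All documents of user 1 created in January 2010.
hits = prefix_scan(keys, "date_user1_201001")
doc_ids = [key.rsplit(":", 1)[1] for key in hits]
```

The document IDs recovered from the scan are what would then be fetched one by one from the 'Random' table.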


As a user typically adds many documents in a short period of time, the documents returned by the scanner are usually also adjacent, in the same order, in the Random table (without randomization). So, once I get the IDs of the documents from the scanner query, I need to fork concurrent threads/processes to get the document details; that (from what I understand) would create a key hotspot in the 'Random' table. Is my reasoning above correct? My feeling is that a typical HBase application uses both scanner and random access patterns alternately.
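
A minimal sketch of the kind of key randomization under discussion, in plain Python rather than HBase client code. The names randomized_key, bucket_of and NUM_BUCKETS are hypothetical, and the four buckets are only a stand-in for regions split on the first character of the key; real region boundaries will differ.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical stand-in for the number of regions

def randomized_key(doc_id):
    """Prefix the original key with part of its MD5 hex digest, so that
    adjacent ids (doc_1, doc_2, ...) no longer sort next to each other."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return f"{digest[:8]}_{doc_id}"

def bucket_of(key):
    """Map a randomized key to a bucket using its first hex digit."""
    return int(key[0], 16) * NUM_BUCKETS // 16

doc_ids = [f"doc_{i}" for i in range(1, 1001)]
keys = [randomized_key(doc_id) for doc_id in doc_ids]

# Without the prefix every key starts with "doc_" and would sit in one key
# range; with it, the same 1000 documents spread over all buckets.
counts = Counter(bucket_of(key) for key in keys)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

Because the prefix is derived from the ID itself, a reader that already knows the document ID (for example from the index scan) can recompute the full key before issuing the random read.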

Another question I have, until I can test this, is how many random lookups HBase will withstand. The scanner will present links to the documents (paging implementation), so I am not sure what a realistic number of documents per page could be: 10, 20 or 100? As (at least) one new socket (is that true?) is created for each random access request, I am afraid such a design could bring the HBase layer down (until maybe http://issues.apache.org/jira/browse/HBASE-1845 is fixed).

Thanks
TuX



Alex Baranov wrote:
Hello Tux,

Accessing a table in a "random access" manner is not the reason for
randomizing keys. You will likely need to randomize your keys only for
better performance when importing an existing large dataset into HBase.
Otherwise, if you don't have an insertion rate higher than 20K records/sec, I
wouldn't suggest worrying about this issue. It would be great if you could
tell us more about your use case.

MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash) are
all mechanisms you might consider.

Alex Baranau

sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tuxrace...@gmail.com> wrote:

Hello List,

I'll be accessing a table mainly in random access and I am looking for an
efficient way of randomizing the keys.
I thought about an MD5 hash of the ID of the record, but as MD5 returns a
string of chars [0-9A-F] I was wondering if there was a better method to
use.
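
One possible answer to the hex-string concern, sketched in Python with the standard hashlib module: since HBase row keys are plain byte arrays, the raw 16-byte MD5 digest could be used directly instead of its 32-character hex form. The function names below are hypothetical, and note that Python's hexdigest() produces lowercase [0-9a-f].

```python
import hashlib

def md5_hex_key(record_id):
    """MD5 of the record ID as a 32-character hex string, encoded to bytes."""
    return hashlib.md5(record_id.encode("utf-8")).hexdigest().encode("ascii")

def md5_raw_key(record_id):
    """MD5 of the record ID as the raw 16-byte digest."""
    return hashlib.md5(record_id.encode("utf-8")).digest()

hex_key = md5_hex_key("doc_1")
raw_key = md5_raw_key("doc_1")

# Same hash value, half the key length.
print(len(hex_key), len(raw_key))
```

The raw form halves the key size at the cost of keys that are no longer printable, which can make debugging from a shell less convenient.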

Thanks
TuX


