Re: random access and hotspots

TuX RaceR Thu, 11 Mar 2010 05:37:28 -0800

Thanks, Alex, for your answer.

I am not yet at a stage where I can measure the performance (I am still at the db design stage, initial population), but my understanding was that randomizing the keys was a way of avoiding key hotspots. To simplify, let's assume that I have documents attached to users that I need to search by date. I have two tables: one, "Random", optimized for random access, and one, "Indexes", optimized for sequential access by scanners.


'Random' stores document details:
Random:

doc_1-> Title:"some title1",Text:"some longertext1",user:1,CreateDate:2010-01-01
doc_2-> Title:"some title2",Text:"some longertext2",user:1,CreateDate:2010-01-02

....

'Indexes' stores document indexes (for instance, here is an index on date and on date+user):

date_20100101:id:1
date_20100102:id:2
...
date_user1_20100101:id:1
date_user1_20100102:id:2
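
For illustration, here is a minimal sketch of how such index keys could be built and range-scanned. It does not use the HBase API: a sorted in-memory map stands in for the 'Indexes' table, and the names IndexKeySketch, dateKey, userDateKey and scan are hypothetical. It also assumes eight-digit yyyymmdd dates, so that keys sort chronologically as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class IndexKeySketch {
    // Stand-in for the 'Indexes' table: row key -> document id, kept sorted by key.
    // (Assumption: a real table would be an HBase table, not an in-memory map.)
    static final TreeMap<String, String> INDEXES = new TreeMap<>();

    // Key for the index on date only, e.g. date_20100101
    static String dateKey(String yyyymmdd) {
        return "date_" + yyyymmdd;
    }

    // Key for the index on user + date, e.g. date_user1_20100101
    static String userDateKey(int user, String yyyymmdd) {
        return "date_user" + user + "_" + yyyymmdd;
    }

    // Emulates a scanner over [startKey, stopKey): returns ids in key order.
    static List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(INDEXES.subMap(startKey, stopKey).values());
    }

    public static void main(String[] args) {
        INDEXES.put(dateKey("20100101"), "1");
        INDEXES.put(dateKey("20100102"), "2");
        INDEXES.put(userDateKey(1, "20100101"), "1");
        INDEXES.put(userDateKey(1, "20100102"), "2");

        // All documents of user 1 created on 2010-01-01 or 2010-01-02
        System.out.println(scan(userDateKey(1, "20100101"), userDateKey(1, "20100103")));
        // prints [1, 2]
    }
}
```

Because the ids come back in key order, consecutive ids from one scan tend to point at neighbouring rows of the 'Random' table when its keys are not randomized.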

As a user typically adds many documents in a short period of time, the documents returned by the scanner are usually also in the same order in the Random table (without randomization). So, once I get the IDs of the documents from the scanner query, I need to fork concurrent threads/processes to get the document details: that (from what I understand) would create a key hotspot in the 'Random' table. Is my reasoning above correct? My feeling is that a typical hbase application alternates between scanner and random access patterns.

Another question I have, until I test this, is how many random searches hbase will stand. The scanner will present links to the documents (paging implementation), so I am not sure what a realistic number of documents per page could be: 10, 20 or 100? As (at least) one new socket (is that true?) is created for each random access request, I am afraid such a design could bring the hbase layer down (until maybe http://issues.apache.org/jira/browse/HBASE-1845 is fixed).
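
One way to keep the number of simultaneous random reads under control, whatever the page size, is to run the per-document lookups through a fixed-size thread pool. The sketch below is only an illustration under that assumption: PageFetchSketch and fetchPage are hypothetical names, and the lookup function is a placeholder for the real read against the 'Random' table, so nothing here reflects the actual HBase client API or its socket handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PageFetchSketch {
    // Fetches details for one page of ids with at most maxThreads lookups in flight.
    // 'lookup' is a placeholder for the real random read of one document.
    static List<String> fetchPage(List<String> ids, int maxThreads,
                                  Function<String, String> lookup) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String id : ids) {
                Callable<String> task = () -> lookup.apply(id);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                try {
                    results.add(f.get()); // results keep the order of 'ids'
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("doc_1", "doc_2", "doc_3");
        // Fake lookup used only for the demonstration.
        System.out.println(fetchPage(page, 2, id -> "details of " + id));
    }
}
```

With this shape, a page of 100 documents never issues more than maxThreads reads at once, so the page size and the load on the store can be tuned separately once measurements are available.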


Thanks
TuX



Alex Baranov wrote:

Hello Tux,

Accessing a table in a "random access" manner is not the reason for
randomizing keys. You will likely need to randomize your keys only for
better performance when importing an existing large dataset into HBase.
Otherwise, if you don't have an insertion rate higher than 20K records/sec, I
wouldn't suggest worrying about this issue. It would be great if you
could tell us more about your use case.

MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash) are
all mechanisms you might consider.
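
As a rough sketch of the MD5 option: the digest itself is 16 raw bytes, and the [0-9A-F] string is only its hexadecimal rendering. Since HBase row keys are byte arrays, the raw bytes could be used directly. The class and method names below (HashedKeySketch, rowKey) and the choice of a 4-byte prefix followed by the original id are assumptions made for illustration, not an established scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedKeySketch {
    // Returns the raw 16-byte MD5 digest of the id.
    static byte[] md5(String id) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical row key: the first prefixLen digest bytes, then the id bytes.
    // Keeping the id in the key lets a reader recover it from the row key.
    static byte[] rowKey(String id, int prefixLen) {
        byte[] digest = md5(id);
        byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[prefixLen + idBytes.length];
        System.arraycopy(digest, 0, key, 0, prefixLen);
        System.arraycopy(idBytes, 0, key, prefixLen, idBytes.length);
        return key;
    }

    // Hex rendering, for display only.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sequential ids get unrelated prefixes, so they no longer sort together.
        System.out.println(hex(rowKey("doc_1", 4)));
        System.out.println(hex(rowKey("doc_2", 4)));
    }
}
```

The trade-off is that hashed keys give up the original ordering, so a table keyed this way can no longer be range-scanned by document id.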

Alex Baranau

sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tuxrace...@gmail.com> wrote:

Hello List,

I'll be accessing a table mainly in random access and I am looking for an
efficient way of randomizing the keys.
I thought about an MD5 hash of the ID of the record, but as MD5 returns a
string of chars [0-9A-F] I was wondering if there was a better method to
use.

Thanks
TuX
