Hi everybody,

We have the following scenario:
Our clustered web application needs to write records to HBase, and we need to 
support very high throughput: we expect 10,000-30,000 requests per second, 
possibly even more.

Usually this is not a problem for HBase: with a "random" row key, the data is 
distributed evenly across all region servers.
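For illustration, "random" here means something like this (just a sketch; a 
UUID is only one way to get such a key):

  import java.util.UUID;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  // consecutive writes get keys scattered over the whole key space,
  // so the load is spread across all regions
  byte[] rowKey = Bytes.toBytes(UUID.randomUUID().toString());
  Put put = new Put(rowKey);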
But we need to generate our keys based on the current time, so that we can 
run MR jobs over a given time period without processing the whole data set, 
using
  scan.setStartRow(startRow);
  scan.setStopRow(stopRow);
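where the start and stop rows are derived from the period boundaries, roughly 
like this (a sketch, assuming the row key is the record's epoch-millis 
timestamp serialized as a big-endian long):

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  // rows are sorted lexicographically; big-endian longs keep numeric
  // order, so a time range maps directly to a contiguous key range
  long periodEnd = System.currentTimeMillis();
  long periodStart = periodEnd - 3600 * 1000L;   // e.g. the last hour
  Scan scan = new Scan();
  scan.setStartRow(Bytes.toBytes(periodStart));
  scan.setStopRow(Bytes.toBytes(periodEnd));     // stop row is exclusive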

In our case the generated row keys look very similar and therefore go to the 
same region server, so this approach does not really use the power of the 
whole cluster but only a single server, which can be dangerous under very 
high load.
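To make that concrete: with a key like

  // every record written around the same time starts with nearly
  // identical bytes, so all writes land in the single region that
  // hosts the current key range
  byte[] rowKey = Bytes.toBytes(System.currentTimeMillis());

all writers keep hammering whichever region holds the "newest" keys.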

So we are thinking about writing the records to an HDFS file first, and then 
running an MR job periodically that reads the finished HDFS files and inserts 
the records into HBase.
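Such a job could look roughly like this (only a sketch: the table name 
"records", the column family "d", and the tab-separated file layout are 
made-up placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class HdfsToHBase {

    // map-only job: each input line "<timestampMillis>\t<payload>"
    // becomes one Put keyed by the timestamp
    static class RecordMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        byte[] rowKey = Bytes.toBytes(Long.parseLong(parts[0]));
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                Bytes.toBytes(parts[1]));
        context.write(new ImmutableBytesWritable(rowKey), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "hdfs-to-hbase");
      job.setJarByClass(HdfsToHBase.class);
      job.setMapperClass(RecordMapper.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      // writes the Puts through TableOutputFormat into the target table
      TableMapReduceUtil.initTableReducerJob("records", null, job);
      job.setNumReduceTasks(0);  // map-only, no reduce phase needed
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }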

What do you guys think about this approach? Any suggestions would be much 
appreciated.

regards
andre
