J-D,

If I understand correctly, with the "timestamp as the first component of the row key" approach there will still be write-performance issues, because all inserted rows will have lexicographically ordered keys. Even if we append a randomly generated part to the key, it will not affect the overall ordering; so once a region grows large enough it will be split in two, and all the write load for new rows will again fall on a single region, because the row keys are sorted.
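To illustrate what I mean, here is a tiny plain-Java sketch (the key layout is hypothetical: an 8-byte big-endian timestamp followed by a random suffix) showing that the random part never changes the ordering when the timestamp comes first:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;

public class KeyOrderDemo {
    // Hypothetical layout: 8-byte big-endian timestamp, then a random suffix.
    static byte[] key(long ts, byte[] suffix) {
        return ByteBuffer.allocate(8 + suffix.length).putLong(ts).put(suffix).array();
    }

    public static void main(String[] args) {
        Random r = new Random();
        byte[] s1 = new byte[4], s2 = new byte[4];
        r.nextBytes(s1);
        r.nextBytes(s2);
        byte[] older = key(1_000L, s1);
        byte[] newer = key(2_000L, s2);
        // Unsigned lexicographic compare, as HBase does for row keys:
        // the timestamp prefix decides the order before the random suffix
        // is even examined, so every new row still sorts after all existing
        // rows and lands in the "last" region.
        System.out.println(Arrays.compareUnsigned(older, newer) < 0 ? "OK" : "FAIL");
    }
}
```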
Am I correct? If so, there are actually *only two* options here (assuming we use a single HBase table):

1. Make row keys random to *distribute the write load between the nodes* when importing/inserting new data.
2. Make row keys start with an inverted timestamp to *speed up selecting only new data to process* (using setStartRow and scanning from there).

Besides the benefits mentioned, each option has a drawback: the first gives worse performance when selecting new data to process (though this can be done with Scan.setTimeRange), while the second gives worse import/insert performance. Having said that, I'd conclude there is no "perfect" solution here, and one needs to choose between the options based on system behaviour (test results).

Thanks,
Alex Baranau
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Sat, Mar 6, 2010 at 7:06 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> What Ryan said plus see inline.
>
> J-D
>
> On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
> > Hmmmm.... no. I'm wondering if there is a way to first *jump* to a row
> > with a given key and then scan to the end from there.
> > For example, imagine keys:
> > ...
> > 777
> > 444
> > 222
> > 666
> >
> > And imagine that some job went through these rows. It got to the last
> > row, row with key 666. This key 666 got stored somewhere as "this is the
> > last key we saw".
> > After that happens, some more rows get added, so now we have this:
> > ...
> > 777
> > 444
> > 222
> > 666 <=== last seen
> > 333
> > 999
> > 888
>
> But how would you do that since the row keys are ordered
> lexicographically? What you described is easily doable if using a
> timestamp as the first component of the row key + using
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/Scan.html#setStartRow(byte[])
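P.S. For completeness, the two key designs above could be sketched like this in plain Java (the salt bucket count and key layouts are my own hypothetical choices, not anything HBase prescribes):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RowKeyOptions {
    // Option 1 (hypothetical): prefix the natural key with a hash-derived
    // salt byte so consecutive inserts spread across up to `buckets` regions.
    static byte[] saltedKey(byte[] naturalKey, int buckets) {
        byte salt = (byte) Math.floorMod(Arrays.hashCode(naturalKey), buckets);
        return ByteBuffer.allocate(1 + naturalKey.length)
                .put(salt).put(naturalKey).array();
    }

    // Option 2 (hypothetical): inverted timestamp, so newer rows sort first
    // and a scan started at the newest-seen key reads only new rows.
    static byte[] invertedTsKey(long ts) {
        return ByteBuffer.allocate(8).putLong(Long.MAX_VALUE - ts).array();
    }

    public static void main(String[] args) {
        byte[] older = invertedTsKey(1_000L);
        byte[] newer = invertedTsKey(2_000L);
        // With inversion, the newer row sorts *before* the older one
        // (unsigned lexicographic compare, as HBase uses for row keys).
        boolean newerFirst = Arrays.compareUnsigned(newer, older) < 0;
        // Salting demo: the key is just the original key with one salt byte in front.
        boolean salted = saltedKey("row1".getBytes(), 16).length == 5;
        System.out.println(newerFirst && salted ? "OK" : "FAIL");
    }
}
```

With option 2 one would pass the last-seen inverted key to Scan.setStartRow; with option 1 a reader has to fan out one scan per salt bucket, which is exactly the "worse speed for selecting new data" trade-off mentioned above.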