J-D,

If I understand things correctly, in the case of the "timestamp as the
first component of the row key" approach there will still be issues with
write performance, because all inserted rows will have lexicographically
ordered keys. I mean, even if we add some randomly generated part to the
key, this will not affect the overall order, and thus, after a region grows
enough it will be split in two, and again all the load of writing new rows
will fall on a single region, because row keys are sorted.

Am I correct?
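To illustrate the point above, here is a minimal sketch (plain JDK, with a
hypothetical fixed-width key format I made up for the example): as long as
the timestamp is the leading, fixed-width component, a random suffix only
breaks ties, so lexicographic order still equals insertion order and every
new key is the largest seen so far, landing in the last region.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KeyOrderDemo {
    public static void main(String[] args) {
        // Hypothetical keys: zero-padded timestamp prefix + random suffix.
        List<String> keys = new ArrayList<>();
        Random rnd = new Random(42);
        long t = 1267862400L; // some epoch-seconds starting point
        for (int i = 0; i < 5; i++) {
            keys.add(String.format("%010d-%04d", t + i, rnd.nextInt(10000)));
        }
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        // Lexicographic order equals insertion order: the random suffix
        // never outweighs the leading timestamp, so writes stay "hot".
        System.out.println(keys.equals(sorted)); // prints "true"
    }
}
```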


So, there are actually *only two* options here (in case we are using a
single HBase table):

1. Make row keys random to *distribute load between the nodes* when
importing/inserting new data.
2. Make row keys represent an inverted timestamp to *improve speed of
selecting only new data to process* (using setStartRow and scanning from
it).

In addition to the mentioned benefits, each option has a drawback: the
first one will make selecting new data to process slower (it can still be
done using Scan.setTimeRange), while the second one will hurt import/insert
performance.

Having said that, I'd conclude that there is no "perfect" solution here and
one needs to choose between the options based on system behaviour (test
results).
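The two key designs above can be sketched like this (plain JDK only; the
key formats and suffixes are assumptions for illustration, real HBase code
would build byte[] keys with org.apache.hadoop.hbase.util.Bytes):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class RowKeySketch {
    // Option 1: random key -> writes spread across regions, but there is
    // no cheap "scan only new rows" pattern (fall back to setTimeRange).
    static byte[] randomKey() {
        return UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
    }

    // Option 2: inverted timestamp prefix -> newest rows sort first, so a
    // scan with setStartRow picks up only recent data, but all writes hit
    // the region holding the smallest keys.
    static byte[] invertedTimestampKey(long tsMillis, String uniqueSuffix) {
        // Long.MAX_VALUE - ts, zero-padded to 19 digits so keys compare
        // correctly as strings.
        String inverted = String.format("%019d", Long.MAX_VALUE - tsMillis);
        return (inverted + "-" + uniqueSuffix).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] newer = invertedTimestampKey(2000L, "a");
        byte[] older = invertedTimestampKey(1000L, "a");
        // Newer rows get lexicographically smaller keys, so they come
        // first in scan order.
        String n = new String(newer, StandardCharsets.UTF_8);
        String o = new String(older, StandardCharsets.UTF_8);
        System.out.println(n.compareTo(o) < 0); // prints "true"
    }
}
```

This is only a sketch of the trade-off, not a recommendation: whichever
part of the key comes first determines both the write distribution and the
scan pattern, which is exactly why the two goals conflict.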

Thanks,

Alex Baranau

http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Sat, Mar 6, 2010 at 7:06 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> What Ryan said plus see inline.
>
> J-D
>
> On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
> > Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row
> with a given key and then scan to the end from there.
> > For example, imagine keys:
> > ...
> > 777
> > 444
> > 222
> > 666
> >
> > And imagine that some job went through these rows.  It got to the last
> row, row with key 666.  This key 666 got stored somewhere as "this is the
> last key we saw".
> > After that happens, some more rows get added, so now we have this:
> > ...
> > 777
> > 444
> > 222
> > 666  <=== last seen
> > 333
> > 999
> > 888
>
> But how would you do that since the row keys are ordered
> lexicographically? What you described is easily doable if using a
> timestamp as the first component of the row key + using
>
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/Scan.html#setStartRow(byte[])
>