Alex,

I'd say it really depends on your use case... the number of rows
inserted per day, your access patterns, etc. A single machine can
easily sustain a steady insert rate of a couple of thousand rows per
second, so if your inserts are a bit randomized (so that your row key
isn't _only_ a ts) then you may actually spread the load over a couple
of machines. If you use HBase for other stuff along with that table,
it may prove to be a good solution.
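
For illustration, here's a minimal sketch of a salted insert against the
0.20-era client API. The "events" table, "data" family and 8-bucket salt
are made-up names for this example, not something from your setup:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedInsert {
  private static final int BUCKETS = 8; // roughly one per region server

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");
    long ts = System.currentTimeMillis();
    String id = "event-42";
    // One leading salt byte keeps the key from being only a ts, so
    // consecutive inserts hit different regions instead of one hot spot.
    byte salt = (byte) ((id.hashCode() & 0x7fffffff) % BUCKETS);
    byte[] rowKey = Bytes.add(new byte[] { salt },
                              Bytes.toBytes(ts),
                              Bytes.toBytes(id));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("data"), Bytes.toBytes("payload"),
            Bytes.toBytes("..."));
    table.put(put);
    table.flushCommits();
  }
}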

J-D

On Tue, Mar 9, 2010 at 1:52 AM, Alex Baranov <alex.barano...@gmail.com> wrote:
> J-D,
>
> If I understand things correctly, with the "timestamp as the first
> component of the row key" approach there will still be write-performance
> issues, because all inserted rows will have lexicographically ordered
> keys. Even if we append some randomly generated part to the key, it
> won't change the overall ordering, so once a region grows enough it
> will be split in two and, again, all the load of writing new rows will
> land on a single region because the row keys are sorted.
>
> Am I correct?
>
>
> So there are actually *only two* options here (assuming we use a single
> HBase table):
>
> 1. Make row keys random to *distribute load between the nodes* when
> importing/inserting new data.
> 2. Make row keys contain an inverted timestamp to *speed up selecting
> only new data to process* (using setStartRow and scanning from there;
> see the sketch after this list).
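>
> For concreteness, a minimal sketch of what the two key layouts could look
> like (the helper names and layout are just illustration, not a proposal):
>
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class RowKeys {
>
>   // Option 1: a hash prefix spreads the write load across regions.
>   static byte[] randomKey(long ts, String id) {
>     byte[] prefix = Bytes.toBytes(id.hashCode()); // pseudo-random spread
>     return Bytes.add(prefix, Bytes.toBytes(ts), Bytes.toBytes(id));
>   }
>
>   // Option 2: an inverted timestamp makes the newest rows sort first, so
>   // a scan from the start of the table reads fresh data immediately and
>   // can stop once it reaches the last key already processed.
>   static byte[] invertedTsKey(long ts, String id) {
>     return Bytes.add(Bytes.toBytes(Long.MAX_VALUE - ts), Bytes.toBytes(id));
>   }
> }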
>
> Each option's benefit comes with a drawback: the first makes selecting
> new data to process slower (it can still be done with Scan.setTimeRange,
> as sketched below), while the second hurts import/insert performance.
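>
> As a sketch of that first drawback (table and family names are made up;
> 0.20-era client API):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
>
> public class ScanNewData {
>   public static void main(String[] args) throws Exception {
>     HTable table = new HTable(new HBaseConfiguration(), "events");
>     long lastRun = System.currentTimeMillis() - 3600 * 1000L; // e.g. 1h ago
>     Scan scan = new Scan();
>     // Only cells written since lastRun come back, but the scan still
>     // touches every region, which is why it is slower than seeking by key.
>     scan.setTimeRange(lastRun, Long.MAX_VALUE);
>     ResultScanner scanner = table.getScanner(scan);
>     for (Result r : scanner) {
>       // process newly written rows here
>     }
>     scanner.close();
>   }
> }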
>
> Having said that, I'd conclude there is no "perfect" solution here and
> one needs to choose between the options based on the system's behaviour
> (test results).
>
> Thanks,
>
> Alex Baranau
>
> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>
> On Sat, Mar 6, 2010 at 7:06 AM, Jean-Daniel Cryans <jdcry...@apache.org>wrote:
>
>> What Ryan said plus see inline.
>>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>> > Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row
>> > with a given key and then scan to the end from there.
>> > For example, imagine keys:
>> > ...
>> > 777
>> > 444
>> > 222
>> > 666
>> >
>> > And imagine that some job went through these rows.  It got to the last
>> > row, row with key 666.  This key 666 got stored somewhere as "this is
>> > the last key we saw".
>> > After that happens, some more rows get added, so now we have this:
>> > ...
>> > 777
>> > 444
>> > 222
>> > 666  <=== last seen
>> > 333
>> > 999
>> > 888
>>
>> But how would you do that since the row keys are ordered
>> lexicographically? What you described is easily doable if using a
>> timestamp as the first component of the row key + using
>>
>> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/Scan.html#setStartRow(byte[])
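>>
>> For example, a minimal sketch (lastSeenTs stands for whatever checkpoint
>> the job stored; assumes an open HTable named table and the usual client
>> imports):
>>
>> Scan scan = new Scan();
>> // With a ts-first key this is a seek straight to the checkpoint,
>> // not a filter over the whole table.
>> scan.setStartRow(Bytes.toBytes(lastSeenTs));
>> ResultScanner scanner = table.getScanner(scan);
>> for (Result r : scanner) {
>>   // rows come back in timestamp order from lastSeenTs onward
>> }
>> scanner.close();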
>>
>
