Alex, I'd say it really depends on your use case... the number of rows inserted per day, your access patterns, etc. A single machine can easily sustain a steady insert rate of a couple of thousand rows per second, so if your inserts are a bit randomized (so that your row key isn't _only_ a ts), then you may actually hit a couple of machines. If you use HBase for other stuff along with that table, it may prove to be a good solution.
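To illustrate what I mean by randomizing, here's a rough sketch (the table name, column family, and bucket count below are made up for the example, not from this thread):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedEventWriter {
  // Hypothetical bucket count; pick something on the order of your region count.
  private static final int NUM_BUCKETS = 8;

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events"); // hypothetical table

    long ts = System.currentTimeMillis();
    String sourceId = "sensor-42"; // hypothetical record id

    // Deterministic "salt" derived from the record id, so rows with
    // consecutive timestamps land in NUM_BUCKETS distinct key ranges
    // (and therefore, once the table is split, distinct regions).
    int bucket = (sourceId.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
    byte[] rowKey = Bytes.toBytes(String.format("%02d-%013d-%s", bucket, ts, sourceId));

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("data"), Bytes.toBytes("value"), Bytes.toBytes("..."));
    table.put(put);
    table.flushCommits();
  }
}

The trade-off is on the read side: a time-range scan now has to be issued once per bucket, which is essentially the drawback Alex describes below.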
J-D

On Tue, Mar 9, 2010 at 1:52 AM, Alex Baranov <alex.barano...@gmail.com> wrote:
> J-D,
>
> If I understand things correctly, in the case of the "timestamp as the
> first component of the row key" approach there will still be issues with
> write performance, because all inserted rows will have lexicographically
> ordered keys. I mean, even if we add some randomly generated part to the
> key, this will not affect the overall order, and thus, after the region
> grows enough, it will be split in two, and again all the load (of writing
> new rows) will fall on a single region because row keys are sorted.
>
> Am I correct?
>
> So, there are actually *only two* options here (in case we are using a
> single HBase table):
>
> 1. Make row keys random to *distribute load between the nodes* when
> importing/inserting new data.
> 2. Make row keys represent an inverted timestamp to *improve the speed
> of selecting only new data to process* (using setStartRow and scanning
> from it).
>
> In addition to the mentioned benefits, each option has drawbacks: the
> first will cause worse speed when selecting new data to process (though
> this can still be done using Scan.setTimeRange), the second will cause
> worse import/insert performance.
>
> Having said that, I'd conclude that there is no "perfect" solution here,
> and one needs to choose between the options based on system behaviour
> (test results).
>
> Thanks,
>
> Alex Baranau
>
> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>
> On Sat, Mar 6, 2010 at 7:06 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
>> What Ryan said, plus see inline.
>>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>> > Hmmmm.... no. I'm wondering if there is a way to first *jump* to a row
>> > with a given key and then scan to the end from there.
>> > For example, imagine keys:
>> > ...
>> > 777
>> > 444
>> > 222
>> > 666
>> >
>> > And imagine that some job went through these rows. It got to the last
>> > row, the row with key 666. This key 666 got stored somewhere as "this
>> > is the last key we saw".
>> > After that happens, some more rows get added, so now we have this:
>> > ...
>> > 777
>> > 444
>> > 222
>> > 666 <=== last seen
>> > 333
>> > 999
>> > 888
>>
>> But how would you do that, since the row keys are ordered
>> lexicographically? What you described is easily doable if using a
>> timestamp as the first component of the row key + using
>>
>> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/Scan.html#setStartRow(byte[])
>>
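To make the setStartRow approach above concrete, a minimal sketch (the table name and checkpoint value are made up), assuming the row key begins with the fixed-width 8-byte timestamp that Bytes.toBytes(long) produces, so that lexicographic order matches time order:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanFromCheckpoint {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events"); // hypothetical table

    // Checkpoint stored by the previous run (hypothetical value).
    long lastSeenTs = 1267833600000L;

    // Jump straight to the first row at or after the checkpoint and
    // scan forward from there to pick up only the new rows.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(lastSeenTs));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        System.out.println(Bytes.toString(result.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}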