Hi J-D,

----- Original Message -----
> From: Jean-Daniel Cryans <jdcry...@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Fri, March 5, 2010 5:38:03 PM
> Subject: Re: Jumping to row and scan forward?
> 
> Otis,
> 
> What you're basically saying is: is there a way to sequentially scan
> random row keys?


Hmmm... no.  I'm wondering if there is a way to first *jump* to the row with a 
given key and then scan forward from there to the end.
For example, imagine keys:
...
777
444
222
666

And imagine that some job went through these rows.  It got to the last row, the 
one with key 666, and that key got stored somewhere as "this is the last key we 
saw".
After that happens, some more rows get added, so now we have this:
...
777
444
222
666  <=== last seen
333
999
888

Then, 15 minutes later, the job starts again and wants to process only the new 
data, that is, only the rows added after the row with key 666.
So how can we do that efficiently?
Can we say "jump to key=666 and then scan forward from there"?
Or do we have to start from the very beginning of the table every time, ignoring 
all rows until we find row 666, and only then process the rows that follow it?

My "worry" is that we have to start from the beginning every time and filter 
many-many-many rows,
so I'm wondering if jumping directly to a specific key and then doing a scan 
from there is possible.
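
To make the question concrete in client-API terms, here is a rough sketch of 
both the current approach and the one I'm hoping for (just a sketch; the table, 
family, and column names are all invented):

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class IncrementalScan {

  // (a) What we do today: the scanner still walks every row in the table,
  // and the filter just skips (server-side) the rows whose stored timestamp
  // is not newer than the one remembered from the last run.
  static ResultScanner scanWithFilter(HTable table, long lastSeenTs)
      throws IOException {
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("meta"),      // column family (invented name)
        Bytes.toBytes("ts"),        // qualifier holding the row's timestamp
        CompareOp.GREATER,
        Bytes.toBytes(lastSeenTs)));
    return table.getScanner(scan);
  }

  // (b) What I'm hoping for: hand the scan the stored "last seen" key as a
  // start row and read forward, never touching the rows before it.
  // (Whether this can work at all with random, non-sequential keys is
  // exactly what I'm asking.)
  static ResultScanner scanFromLastSeen(HTable table, byte[] lastSeenKey)
      throws IOException {
    return table.getScanner(new Scan(lastSeenKey));
  }
}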


> I can't think of an awesome answer... sequential inserts could make
> sense depending on how much data you have to write per day; there's
> stuff that can be optimized to make it work better. Also, you could
> write the data to 2 tables and only process the second one... which
> you clear afterwards (maybe actually keep 2 tables just for that, since
> while you process one you want to write to the other).


Yeah, I was thinking something with multiple tables (one big archive table and 
a small one for new data) might work, but if we can jump to a specific key and 
then scan, that would be even better.
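
In case it helps, here is roughly how I picture the double-write (again just a 
sketch; all table and column names are invented):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class DualWrite {
  private final HTable archive;  // big table, keeps everything forever
  private final HTable fresh;    // small table, holds only unprocessed rows

  public DualWrite() throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    archive = new HTable(conf, "events_archive");
    fresh = new HTable(conf, "events_new");
  }

  // Every new row is written to both tables; the MR job scans only the
  // small "events_new" table and clears (or swaps) it after each run.
  public void store(byte[] rowKey, byte[] value) throws IOException {
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("data"), Bytes.toBytes("v"), value);
    archive.put(put);
    fresh.put(put);
  }
}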

Thanks,
Otis

> J-D
> 
> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
> wrote:
> > Hi,
> >
> > I need to process (with a MR job) data stored in HBase.  The data is added
> > to HBase incrementally (and stored in there forever) and so I'd like this MR
> > job to process only the new data every time it runs.  The row keys are not
> > timestamps (because we know what this does to performance of bulk puts), but
> > rather random identifiers.  To process only the new data each time the MR
> > job runs, the *timestamp* (stored in one of the columns in each row) is
> > stored elsewhere as "timestamp of the last processed/seen row" and the MR
> > job uses a server-side filter to zip through all previously processed rows
> > by filtering (skipping) rows where ts < stored ts.
> >
> > Jean-Daniel Cryans suggested this 2-3 months ago here:
> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe...@mail.gmail.com
> >
> > I say "zip", but this still means going through millions and millions and
> > hundreds of millions of rows.
> >
> > Is there *anything* in HBase that would allow one to skip/jump to (or near!)
> > the "last processed/seen row" and scan from there on, instead of always
> > having to scan from the very beginning?
> >
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
