I think you might want to use Scan.setTimeRange, which can be used
to get only the 'new' rows.
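
Roughly like this -- just a sketch, not tested; the "events" table name and
reading the last run's timestamp from args[0] are placeholders:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class NewRowsScan {
    public static void main(String[] args) throws Exception {
      // Placeholder: timestamp recorded at the end of the previous run.
      long lastRunTs = Long.parseLong(args[0]);

      HTable table = new HTable(new HBaseConfiguration(), "events");
      Scan scan = new Scan();
      // Only cells written at or after lastRunTs are returned
      // (minStamp is inclusive, maxStamp is exclusive).
      scan.setTimeRange(lastRunTs, Long.MAX_VALUE);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process only the 'new' rows here
        }
      } finally {
        scanner.close();
      }
    }
  }

Note this keys off the timestamp HBase assigns (or you assign) to each cell
at write time, not a timestamp you keep in a column, so there's nothing extra
to store in the rows themselves.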

-ryan

On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Hi J-D,
>
>
> ----- Original Message ----
>> From: Jean-Daniel Cryans <jdcry...@apache.org>
>> To: hbase-user@hadoop.apache.org
>> Sent: Fri, March 5, 2010 5:38:03 PM
>> Subject: Re: Jumping to row and scan forward?
>>
>> Otis,
>>
>> What you're basically saying is: is there a way to sequentially scan
>> random row keys?
>
>
> Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row with 
> a given key and then scan to the end from there.
> For example, imagine keys:
> ...
> 777
> 444
> 222
> 666
>
> And imagine that some job went through these rows.  It got to the last row,
> the row with key 666.  Key 666 got stored somewhere as "this is the last key
> we saw".
> After that happens, some more rows get added, so now we have this:
> ...
> 777
> 444
> 222
> 666  <=== last seen
> 333
> 999
> 888
>
> Then, 15 minutes later, the job starts again and wants to process only the 
> new data.  That is, only the rows after the row with key 666.
> So how can we do that efficiently?
> Can we say "jump to key=666 and then scan from there forward"?
> Or do we have to start from the very beginning of the table every time,
> scanning and ignoring rows until we hit row 666, and processing only the
> rows after it?
>
> My "worry" is that we have to start from the beginning every time and filter 
> many-many-many rows,
> so I'm wondering if jumping directly to a specific key and then doing a scan 
> from there is possible.
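>
> In API terms, what I have in mind is roughly this (just a sketch, using
> org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util.Bytes;
> the "mytable" name is made up):
>
>   HTable table = new HTable(new HBaseConfiguration(), "mytable");
>   // Start the scan at the last-seen key.  The scanner walks rows in
>   // key order (lexicographic), not insertion order.
>   Scan scan = new Scan(Bytes.toBytes("666"));
>   ResultScanner scanner = table.getScanner(scan);
>   for (Result r : scanner) {
>     // rows with keys >= "666"
>   }
>   scanner.close();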
>
>
>> I can't think of an awesome answer... sequential inserts could make
>> sense depending on how much data you have to write per day; there's
>> stuff that can be optimized to make that work better. Also, you could
>> write the data to 2 tables and only process the second one, which
>> you clear afterwards (maybe actually keep 2 tables just for that, since
>> while you process one you want to write to the other).
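>>
>> Something like this for the "clear it afterwards" part (sketch only; the
>> table name "events_new" and family "data" are made up):
>>
>>   HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
>>   // once the job has processed the "new data" table, wipe it out:
>>   admin.disableTable("events_new");
>>   admin.deleteTable("events_new");
>>   HTableDescriptor desc = new HTableDescriptor("events_new");
>>   desc.addFamily(new HColumnDescriptor("data"));
>>   admin.createTable(desc);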
>
>
> Yeah, I was thinking something with multiple tables (one big/archive one and 
> another small one for new data) might work, but if we can jump to a specific 
> key and then scan, that is even better.
>
> Thanks,
> Otis
>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
>> wrote:
>> > Hi,
>> >
>> > I need to process (with a MR job) data stored in HBase.  The data is added to
>> > HBase incrementally (and stored in there forever), so I'd like this MR job to
>> > process only the new data every time it runs.  The row keys are not timestamps
>> > (because we know what that does to the performance of bulk puts), but rather
>> > random identifiers.  To process only the new data each time the MR job runs,
>> > the *timestamp* (stored in one of the columns in each row) is stored elsewhere
>> > as "timestamp of the last processed/seen row", and the MR job uses a
>> > server-side filter to zip through all previously processed rows by filtering
>> > (skipping) rows where ts < stored ts.
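>> >
>> > Concretely, I mean something along these lines -- just a sketch: the
>> > "meta"/"ts" family and qualifier are made up, SingleColumnValueFilter is
>> > only one way to express it, and it assumes the column holds Bytes.toBytes(long):
>> >
>> >   long lastSeenTs = 1267833600000L;  // placeholder: loaded from wherever we keep it
>> >   Scan scan = new Scan();
>> >   SingleColumnValueFilter f = new SingleColumnValueFilter(
>> >       Bytes.toBytes("meta"), Bytes.toBytes("ts"),
>> >       CompareFilter.CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(lastSeenTs));
>> >   f.setFilterIfMissing(true);  // drop rows that don't have the ts column at all
>> >   scan.setFilter(f);
>> >   // The binary compare works because epoch millis are non-negative, so the
>> >   // big-endian long bytes sort the same as the numbers.  But the filter only
>> >   // decides what comes back -- the region servers still read every row,
>> >   // which is the part I'd like to avoid.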
>> >
>> > Jean-Daniel Cryans suggested this 2-3 months ago here:
>> >
>> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe...@mail.gmail.com
>> >
>> > I say "zip", but this still means going through millions and millions and
>> > hundreds of millions of rows.
>> >
>> > Is there *anything* in HBase that would allow one to skip/jump to (or near!)
>> > the "last processed/seen row" and scan from there on, instead of always
>> > having to scan from the very beginning?
>> >
>> > Thanks,
>> > Otis
>> > ----
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Hadoop ecosystem search :: http://search-hadoop.com/
>> >
>> >
>
>
