Otis,

What you're basically asking is: is there a way to sequentially scan
random row keys?

I can't think of an awesome answer... sequential inserts could make
sense depending on how much data you have to write per day, and there
are things you can tune to make that pattern work better.
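
A rough sketch of what I mean, assuming you can live with time-ordered
keys; the "events" table name and the watermark bookkeeping are made
up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementalScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");
    long lastProcessedTs = loadWatermark(); // however you persist it

    Scan scan = new Scan();
    // Row keys are <8-byte big-endian ts><random id>, so starting the
    // scan right past the watermark skips every processed row outright.
    scan.setStartRow(Bytes.toBytes(lastProcessedTs + 1));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // feed the row to your processing / MR logic
      }
    } finally {
      scanner.close();
    }
  }

  private static long loadWatermark() { return 0L; } // stub
}

The catch, as you noted, is that a plain timestamp prefix hammers one
region at a time; the usual mitigation is to salt the prefix into N
buckets and run one such scan per bucket.
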
You could also write the data to two tables and only process the
second one, which you clear afterwards. In practice you'd keep both
tables around permanently and alternate: while you process one, you
write new data to the other.
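
A minimal sketch of that rotation, with made-up table names and a
toggle you'd persist somewhere; on 0.20, "truncating" a table means
disable + delete + recreate:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TableRotation {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // The table writers are NOT currently using; names are hypothetical.
    String toDrain = "incoming_b";

    // 1. Run the MR job over toDrain here: a plain full scan, but only
    //    over one batch of new data instead of the whole history.

    // 2. Empty it: disable, drop, recreate.
    admin.disableTable(toDrain);
    admin.deleteTable(toDrain);
    HTableDescriptor desc = new HTableDescriptor(toDrain);
    desc.addFamily(new HColumnDescriptor("data")); // hypothetical family
    admin.createTable(desc);

    // 3. Flip the persisted toggle so the next batch of writes lands in
    //    toDrain and the other table gets processed next time around.
  }
}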

J-D

On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Hi,
>
> I need to process (with a MR job) data stored in HBase.  The data is added to 
> HBase incrementally (and stored in there forever) and so I'd like this MR job 
> to process only the new data every time it runs.  The row keys are not 
> timestamps (because we know what this does to performance of bulk puts), but 
> rather random identifiers.  To process only the new data each time the MR job 
> runs, the *timestamp* (stored in one of the columns in each row) is stored 
> elsewhere as "timestamp of the last processed/seen row" and the MR job uses a 
> server-side filter to zip through all previously processed rows by filtering 
> (skipping) rows where ts < stored ts.
>
> Jean-Daniel Cryans suggested this 2-3 months ago here:
> http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe...@mail.gmail.com
>
> I say "zip", but this still means going through millions, and eventually 
> hundreds of millions, of rows.
>
> Is there *anything* in HBase that would allow one to skip/jump to (or near!) 
> the "last processed/seen row" and scan from there on, instead of always 
> having to scan from the very beginning?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
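
For reference, the filter-based scan described above might look
roughly like this with the 0.20 client API; the "meta" family and "ts"
qualifier are hypothetical:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class NewRowsScan {
  // Build a Scan that only returns rows whose stored timestamp is
  // newer than the watermark; hand it to TableMapReduceUtil for the MR job.
  public static Scan newerThan(long lastProcessedTs) {
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("meta"),          // family holding the timestamp
        Bytes.toBytes("ts"),            // qualifier
        CompareOp.GREATER,              // keep rows where ts > watermark
        Bytes.toBytes(lastProcessedTs));
    filter.setFilterIfMissing(true);    // drop rows without a ts column
    Scan scan = new Scan();
    scan.setFilter(filter);
    return scan;
  }
}

Note that every row is still read on the region server; the filter
only keeps skipped rows from going back over the wire, which is
exactly why jumping ahead by key would be so much cheaper.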
