I think you might want to use Scan.setTimeRange, which lets you restrict a scan to only the 'new' cells (those written since a given timestamp).
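Something like this, as a rough, untested sketch: the table name "mytable", the family "mycf", and the way lastRunTs is persisted are made up for illustration, and it assumes the cell timestamps are the default insert-time timestamps.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

long lastRunTs = 1267833600000L;  // whatever you saved at the end of the previous run
HTable table = new HTable(new HBaseConfiguration(), "mytable");

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("mycf"));
// min is inclusive, max is exclusive; Long.MAX_VALUE means "up to now"
scan.setTimeRange(lastRunTs, Long.MAX_VALUE);

ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // only rows with at least one cell written at ts >= lastRunTs come back
}
scanner.close();

For the MR case I believe you can hand the same Scan to TableMapReduceUtil.initTableMapperJob(...), so the time range is applied on the region servers during the scan rather than filtering client-side.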
-ryan

On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Hi J-D,
>
>
> ----- Original Message ----
>> From: Jean-Daniel Cryans <jdcry...@apache.org>
>> To: hbase-user@hadoop.apache.org
>> Sent: Fri, March 5, 2010 5:38:03 PM
>> Subject: Re: Jumping to row and scan forward?
>>
>> Otis,
>>
>> What you're basically saying is: is there a way to sequentially scan
>> random row keys?
>
>
> Hmmmm.... no. I'm wondering if there is a way to first *jump* to a row with
> a given key and then scan to the end from there.
> For example, imagine keys:
> ...
> 777
> 444
> 222
> 666
>
> And imagine that some job went through these rows. It got to the last row,
> the row with key 666. This key 666 got stored somewhere as "this is the last
> key we saw".
> After that happens, some more rows get added, so now we have this:
> ...
> 777
> 444
> 222
> 666 <=== last seen
> 333
> 999
> 888
>
> Then, 15 minutes later, the job starts again and wants to process only the
> new data. That is, only rows after the row with key 666.
> So how can we do that efficiently?
> Can we say "jump to key=666 and then scan from there forward"?
> Or do we have to start from the very beginning of the table every time,
> looking for the row with key 666, ignoring all rows until we find this row
> 666, and processing only rows after 666?
>
> My "worry" is that we have to start from the beginning every time and filter
> many-many-many rows, so I'm wondering if jumping directly to a specific key
> and then doing a scan from there is possible.
>
>
>> I can't think of an awesome answer... sequential insert could make
>> sense depending on how much data you have to write per day, there's
>> stuff that can be optimized to make it work better. Also you could
>> write the data to 2 tables and only process the second one... which
>> you clear afterwards (maybe actually keep 2 tables just for that since
>> while you process one you want to write to the other).
>
>
> Yeah, I was thinking something with multiple tables (one big/archive one and
> another small one for new data) might work, but if we can jump to a specific
> key and then scan, that is even better.
>
> Thanks,
> Otis
>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
>> wrote:
>> > Hi,
>> >
>> > I need to process (with an MR job) data stored in HBase. The data is
>> > added to HBase incrementally (and stored in there forever), so I'd like
>> > this MR job to process only the new data every time it runs. The row
>> > keys are not timestamps (because we know what this does to the
>> > performance of bulk puts), but rather random identifiers. To process
>> > only the new data each time the MR job runs, the *timestamp* (stored in
>> > one of the columns in each row) is stored elsewhere as the "timestamp of
>> > the last processed/seen row", and the MR job uses a server-side filter
>> > to zip through all previously processed rows by filtering (skipping)
>> > rows where ts < stored ts.
>> >
>> > Jean-Daniel Cryans suggested this 2-3 months ago here:
>> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe...@mail.gmail.com
>> >
>> > I say "zip", but this still means going through millions and millions
>> > and hundreds of millions of rows.
>> >
>> > Is there *anything* in HBase that would allow one to skip/jump to (or
>> > near!) the "last processed/seen row" and scan from there on, instead of
>> > always having to scan from the very beginning?
>> >
>> > Thanks,
>> > Otis
>> > ----
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Hadoop ecosystem search :: http://search-hadoop.com/
>> >
>> >
>