Hi.

I suggest you build an index with two columns: nextFetchDate, rowKey. Only update the index with the newly fetched items and optimize every night or so.
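To illustrate the idea (a rough, self-contained sketch only -- a TreeMap stands in for the sorted index table, and the "<nextFetchDate>:<rowKey>" key layout is my assumption about how the two columns would be combined, not actual HBase client code):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class FetchIndexSketch {
    public static void main(String[] args) {
        // Stand-in for the index table: sorted by key, like an HBase table.
        // Key layout: "<nextFetchDate>:<rowKey>" so a scan runs in due-date order.
        SortedMap<String, String> index = new TreeMap<>();
        index.put("2009-07-01:com.example.www/a", "crawl");
        index.put("2009-07-10:com.example.www/b", "crawl");
        index.put("2009-07-20:org.apache.hadoop/c", "crawl");

        // Everything due on or before "today" is one range scan up to a cutoff.
        // ';' sorts just after ':', so this includes all keys dated <= today.
        String today = "2009-07-10";
        SortedMap<String, String> due = index.headMap(today + ";");
        for (String key : due.keySet()) {
            System.out.println("due: " + key);
        }
    }
}
```

A scanner over such an index only ever has to read the head of the table, which is why updating it with just the newly fetched items stays cheap.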
If I am not totally incorrect, I think these days you have some index structure within HBase already? Which means you might not need Lucene.

Cheers
//Marcus

On Sun, Jul 5, 2009 at 11:26 PM, stack <[email protected]> wrote:

> On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 <[email protected]> wrote:
>
> > Hi All,
> >
> > I am developing a schema that will be used for crawling.
>
> Out of interest, what crawler are you using?
>
> > Now, here's the dilemma I have... When I create a MapReduce job to go
> > through each row in the above I want to schedule the url to be recrawled
> > again at some date in the future. For example,
> >
> > // Simple pseudocode
> > Map( row, rowResult )
> > {
> >   BatchUpdate update = new BatchUpdate( row.get() );
> >   update.put( "contents:content", downloadPage( pageUrl ) );
> >   update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> > }
>
> So you want to write a new row with a nextFetchDate prefix?
>
> FYI, have you seen
> http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
> ?
>
> (You might also find http://sourceforge.net/projects/publicsuffix/ useful.)
>
> > 1) Does HBase allow you to update the key for a row? Are HBase row keys
> > immutable?
>
> Yes, row keys are immutable. If you 'update' a row key, changing it, you
> will create a new row.
>
> > 2) If I can't update a key, what's the easiest way to copy a row and
> > assign it a different key?
>
> Get all of the row and then put it all with the new key (Billy Pearson's
> suggestion would be the way to go I'd suggest -- keeping a column with a
> timestamp in it or using hbase versions -- in TRUNK you can ask for data
> within a timerange. Running a scanner asking for rows > some timestamp
> should be fast).
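Since row keys are immutable, the "get all of the row, put it under the new key" move described above could look roughly like this (a self-contained sketch with a plain Map standing in for the table -- not the actual HBase BatchUpdate API, and the key values are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RowKeyMoveSketch {
    public static void main(String[] args) {
        // Stand-in for a table: row key -> (column -> value). Row keys are
        // immutable, so "changing" a key means: copy to new key, delete old.
        Map<String, Map<String, String>> table = new HashMap<>();
        Map<String, String> row = new HashMap<>();
        row.put("contents:content", "<html>...</html>");
        table.put("20090704:com.example.www/page", row);

        String oldKey = "20090704:com.example.www/page";
        String newKey = "20090711:com.example.www/page"; // new nextFetchDate prefix

        // Get all of the old row, put it all with the new key, then delete.
        Map<String, String> copied = new HashMap<>(table.get(oldKey));
        table.put(newKey, copied);
        table.remove(oldKey);

        System.out.println(table.keySet()); // row now lives only under newKey
    }
}
```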
> > 3) What are the implications for updating/deleting from a table that
> > you are currently scanning as part of the mapReduce job?
>
> Scanners return the state of the row at the time they trip over it.
>
> > It seems to me that I may want to do a map and a reduce and during the
> > map phase I would record the rows that I fetched while in the reduce
> > phase I would then take those rows, re-add them with the nextFetchDate
> > and then remove the old row.
>
> Do you have to remove old data? You could let it age out or be removed
> when the number of versions of pages is > the configured maximum.
>
> > I would probably want to do this process in phases (e.g. scan only 5,000
> > rows at a time) so that if my Mapper died for any particular reason I
> > could address the issue and in the worst case only have lost the work
> > that I had done on 5,000 rows.
>
> You could keep an already-seen list in another hbase table and just rerun
> the whole job if the first job failed. Check the already-seen table before
> crawling a page to see if you'd crawled it recently or not?
>
> St.Ack

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
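The already-seen check St.Ack suggests could be sketched like this (plain Java collections standing in for the second HBase table; the URL names and the 7-day recrawl window are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class AlreadySeenSketch {
    public static void main(String[] args) {
        // Stand-in for the already-seen table: url -> last crawl (epoch days).
        Map<String, Integer> alreadySeen = new HashMap<>();
        int today = 14430;
        alreadySeen.put("http://www.example.com/a", today - 2);  // crawled recently
        alreadySeen.put("http://www.example.com/b", today - 30); // stale

        String[] candidates = {
            "http://www.example.com/a",
            "http://www.example.com/b",
            "http://www.example.com/c",
        };
        int recrawlAfterDays = 7;
        for (String url : candidates) {
            Integer last = alreadySeen.get(url);
            // Skip pages crawled within the window; on a job rerun after a
            // failure, already-processed pages fall through here harmlessly.
            boolean skip = last != null && today - last < recrawlAfterDays;
            if (!skip) {
                System.out.println("crawl: " + url);
                alreadySeen.put(url, today); // record so a rerun skips it
            }
        }
    }
}
```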
