Hi.

I suggest you build an index with two columns: nextFetchDate, rowKey. Only update the index with the newly fetched items and optimize every night or so.
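To illustrate the idea (a rough, self-contained sketch only -- a TreeMap stands in for the sorted index table, and the "<nextFetchDate>:<rowKey>" key layout is my assumption about how the two columns would be combined, not actual HBase client code):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class FetchIndexSketch {
    public static void main(String[] args) {
        // Stand-in for the index table: sorted by key, like an HBase table.
        // Key layout: "<nextFetchDate>:<rowKey>" so a scan runs in due-date order.
        SortedMap<String, String> index = new TreeMap<>();
        index.put("2009-07-01:com.example.www/a", "crawl");
        index.put("2009-07-10:com.example.www/b", "crawl");
        index.put("2009-07-20:org.apache.hadoop/c", "crawl");

        // Everything due on or before "today" is one range scan up to a cutoff.
        // ';' sorts just after ':', so this includes all keys dated <= today.
        String today = "2009-07-10";
        SortedMap<String, String> due = index.headMap(today + ";");
        for (String key : due.keySet()) {
            System.out.println("due: " + key);
        }
    }
}
```

A scanner over such an index only ever has to read the head of the table, which is why updating it with just the newly fetched items stays cheap.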
If I am not totally incorrect, I think these days you have some index structure within HBase already? Which means you might not need Lucene.

Cheers
//Marcus

On Sun, Jul 5, 2009 at 11:26 PM, stack <[email protected]> wrote:

> On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 <[email protected]> wrote:
>
> > Hi All,
> >
> > I am developing a schema that will be used for crawling.
>
> Out of interest, what crawler are you using?
>
> > Now, here's the dilemma I have... When I create a MapReduce job to go
> > through each row in the above I want to schedule the url to be recrawled
> > again at some date in the future. For example,
> >
> > // Simple pseudocode
> > Map( row, rowResult )
> > {
> >   BatchUpdate update = new BatchUpdate( row.get() );
> >   update.put( "contents:content", downloadPage( pageUrl ) );
> >   update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> > }
>
> So you want to write a new row with a nextFetchDate prefix?
>
> FYI, have you seen
> http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
> ?
>
> (You might also find http://sourceforge.net/projects/publicsuffix/ useful.)
>
> > 1) Does HBase allow you to update the key for a row? Are HBase row keys
> > immutable?
>
> Yes, row keys are immutable. If you 'update' a row key, changing it, you
> will create a new row.
>
> > 2) If I can't update a key, what's the easiest way to copy a row and
> > assign it a different key?
>
> Get all of the row and then put it all with the new key (Billy Pearson's
> suggestion would be the way to go I'd suggest -- keeping a column with a
> timestamp in it or using hbase versions -- in TRUNK you can ask for data
> within a timerange. Running a scanner asking for rows > some timestamp
> should be fast).
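Since row keys are immutable, the "get all of the row, put it under the new key" move described above could look roughly like this (a self-contained sketch with a plain Map standing in for the table -- not the actual HBase BatchUpdate API, and the key values are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RowKeyMoveSketch {
    public static void main(String[] args) {
        // Stand-in for a table: row key -> (column -> value). Row keys are
        // immutable, so "changing" a key means: copy to new key, delete old.
        Map<String, Map<String, String>> table = new HashMap<>();
        Map<String, String> row = new HashMap<>();
        row.put("contents:content", "<html>...</html>");
        table.put("20090704:com.example.www/page", row);

        String oldKey = "20090704:com.example.www/page";
        String newKey = "20090711:com.example.www/page"; // new nextFetchDate prefix

        // Get all of the old row, put it all with the new key, then delete.
        Map<String, String> copied = new HashMap<>(table.get(oldKey));
        table.put(newKey, copied);
        table.remove(oldKey);

        System.out.println(table.keySet()); // row now lives only under newKey
    }
}
```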
> > 3) What are the implications for updating/deleting from a table that
> > you are currently scanning as part of the mapReduce job?
>
> Scanners return the state of the row at the time they trip over it.
>
> > It seems to me that I may want to do a map and a reduce and during the
> > map phase I would record the rows that I fetched while in the reduce
> > phase I would then take those rows, re-add them with the nextFetchDate
> > and then remove the old row.
>
> Do you have to remove old data? You could let it age out or be removed
> when the number of versions of pages is > the configured maximum.
>
> > I would probably want to do this process in phases (e.g. scan only 5,000
> > rows at a time) so that if my Mapper died for any particular reason I
> > could address the issue and in the worst case only have lost the work
> > that I had done on 5,000 rows.
>
> You could keep an already-seen list in another hbase table and just rerun
> the whole job if the first job failed. Check the already-seen table before
> crawling a page to see if you'd crawled it recently or not?
>
> St.Ack

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
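The already-seen check St.Ack suggests could be sketched like this (plain Java collections standing in for the second HBase table; the URL names and the 7-day recrawl window are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class AlreadySeenSketch {
    public static void main(String[] args) {
        // Stand-in for the already-seen table: url -> last crawl (epoch days).
        Map<String, Integer> alreadySeen = new HashMap<>();
        int today = 14430;
        alreadySeen.put("http://www.example.com/a", today - 2);  // crawled recently
        alreadySeen.put("http://www.example.com/b", today - 30); // stale

        String[] candidates = {
            "http://www.example.com/a",
            "http://www.example.com/b",
            "http://www.example.com/c",
        };
        int recrawlAfterDays = 7;
        for (String url : candidates) {
            Integer last = alreadySeen.get(url);
            // Skip pages crawled within the window; on a job rerun after a
            // failure, already-processed pages fall through here harmlessly.
            boolean skip = last != null && today - last < recrawlAfterDays;
            if (!skip) {
                System.out.println("crawl: " + url);
                alreadySeen.put(url, today); // record so a rerun skips it
            }
        }
    }
}
```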
