Matt, (I didn't notice your message until this morning...)
I implemented this as well, but I am using Nutch 0.8 so that the generated
Lucene index version matches what our front-end searcher is using. I
implemented it in a manner similar to scoring. The first step was to add a
crawl-depth field to CrawlDatum. When I crawl a page I insert the current
page depth into the content metadata; then in ParseOutputFormat I inspect
that value and either choose not to crawl the links on this page or update
the crawl depth for each CrawlDatum. (A rough sketch of that check is
included after the quoted thread below.)

I am planning to submit this back to the Nutch community as a patch once we
have tested it a bit more. I am really new to the Nutch codebase, so there
may be a better way to do it. Any advice, suggestions or comments would be
awesome.

Cheers,
Vince

On 8/16/07, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> Vince, I have implemented crawl-depth limits in Nutch 0.7. Because
> there was no crawldb metadata support (yet), I had to store crawl-depth
> in a custom db (mapfile) and add a processing step to the crawl cycle.
>
> We had this working for ~30k base URLs and depth limited to 4-20
> links (min 4, extended when our feature detector found relevant
> content).
>
> I have a port of this code to Nutch 0.9, not written by me and not
> stress-tested. If you're interested, I can see if it's possible to
> release the source and adapt it as a patch to the latest Nutch
> codebase.
>
> (AFAIK, this functionality is essential for any vertical search
> engine. Which is to say, any search startup that wants to succeed by
> not attacking Google head-on, IMHO.)
>
> --Matt Kangas
>
>
> On Aug 14, 2007, at 11:47 AM, Vince Filby wrote:
>
> > I have a list of, say, 8 million URLs that I will need to crawl
> > with Nutch, and I will also need to freshen these URLs on a regular
> > basis (I will not be following external links, though). Since I have
> > so many URLs I would like to crawl breadth-first and restrict the
> > depth to, say, 3 or 4 levels. I also want to be able to inject new
> > URLs at any time and have Nutch automagically start crawling to the
> > appropriate depth. In the intranet recrawl script, the depth is
> > represented by a new segment with all the available links from the
> > previous segment. With the large number of pages I will be crawling,
> > I would like to restrict the segment size to something that can be
> > crawled in a few hours so I can constantly maintain a fresh index.
> >
> > How can I control depth with a much larger crawl, especially when
> > there will be brand-new URLs thrown into the mix later on?
> >
> > Any advice on this topic would be greatly appreciated,
> > Vince
>
> --
> Matt Kangas / [EMAIL PROTECTED]
>
>
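P.S. To make the idea concrete, here is a minimal, self-contained sketch of
the depth check described at the top of this mail. The names here
(DEPTH_KEY, OutlinkRecord, filterOutlinks, maxDepth) are hypothetical
placeholders for illustration only, not the actual Nutch 0.8
CrawlDatum/ParseOutputFormat API; it just shows reading the parent page's
depth from its metadata, dropping the outlinks once the limit is reached,
and otherwise tagging each child with depth + 1.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: these classes and names are stand-ins for the
 * real Nutch 0.8 CrawlDatum / ParseOutputFormat code, not the actual API.
 */
public class DepthFilterSketch {

    /** Metadata key under which the parent page's crawl depth is stored. */
    static final String DEPTH_KEY = "crawl.depth";

    /** Minimal stand-in for an outlink's crawl datum. */
    static class OutlinkRecord {
        final String url;
        final Map<String, String> metadata = new HashMap<>();
        OutlinkRecord(String url) { this.url = url; }
    }

    /**
     * Read the parent page's depth from its (content) metadata; if the
     * configured limit has been reached, follow no links at all,
     * otherwise emit each outlink with depth = parent depth + 1.
     */
    static List<OutlinkRecord> filterOutlinks(Map<String, String> parentMeta,
                                              List<String> outlinkUrls,
                                              int maxDepth) {
        int parentDepth = Integer.parseInt(parentMeta.getOrDefault(DEPTH_KEY, "0"));
        List<OutlinkRecord> kept = new ArrayList<>();
        if (parentDepth >= maxDepth) {
            return kept;                      // depth limit reached: drop all outlinks
        }
        for (String url : outlinkUrls) {
            OutlinkRecord datum = new OutlinkRecord(url);
            datum.metadata.put(DEPTH_KEY, Integer.toString(parentDepth + 1));
            kept.add(datum);                  // child inherits depth + 1
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> parentMeta = new HashMap<>();
        parentMeta.put(DEPTH_KEY, "3");       // parent page was crawled at depth 3
        List<OutlinkRecord> out = filterOutlinks(
            parentMeta,
            Arrays.asList("http://example.com/a", "http://example.com/b"),
            4);                               // max depth 4: children are still kept
        System.out.println(out.size() + " outlinks kept");  // prints "2 outlinks kept"
    }
}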
