Vince, I have implemented crawl-depth limits in Nutch 0.7. Because there was no crawldb metadata support (yet), I had to store crawl-depth in a custom db (mapfile), and add a processing step to the crawl cycle.

We had this working for ~30k base URLs and depth limited to 4-20 links (min 4, extended when our feature-detector found relevant content).
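Roughly, the mechanism looks like the sketch below: a MapFile keyed by URL holding each page's depth, consulted by the extra processing step before a URL is queued for the next fetch round. This is only a minimal illustration against Hadoop's MapFile API (closer to the 0.9-era port than the original 0.7 io classes); the class name DepthDb, the method names, and the depth-store layout are made up here for clarity, not the actual patch.

import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

/** Hypothetical per-URL depth store backed by a Hadoop MapFile. */
public class DepthDb {
  private final MapFile.Reader reader;

  public DepthDb(Configuration conf, String dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    reader = new MapFile.Reader(fs, dir, conf);
  }

  /** Recorded depth for a URL, or 0 if unknown (e.g. a freshly injected seed). */
  public int readDepth(String url) throws IOException {
    IntWritable depth = new IntWritable();
    return reader.get(new Text(url), depth) != null ? depth.get() : 0;
  }

  /** Depth gate used by the extra processing step in the crawl cycle. */
  public boolean shouldFetch(String url, int maxDepth) throws IOException {
    return readDepth(url) < maxDepth;
  }

  public void close() throws IOException {
    reader.close();
  }

  /** Writing side: MapFile requires keys to be appended in sorted order. */
  public static void writeDepths(Configuration conf, String dir,
                                 SortedMap<String, Integer> depths) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, IntWritable.class);
    try {
      for (Map.Entry<String, Integer> e : depths.entrySet()) {
        writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    } finally {
      writer.close();
    }
  }
}

In our setup the processing step recorded each outlink at (parent depth + 1) and skipped anything past the per-site limit, which is where the "min 4, extended on relevant content" logic plugged in.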
I have a port of this code to Nutch 0.9, not written by me and not stress-tested. If you're interested, I can see if it's possible to release the source, and adapt it as a patch to the latest Nutch codebase.
(AFAIK, this functionality is essential for any vertical search engine. Which is to say, any search startup that wants to succeed by not attacking Google head-on, IMHO.)
--Matt Kangas
On Aug 14, 2007, at 11:47 AM, Vince Filby wrote:
I have a list of, say, 8 million URLs that I will need to crawl with Nutch, and I will also need to freshen these URLs on a regular basis (I will not be following external links though). Since I have so many URLs I would like to crawl breadth-first and restrict the depth to, say, 3 or 4 levels. I also want to be able to inject new URLs at any time and have Nutch automagically start crawling to the appropriate depth. In the intranet recrawl script, the depth is represented by a new segment with all the available links from the previous segment. With the large number of pages I will be crawling, I would like to restrict the segment size to something that can be crawled in a few hours so I can constantly maintain a fresh index.

How can I control depth with a much larger crawl, especially when there will be brand-new URLs thrown into the mix later on?

Any advice on this topic would be greatly appreciated,

Vince
--
Matt Kangas / [EMAIL PROTECTED]