Matt, (I didn't notice your message until this morning...)
I implemented this as well, but I am using Nutch 0.8 so that the generated
Lucene index version matches what our front-end searcher is using. I
implemented it in a manner similar to scoring. The first step was to add a
crawl-depth field to CrawlDatum. When I crawl a page I insert the current
page depth into the content metadata; then in ParseOutputFormat I inspect
that value and either choose not to crawl the links on this page or update
the crawl depth for each CrawlDatum. (A rough sketch of that check is
included after the quoted thread below.)

I am planning to submit this back to the Nutch community as a patch once we
have tested it a bit more. I am really new to the Nutch codebase, so there
may be a better way to do it. Any advice, suggestions or comments would be
awesome.

Cheers,
Vince

On 8/16/07, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> Vince, I have implemented crawl-depth limits in Nutch 0.7. Because
> there was no crawldb metadata support (yet), I had to store crawl-depth
> in a custom db (mapfile) and add a processing step to the crawl cycle.
>
> We had this working for ~30k base URLs and depth limited to 4-20
> links (min 4, extended when our feature detector found relevant
> content).
>
> I have a port of this code to Nutch 0.9, not written by me and not
> stress-tested. If you're interested, I can see if it's possible to
> release the source and adapt it as a patch to the latest Nutch
> codebase.
>
> (AFAIK, this functionality is essential for any vertical search
> engine. Which is to say, any search startup that wants to succeed by
> not attacking Google head-on, IMHO.)
>
> --Matt Kangas
>
>
> On Aug 14, 2007, at 11:47 AM, Vince Filby wrote:
>
> > I have a list of, say, 8 million URLs that I will need to crawl
> > with Nutch, and I will also need to freshen these URLs on a regular
> > basis (I will not be following external links, though). Since I have
> > so many URLs I would like to crawl breadth-first and restrict the
> > depth to, say, 3 or 4 levels. I also want to be able to inject new
> > URLs at any time and have Nutch automagically start crawling to the
> > appropriate depth. In the intranet recrawl script, the depth is
> > represented by a new segment with all the available links from the
> > previous segment. With the large number of pages I will be crawling,
> > I would like to restrict the segment size to something that can be
> > crawled in a few hours so I can constantly maintain a fresh index.
> >
> > How can I control depth with a much larger crawl, especially when
> > there will be brand-new URLs thrown into the mix later on?
> >
> > Any advice on this topic would be greatly appreciated,
> > Vince
>
> --
> Matt Kangas / [EMAIL PROTECTED]
>
>
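P.S. To make the idea concrete, here is a minimal, self-contained sketch of
the depth check described at the top of this mail. The names here
(DEPTH_KEY, OutlinkRecord, filterOutlinks, maxDepth) are hypothetical
placeholders for illustration only, not the actual Nutch 0.8
CrawlDatum/ParseOutputFormat API; it just shows reading the parent page's
depth from its metadata, dropping the outlinks once the limit is reached,
and otherwise tagging each child with depth + 1.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: these classes and names are stand-ins for the
 * real Nutch 0.8 CrawlDatum / ParseOutputFormat code, not the actual API.
 */
public class DepthFilterSketch {

    /** Metadata key under which the parent page's crawl depth is stored. */
    static final String DEPTH_KEY = "crawl.depth";

    /** Minimal stand-in for an outlink's crawl datum. */
    static class OutlinkRecord {
        final String url;
        final Map<String, String> metadata = new HashMap<>();
        OutlinkRecord(String url) { this.url = url; }
    }

    /**
     * Read the parent page's depth from its (content) metadata; if the
     * configured limit has been reached, follow no links at all,
     * otherwise emit each outlink with depth = parent depth + 1.
     */
    static List<OutlinkRecord> filterOutlinks(Map<String, String> parentMeta,
                                              List<String> outlinkUrls,
                                              int maxDepth) {
        int parentDepth = Integer.parseInt(parentMeta.getOrDefault(DEPTH_KEY, "0"));
        List<OutlinkRecord> kept = new ArrayList<>();
        if (parentDepth >= maxDepth) {
            return kept;                      // depth limit reached: drop all outlinks
        }
        for (String url : outlinkUrls) {
            OutlinkRecord datum = new OutlinkRecord(url);
            datum.metadata.put(DEPTH_KEY, Integer.toString(parentDepth + 1));
            kept.add(datum);                  // child inherits depth + 1
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> parentMeta = new HashMap<>();
        parentMeta.put(DEPTH_KEY, "3");       // parent page was crawled at depth 3
        List<OutlinkRecord> out = filterOutlinks(
            parentMeta,
            Arrays.asList("http://example.com/a", "http://example.com/b"),
            4);                               // max depth 4: children are still kept
        System.out.println(out.size() + " outlinks kept");  // prints "2 outlinks kept"
    }
}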
