Vince, I have implemented crawl-depth limits in Nutch 0.7. Because there was no crawldb metadata support (yet), I had to store crawl-depth in a custom db (mapfile), and add a processing step to the crawl cycle.

We had this working for ~30k base URLs and depth limited to 4-20 links (min 4, extended when our feature-detector found relevant content).
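Roughly, the mechanism looks like the sketch below: a MapFile keyed by URL holding each page's depth, consulted by the extra processing step before a URL is queued for the next fetch round. This is only a minimal illustration against Hadoop's MapFile API (closer to the 0.9-era port than the original 0.7 io classes); the class name DepthDb, the method names, and the depth-store layout are made up here for clarity, not the actual patch.

import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

/** Hypothetical per-URL depth store backed by a Hadoop MapFile. */
public class DepthDb {
  private final MapFile.Reader reader;

  public DepthDb(Configuration conf, String dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    reader = new MapFile.Reader(fs, dir, conf);
  }

  /** Recorded depth for a URL, or 0 if unknown (e.g. a freshly injected seed). */
  public int readDepth(String url) throws IOException {
    IntWritable depth = new IntWritable();
    return reader.get(new Text(url), depth) != null ? depth.get() : 0;
  }

  /** Depth gate used by the extra processing step in the crawl cycle. */
  public boolean shouldFetch(String url, int maxDepth) throws IOException {
    return readDepth(url) < maxDepth;
  }

  public void close() throws IOException {
    reader.close();
  }

  /** Writing side: MapFile requires keys to be appended in sorted order. */
  public static void writeDepths(Configuration conf, String dir,
                                 SortedMap<String, Integer> depths) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, IntWritable.class);
    try {
      for (Map.Entry<String, Integer> e : depths.entrySet()) {
        writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    } finally {
      writer.close();
    }
  }
}

In our setup the processing step recorded each outlink at (parent depth + 1) and skipped anything past the per-site limit, which is where the "min 4, extended on relevant content" logic plugged in.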
I have a port of this code to Nutch 0.9, not written by me and not stress-tested. If you're interested, I can see if it's possible to release the source, and adapt it as a patch to the latest Nutch codebase.
(AFAIK, this functionality is essential for any vertical search engine. Which is to say, any search startup that wants to succeed by not attacking Google head-on, IMHO.)
--Matt Kangas
On Aug 14, 2007, at 11:47 AM, Vince Filby wrote:
I have a list of, say, 8 million URLs that I will need to crawl with Nutch, and I will also need to freshen these URLs on a regular basis (I will not be following external links though). Since I have so many URLs I would like to crawl breadth-first and restrict the depth to, say, 3 or 4 levels. I also want to be able to inject new URLs at any time and have Nutch automagically start crawling to the appropriate depth. In the intranet recrawl script, the depth is represented by a new segment with all the available links from the previous segment. With the large number of pages I will be crawling, I would like to restrict the segment size to something that can be crawled in a few hours so I can constantly maintain a fresh index.

How can I control depth with a much larger crawl, especially when there will be brand-new URLs thrown into the mix later on?

Any advice on this topic would be greatly appreciated,

Vince
--
Matt Kangas / [EMAIL PROTECTED]