I have a list of, say, 8 million URLs that I will need to crawl with Nutch,
and I will also need to refresh these URLs on a regular basis (I will not
be following external links, though).  Since I have so many URLs, I would
like to crawl breadth-first and restrict the depth to, say, 3 or 4 levels.  I
also want to be able to inject new URLs at any time and have Nutch
automagically start crawling them to the appropriate depth. In the intranet
recrawl script, depth is represented by a new segment containing all the
links discovered in the previous segment.  With the large number of pages I
will be crawling, I would like to restrict the segment size to something
that can be fetched in a few hours, so I can constantly maintain a fresh
index.
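
For reference, the kind of cycle I have in mind looks roughly like this
(a minimal sketch assuming Nutch's 0.8-style inject/generate/fetch/updatedb
commands; the paths and the topN value are just placeholders for my setup):

#!/bin/sh
# One generate/fetch/updatedb cycle = one "level" of depth.
# Paths and options below are assumptions, not a definitive script.

NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
TOPN=250000          # cap the segment size so a cycle finishes in a few hours

# Inject any newly added seed URLs; existing crawldb entries are kept.
$NUTCH inject $CRAWLDB urls/

# Generate a segment limited to the top-scoring TOPN URLs, fetch it,
# then fold the results (and newly discovered links) back into the crawldb.
$NUTCH generate $CRAWLDB $SEGMENTS -topN $TOPN
SEGMENT=`ls -d $SEGMENTS/* | tail -1`

$NUTCH fetch $SEGMENT
$NUTCH updatedb $CRAWLDB $SEGMENT

Repeating that cycle 3 or 4 times would presumably approximate my depth
limit, but it is not obvious how that interacts with URLs injected later.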

How can I control depth with a much larger crawl, especially when brand new
URLs will be thrown into the mix later on?

Any advice on this topic would be greatly appreciated,
Vince
