I have a list of roughly 8 million URLs that I need to crawl with Nutch, and I will also need to refresh these URLs on a regular basis (I will not be following external links, though). Since I have so many URLs, I would like to crawl breadth first and restrict the depth to, say, 3 or 4 levels. I also want to be able to inject new URLs at any time and have Nutch automagically start crawling them to the appropriate depth. In the intranet recrawl script, each depth level corresponds to a new segment generated from the links discovered in the previous segment. Given the large number of pages I will be crawling, I would like to restrict the segment size to something that can be fetched in a few hours, so I can constantly maintain a fresh index; roughly the kind of loop I have in mind is sketched below.
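
To make my intent concrete, here is a rough sketch of the loop I'm picturing (this is only a sketch, not something I'm running: the crawl/crawldb and crawl/segments paths, the seed directory name "urls", and the 100000 topN figure are placeholders, and I'm assuming the standard inject/generate/fetch/updatedb commands with fetch-time parsing enabled):

    #!/bin/bash
    # One generate/fetch/updatedb cycle per depth level.
    CRAWLDB=crawl/crawldb      # placeholder path
    SEGMENTS=crawl/segments    # placeholder path
    DEPTH=3
    TOPN=100000                # guess at a segment size fetchable in a few hours

    # Inject the seed list (and, later, any newly added URLs) into the crawldb.
    bin/nutch inject $CRAWLDB urls

    for ((i = 1; i <= DEPTH; i++)); do
      # Generate a fetchlist capped at TOPN pages so each segment stays small.
      bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN
      SEGMENT=`ls -d $SEGMENTS/2* | tail -1`

      # Fetch the segment, then fold the results and discovered outlinks
      # back into the crawldb for the next round.
      bin/nutch fetch $SEGMENT
      bin/nutch updatedb $CRAWLDB $SEGMENT
    done

The trouble is that the depth counter here only lives inside one run of the script, so URLs injected later don't carry any notion of how deep they have already been crawled.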
How can I control depth with a much larger crawl, especially when brand new URLs will be thrown into the mix later on?

Any advice on this topic would be greatly appreciated.

Vince
