Seeded with a list of URLs, the Nutch whole-web crawler takes an unknown number of generate/fetch/updatedb cycles to reach some level of completeness, both for internal links and for outlinks. Monitoring progress is therefore crucial. I'd appreciate suggestions or best practices on the following questions:

1. After each cycle, how can I list which URLs were fetched successfully and, separately, which URLs failed?
2. Are there tools for creating progress reports?
3. Will failed URLs be included in the next fetchlist produced by "nutch generate"? If not, how can I control when they are retried?
4. What is a good way to measure the completeness of crawling a list of sites, say 1000 seed URLs? For a site's internal links, how can I determine that all of them were fetched, or at least attempted? The same question applies to outlinks.
5. I see a great need to automate this process. Is there a tool, or a plan in Nutch, for such automation? Has anybody developed an automated process they can share?
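For question 1, one approach I have been experimenting with is to dump the crawldb with "bin/nutch readdb <crawldb> -dump <dir>" and grep the plain-text dump by status. A minimal sketch, assuming the Nutch 1.x dump format where each record starts with the URL and is followed by a "Status: N (db_xxx)" line; the sample records and file name part-00000 below are stand-ins for a real dump:

```shell
# Stand-in for <dumpdir>/part-00000, as produced by:
#   bin/nutch readdb crawl/crawldb -dump dumpdir
cat > part-00000 <<'EOF'
http://example.com/	Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Jan 01 00:00:00 UTC 2024

http://example.com/missing	Version: 7
Status: 3 (db_gone)
Fetch time: Mon Jan 01 00:00:00 UTC 2024
EOF

# Remember the most recent URL line, then emit it when the following
# Status line carries the status we are filtering for.
awk '/^http/ {url=$1} /db_fetched/ {print url}' part-00000 > fetched_urls.txt
awk '/^http/ {url=$1} /db_gone/    {print url}' part-00000 > failed_urls.txt

cat fetched_urls.txt   # -> http://example.com/
cat failed_urls.txt    # -> http://example.com/missing
```

"bin/nutch readdb <crawldb> -stats" also prints aggregate counts per status after each cycle, which may be enough for a coarse progress report.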

Thanks,
AJ
