Seeded with a list of URLs, the Nutch whole-web crawler takes an unknown
number of generate/fetch/updatedb cycles to reach some level of
completeness, both for internal links and for outlinks. Monitoring
progress is therefore crucial. I'd appreciate suggestions or best
practices on the following questions:
1. After each cycle, how can I list which URLs were fetched
successfully and, separately, which URLs failed? (The commands I have
been trying so far are sketched after this list.)
2. Are there tools to create progress reports?
3. Will these failed URLs be included in the next fetchlist generated
by "nutch generate"? If not, how can I control when they get fetched
again? (My guess at the relevant knobs is also sketched below.)
4. What is a good way to measure the completeness of a crawl over a
list of sites, say 1000 seed URLs? For a site's internal links, how can
I determine that all of them were fetched, or at least attempted? The
same question applies to outlinks.
5. I see a great need to automate this process. Is there a tool, or a
plan in Nutch, for such automation? Has anybody developed an automated
process they can share? (A rough sketch of the driver loop I have in
mind follows.)
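For question 1, here is what I have been doing so far to inspect the
crawldb after each cycle. "readdb -stats" and "readdb -dump" are
existing Nutch commands; "crawldb", "crawldb_dump", and the grep
offsets are just my local paths and guesses, since the dump layout
seems to vary across versions:

    # summary counts by status (db_fetched, db_unfetched, db_gone, ...)
    bin/nutch readdb crawldb -stats

    # dump every URL with its CrawlDatum, then grep by status
    bin/nutch readdb crawldb -dump crawldb_dump
    grep -B 1 'db_gone' crawldb_dump/part-00000   # adjust -B for your layout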
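On question 3, my reading (which I would like confirmed) is that
transiently failed URLs stay db_unfetched with an incremented retry
counter, so they remain eligible for generation, and that
db.fetch.retry.max and db.fetch.interval.default in nutch-site.xml
control when a URL is given up on or re-fetched. If so, something like
this should show which URLs have failed at least once but are not yet
db_gone (again, the -B offset and part-00000 path are guesses):

    grep -B 5 'Retries since fetch: [1-9]' crawldb_dump/part-00000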
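And for question 5, this is the shape of what I would want to
automate, as a minimal sketch only: DEPTH, TOPN, and the
crawldb/segments paths are placeholders for my setup, and a real
version would loop until some completeness test passes rather than a
fixed round count:

    #!/bin/sh
    # minimal generate/fetch/updatedb driver (sketch, not production)
    DEPTH=5        # placeholder: fixed rounds instead of a completeness test
    TOPN=1000
    for i in `seq 1 $DEPTH`; do
      bin/nutch generate crawldb segments -topN $TOPN
      SEGMENT=`ls -d segments/2* | tail -1`   # newest segment just generated
      bin/nutch fetch $SEGMENT
      bin/nutch updatedb crawldb $SEGMENT
      bin/nutch readdb crawldb -stats         # progress snapshot per round
    done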
Thanks,
AJ