Ken van Mulder wrote:
The first problem is that the fetcher slows down over time and uses more and more memory as it goes (which I think eventually hangs the process).

What parser plugins do you have enabled? These are usually the culprit. Try sending the process a 'kill -QUIT' to dump what the various threads are doing, both at the start and later, when it slows and grows.
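
For example (the pid 12345 below is illustrative; use 'ps' to find the fetcher's JVM process id):

$ ps ax | grep nutch
$ kill -QUIT 12345

SIGQUIT makes the JVM print a full thread dump to the console or log where the fetcher was started. Comparing an early dump against a later one shows which threads are stuck or piling up, and in which plugin code.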

The second problem is trying to use the crawl. I've tried with a seeds/url file containing 4, 2,000, and then 100k URLs, using:

$ bin/nutch crawl seeds

This goes through its processing and completes, but doesn't visit any of the URLs in the seeds file. What am I missing to get it to actually do the crawl?

Are you using NDFS? If so, the seeds directory needs to be stored in NDFS. Use 'bin/nutch ndfs -put seeds seeds'.
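
For example, roughly (the '-ls' check is just to confirm the copy landed, assuming the NDFS shell's -ls behaves like the Unix one; adjust paths to your setup):

$ bin/nutch ndfs -put seeds seeds
$ bin/nutch ndfs -ls seeds
$ bin/nutch crawl seeds

The first command copies the local 'seeds' file into NDFS under the same name; once it is there, the crawl command can actually read the seed list.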

Doug
