Ken van Mulder wrote:
> First is that the fetcher slows down over time and continues to use
> more and more memory as it goes (which I think is eventually hanging
> the process).
What parser plugins do you have enabled? These are usually the
culprits. Try sending 'kill -QUIT' to the process to see what the
various threads are doing, both at the start and later, once it has
slowed down and grown.
> Second problem is trying to use the crawl. I've tried with a
> seeds/url file containing 4, 2000, and then 100k urls in it. Using:
>
> $ bin/nutch crawl seeds
>
> Which goes through its processing and completes, but doesn't visit
> any of the urls in the seeds file. What am I missing to get it to
> actually do the crawl?
Are you using NDFS? If so, the seeds directory needs to be stored in
NDFS. Use 'bin/nutch ndfs -put seeds seeds'.
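Putting the whole sequence together, it would look roughly like the following; the '-ls' check is an assumption about the NDFS shell, so drop it if your build doesn't support it:

```shell
# Copy the local seeds directory into NDFS (paths are relative
# to the NDFS working directory):
bin/nutch ndfs -put seeds seeds

# Optionally confirm the seeds landed in NDFS (assumed -ls flag):
bin/nutch ndfs -ls

# Now the crawl reads its seed list from NDFS rather than from
# the local filesystem:
bin/nutch crawl seeds
```

When the crawl completes without fetching anything, an empty or missing seed directory on the filesystem Nutch is actually reading from is the usual cause.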
Doug