Paul Tomblin wrote:
My Nutch crawl just stopped.  The process is still there and doesn't
respond to a "kill -TERM" or a "kill -HUP", but it hasn't written
anything to the log file in the last 40 minutes.  The last thing it
logged was some calls to my custom URL filter.  Nothing has been
written to the hadoop directory, the crawldir/crawldb or the
segments dir in that time.

How can I tell what's going on and why it's stopped?

If you run in distributed / pseudo-distributed mode, you can check the job status in the JobTracker UI. If you are running in "local" mode, then it's likely that the process is in a (single) reduce phase, sorting the data - with larger jobs in "local" mode the sort phase can take a very long time because of heavy disk I/O, and a process stuck in disk-wait is uninterruptible, which is why your kill signals have no visible effect.
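One quick way to confirm the disk-wait theory (assuming a Linux box; <nutch-pid> below is just a placeholder for your process id) is to look at the process state:

  ps -o pid,stat,wchan,cmd -p <nutch-pid>

A STAT of "D" means uninterruptible sleep, usually waiting on disk I/O. In pseudo-distributed mode the JobTracker UI is typically at http://localhost:50030/ unless you changed the port.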

Try to generate a thread dump to see what code is being executed.
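For a JVM process the simplest ways to get one (again, <nutch-pid> is a placeholder) are:

  kill -QUIT <nutch-pid>   # JVM dumps all thread stacks to its stdout / console log
  jstack <nutch-pid>       # same information printed in your terminal, if your JDK ships it

If all the worker threads are sitting in sort/merge or in your URL filter code, that tells you where the time is going.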

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com