Brian Whitman wrote:
On yesterday's nutch-nightly, following Dennis Kubes's suggestions on how to normalize URLs, I removed the parsed folders via

rm -rf crawl_parse parse_data parse_text

from a recent crawl so I could re-parse the crawl using a regex urlnormalizer.
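For reference, the regex normalizer only runs if its plugin is enabled. A quick sanity check (plugin and file names as in a stock 0.9-era Nutch install; verify against your version):

# plugin.includes in conf/nutch-site.xml should list the plugin,
# e.g. a value containing:  urlnormalizer-(pass|regex|basic)
grep -A 2 'plugin.includes' conf/nutch-site.xml

# the rewrite rules themselves live in conf/regex-normalize.xml
cat conf/regex-normalize.xml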

I ran bin/nutch parse crawl/segments/2007.... on an 80K document segment.
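Putting the two steps together, the per-segment re-parse sequence looks roughly like this (segment name truncated as above; run from the Nutch install directory, local filesystem assumed):

SEGMENT=crawl/segments/2007....

# drop the old parse output so the segment can be parsed again
rm -rf $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

# re-run the parse step over the fetched content
bin/nutch parse $SEGMENT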

The hadoop log (set to INFO) showed a lot of warnings about unparsable documents, with a mapred.JobClient - map XX% reduce 0% ticker steadily going up. It then stopped at map 49% with no further warnings or info, and it has been that way for about 6 hours. top shows java at 99% CPU.

Is it hung, or should re-parsing an already-crawled segment take this long? Shouldn't hadoop be showing the parse progress?

To test, I killed the process and set my nutch-site back to the original -- no URL normalizer. No change -- it still hangs in the same spot. Any ideas?

In such a case you should always take a full thread dump of the JVM process. On Unix systems this is done with "kill -SIGQUIT <pid>"; on Windows, with Ctrl-Break.
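For example (pid placeholder as above; the dump goes to the JVM's stdout, so for a foreground bin/nutch run look at the console or wherever its output is redirected):

# find the pid of the hung java process (the one pegged at 99% CPU in top)
ps ax | grep java

# request a full thread dump; the process keeps running afterwards
kill -SIGQUIT <pid>    # equivalently: kill -3 <pid>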

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

