re-parse hang?

Brian Whitman Wed, 03 Jan 2007 20:10:12 -0800

On yesterday's nutch-nightly, following Dennis Kubes' suggestions on how to normalize URLs, I removed the parsed folders via


rm -rf crawl_parse parse_data parse_text

from a recent crawl so I could re-parse the crawl using a regexurlnormalizer.


I ran bin/nutch parse crawl/segments/2007.... on an 80K document segment.
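
For anyone trying to reproduce this, the two steps above can be sketched roughly as follows. This is only a sketch: the function name reparse_cmds and the segment argument are placeholders introduced here (the real segment path is truncated above), and it assumes the three parse directories sit directly under the segment directory. It prints the commands instead of running them, so nothing is deleted by accident.

```shell
#!/bin/sh
# Print (rather than run) the re-parse steps for one segment.
# reparse_cmds and its argument are hypothetical names for illustration;
# the actual segment path is not given in full in this post.
reparse_cmds() {
    segment="$1"
    # Drop the old parse output so the parse job can write it again.
    echo "rm -rf $segment/crawl_parse $segment/parse_data $segment/parse_text"
    # Re-run the parser over the already fetched content.
    echo "bin/nutch parse $segment"
}
```

Calling reparse_cmds with a segment directory prints the rm line followed by the bin/nutch parse line, which can be reviewed before being run by hand.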

The hadoop log (set to INFO) showed a lot of warnings on unparsable documents, with a mapred.JobClient - map XX% reduce 0% ticker steadily going up. It then stopped at map 49% with no more warnings or info, and has been that way for about 6 hours. Top shows java at 99% CPU.

Is it hung, or should re-parsing an already crawled segment take this long? Shouldn't hadoop be showing the parse progress?

To test, I killed the process and set my nutch-site back to the original -- no url normalizer. No change -- still hangs in the same spot. Any ideas?


-Brian
