Re: re-parse hang?

Dennis Kubes Thu, 04 Jan 2007 07:48:23 -0800

What Nutch version are you using, and what is your setup? An 80K re-parse should only take a few minutes at most.


Dennis


Brian Whitman wrote:

On yesterday's nutch-nightly, following Dennis Kubes' suggestions on how to normalize URLs, I removed the parsed folders via
rm -rf crawl_parse parse_data parse_text
from a recent crawl so I could re-parse the crawl using a regex urlnormalizer.
I ran bin/nutch parse crawl/segments/2007.... on an 80K-document segment.
The hadoop log (set to INFO) showed a lot of warnings on unparsable documents, with a mapred.JobClient - map XX% reduce 0% ticker steadily going up. It then stopped at map 49% with no more warnings or info, and has been that way for about 6 hours. Top shows java at 99% CPU.
Is it hung, or should re-parsing an already crawled segment take this long? Shouldn't hadoop be showing the parse progress?
To test, I killed the process and set my nutch-site back to the original -- no url normalizer. No change -- still hangs in the same spot. Any ideas?
-Brian
