From my experience this could be one of two things. You should have this in your hadoop-site.xml file when using Hadoop 0.9, if you don't already:

<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>
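For completeness, a minimal hadoop-site.xml carrying just this override would look like the sketch below; the <configuration> wrapper is the standard root element of a Hadoop config file, and any properties you already have would sit alongside this one:

  <?xml version="1.0"?>
  <configuration>
    <!-- Speculative execution launches backup copies of slow tasks;
         disabling it avoids duplicate attempts racing on the same work. -->
    <property>
      <name>mapred.speculative.execution</name>
      <value>false</value>
    </property>
  </configuration>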
That could be one reason for the process hang; the other could be due to this issue: http://issues.apache.org/jira/browse/NUTCH-233

In either case, re-parsing should never take that long. Whenever Java pegs the CPU like that, it's rarely a good thing.

----- Original Message ----
From: Brian Whitman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:09:47 PM
Subject: re-parse hang?

On yesterday's nutch-nightly, following Dennis Kubes's suggestions on how to normalize URLs, I removed the parsed folders via rm -rf crawl_parse parse_data parse_text from a recent crawl so I could re-parse the crawl using a regex urlnormalizer. I ran bin/nutch parse crawl/segments/2007.... on an 80K-document segment.

The hadoop log (set to INFO) showed a lot of warnings on unparsable documents, with a mapred.JobClient - map XX% reduce 0% ticker steadily going up. It then stopped at map 49% with no more warnings or info, and has been that way for about 6 hours. Top shows java at 99% CPU. Is it hung, or should re-parsing an already-crawled segment take this long? Shouldn't hadoop be showing the parse progress?

To test, I killed the process and set my nutch-site back to the original -- no url normalizer. No change -- still hangs in the same spot.

Any ideas?

-Brian
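For reference, the re-parse sequence described above, restated as a sketch assuming the standard Nutch segment layout (crawl_parse, parse_data, and parse_text live inside the segment directory). The segment timestamp is truncated in the original message, so the path below is a placeholder:

  # placeholder name; substitute the real segment directory
  SEGMENT=crawl/segments/2007XXXXXXXXXX

  # remove the old parse output from the segment
  rm -rf $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

  # re-parse the already-fetched segment
  bin/nutch parse $SEGMENT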
