From my experience this could be one of two things. If you don't have it already, you should have this in your hadoop-site.xml file when using Hadoop 0.9:
<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>
That could be one reason for the process hang; the other could be this issue:
http://issues.apache.org/jira/browse/NUTCH-233
In either case, re-parsing should never take that long; whenever Java pegs the
CPU like that, it's rarely a good thing.
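For what it's worth, one common way a Java process ends up pinned at 99% CPU
with no progress is a regular expression backtracking on a pathological input.
This is only an illustration (the pattern and input below are made up, not
taken from Nutch's configuration), but it shows the symptom -- the matcher
below will sit at 100% CPU more or less forever:

import java.util.regex.Pattern;

public class RegexHang {
    public static void main(String[] args) {
        // Nested quantifiers force exponential backtracking when the input
        // almost matches but ultimately fails (note the trailing '!').
        Pattern p = Pattern.compile("^(([a-z]+)+)$");
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
        System.out.println(p.matcher(input).matches()); // effectively never returns
    }
}

If you suspect something like that, a thread dump of the stuck process
(kill -QUIT <pid>, or jstack <pid> on Java 5+) should show whether the busy
thread is inside java.util.regex or somewhere in the parser.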
----- Original Message ----
From: Brian Whitman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:09:47 PM
Subject: re-parse hang?
On yesterday's nutch-nightly, following Dennis Kubes's suggestions on how to
normalize URLs, I removed the parsed folders via
rm -rf crawl_parse parse_data parse_text
from a recent crawl so I could re-parse the crawl using a regex
urlnormalizer.
I ran bin/nutch parse crawl/segments/2007.... on an 80K-document segment.
The hadoop log (set to INFO) showed a lot of warnings on unparsable
documents, with a mapred.JobClient - map XX% reduce 0% ticker
steadily going up. It then stopped at map 49% with no more warnings
or info, and has been that way for about 6 hours. Top shows java at
99% CPU.
Is it hung or should re-parsing an already crawled segment take this
long? Shouldn't hadoop be showing the parse progress?
To test, I killed the process and set my nutch-site back to the
original -- no URL normalizer. No change -- still hangs in the same
spot. Any ideas?
-Brian