On Jan 4, 2007, at 10:47 AM, Dennis Kubes wrote:
What Nutch version are you using, and what is your setup? An 80K
reparse should only take a few minutes at most.
Hi, not sure if my follow-up mail got through, but I found out that my
re-parse hang was coming from the parse-mp3 plugin -- it was hanging
on a particular mp3 file. I'm looking into it...
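(If anyone hits something similar: a plain Java thread dump is the
quickest way to see where a parse is stuck. Nothing Nutch-specific
here, and where the dump lands depends on how you launched the job:

kill -QUIT <java-pid>   # the JVM prints all thread stacks to its
                        # stdout, i.e. the console if bin/nutch is
                        # running in the foreground
jstack <java-pid>       # same idea, on a JDK that ships jstack

If every dump shows a thread sitting inside the same parser plugin,
it's a parse hang rather than slow I/O.)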
That said, my 80K reparse (after taking out parse-mp3) took about 30
minutes on a dual Xeon 3.0 Debian machine with 4GB RAM, running the
Nutch nightly from two days ago. Does this seem slower than normal?
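(For reference, "taking out parse-mp3" just means dropping it from the
plugin.includes regex in conf/nutch-site.xml. Roughly like this -- the
plugin list below is only an example, not a recommended set:

<property>
  <name>plugin.includes</name>
  <!-- note: no parse-mp3 in the parse-(...) alternation -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|urlnormalizer-regex</value>
</property>

Plugins not matched by that regex are never loaded, so mp3s just come
back as unparsable instead of hanging the job.)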
Brian Whitman wrote:
On yesterday's nutch-nightly, following Dennis Kubes's suggestions on
how to normalize URLs, I removed the parsed folders via
rm -rf crawl_parse parse_data parse_text
from a recent crawl so I could re-parse the crawl using a regex
urlnormalizer.
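(The regex normalizer reads its rules from conf/regex-normalize.xml,
one pattern/substitution pair per rule -- something like this, as an
illustrative rule only:

<regex-normalize>
  <!-- example rule: treat .../index.html as the directory URL -->
  <regex>
    <pattern>/index\.html$</pattern>
    <substitution>/</substitution>
  </regex>
</regex-normalize>

urlnormalizer-regex also has to be listed in plugin.includes or the
rules never get picked up.)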
I ran bin/nutch parse crawl/segments/2007.... on an 80K document
segment.
The Hadoop log (set to INFO) showed a lot of warnings on
unparsable documents, with a mapred.JobClient - map XX% reduce 0%
ticker steadily going up. It then stopped at map 49% with no more
warnings or info, and has been that way for about 6 hours. top
shows java at 99% CPU.
Is it hung, or should re-parsing an already-crawled segment take
this long? Shouldn't Hadoop be showing the parse progress?
To test, I killed the process and set my nutch-site back to the
original -- no url normalizer. No change -- still hangs in the same
spot. Any ideas?
-Brian
--
http://variogr.am/
[EMAIL PROTECTED]