I'm running a web crawl with Nutch 0.9 on a 5-node Hadoop 0.12.2 Linux
cluster.  The initial injection, fetching, parsing, etc. ran just fine
and came back with ~3 million new URLs.  The next fetch/parse round
also appeared to run just fine (no errors, and it reported completion),
but although it fetched around 3 million pages it only parsed around
40k of them.
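
For reference, the per-segment cycle I'm running is roughly the
following (the crawl paths, segment name, and topN value are just
placeholders for illustration):

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments -topN 3000000
    # generate creates a new timestamped segment dir; use that name below
    bin/nutch fetch crawl/segments/20070815123456
    bin/nutch parse crawl/segments/20070815123456
    bin/nutch updatedb crawl/crawldb crawl/segments/20070815123456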

Since this was a whole-web crawl, 40k parsed pages out of 3 million
fetched just didn't make sense, so I deleted the parse folders
(crawl_parse, parse_data, parse_text) and ran the parse on the segment
again.  It seemed to be running fine, making progress on both map and
reduce tasks, but the instant the last map task finished it stopped
making progress (reduce tasks stuck at 29% complete).  I left it for
about 4 hours without any change.
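
Concretely, the cleanup and re-run looked something like this (the
segment name is just an example):

    bin/hadoop dfs -rmr crawl/segments/20070815123456/crawl_parse
    bin/hadoop dfs -rmr crawl/segments/20070815123456/parse_data
    bin/hadoop dfs -rmr crawl/segments/20070815123456/parse_text
    bin/nutch parse crawl/segments/20070815123456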

Then I killed the job and restarted it in the hope it was just some
fluke, but it did the same thing again: reduce progress stopped the
instant the last map task completed.  I left it running for about 10
hours after the maps finished and still saw no progress.
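
If the exact commands matter, the kill and restart were roughly the
following (the job id is just an example):

    bin/hadoop job -kill job_0042
    bin/nutch parse crawl/segments/20070815123456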

Looking at the nodes themselves, each has most of its RAM in use
(2 GB each), but almost no processor or network usage.  I've noticed an
occasional spike of processor activity every few minutes, along with
some apparent network traffic, but I haven't seen the used disk space
change on any of the nodes.
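
In case it's relevant, this is how I've been checking the slaves
(hostnames are just examples):

    ssh slave1 'free -m'                 # memory: most of the 2 GB in use
    ssh slave1 'top -b -n 1 | head -20'  # cpu: mostly idle, occasional brief spikes
    ssh slave1 'df -h'                   # disk: used space isn't changing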

Anyone have any idea what might be going on or, better yet, how to fix it?
