I'm running a web crawl with Nutch 0.9 on a 5-node Hadoop 0.12.2 Linux cluster. The initial injection, fetching, parsing, etc. ran just fine and came back with ~3 million new URLs. The next round also appeared to run just fine (no errors, and it reported as completed), but although it fetched around 3 million pages, it only parsed around 40k of them.
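For reference, the per-round sequence I'm running is roughly the standard Nutch 0.9 one, with the parse run as its own job after the fetch (the crawl/ paths and segment name below are just placeholders for how I have things laid out):

    # inject the seed URLs and generate a fetch list
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments

    # fetch the newest segment, then parse it as a separate job
    SEGMENT=crawl/segments/20070501123456    # placeholder segment name
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT

    # fold the results back into the crawldb
    bin/nutch updatedb crawl/crawldb $SEGMENT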
Since this was a whole-web crawl, 40k parsed pages out of 3 million fetched just didn't make sense, so I deleted the parse folders (crawl_parse, parse_data, parse_text) from the segment and ran the parse on it again.

It seemed to be running fine, making progress on both map and reduce tasks, but the instant the last map task finished it stopped making progress (reduce tasks stuck at 29% complete). I left it for about 4 hours without any change. Then I killed the job and restarted it in the hope it was just some fluke, but it did the same thing again: reduce progress stopped the instant the last map task completed. This time I left it running for about 10 hours after mapping completed, and still no progress.

Looking at the nodes themselves, each has most of its RAM in use (2 GB per node) but almost no CPU or network usage. I've noticed an occasional spike of CPU every few minutes along with what looks like some network traffic, but I haven't seen any change in the used disk space on any of the nodes.

Anyone have any idea what might be going on or, better yet, how to fix it?
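In case it matters, this is roughly how I cleared out the old parse data and re-ran the job (the segment path is again a placeholder):

    # remove the old parse output for the segment from DFS
    bin/hadoop dfs -rmr crawl/segments/20070501123456/crawl_parse
    bin/hadoop dfs -rmr crawl/segments/20070501123456/parse_data
    bin/hadoop dfs -rmr crawl/segments/20070501123456/parse_text

    # re-run the parse job on the same segment
    bin/nutch parse crawl/segments/20070501123456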
