Hi Jothi,

We are trying to index around 245GB of compressed data (~1TB uncompressed) on a 9-node Hadoop cluster with 8 slaves and 1 master. In the map phase, we just parse the files and pass the output on to the reduce phase. In the reduce phase, we index the parsed data, much like Nutch does.
When we ran the job, the map phase finished in under 4 hours. But something strange happened with the reduces: they went past 100% progress (some reached 200%!) before getting killed. Is this some kind of bug in Hadoop? All of them were eventually killed with "Task attempt_200907091637_0004_r_000000_0 failed to report status for 1201 seconds. Killing!" But I suspect indexing in the reduce simply takes longer than 1200 seconds. How should we go about this?

Thanks in advance,
Prashant
Search and Information Extraction Lab, IIIT-Hyderabad, INDIA
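[For anyone hitting the same "failed to report status" kill: it is governed by the task timeout setting, mapred.task.timeout, specified in milliseconds. A minimal sketch of raising it, assuming a mapred-site.xml or per-job configuration override (the one-hour value here is illustrative, not a recommendation):

```xml
<!-- Raise the task timeout so long-running reduces are not killed.
     Value is in milliseconds; 3600000 ms = 1 hour (illustrative). -->
<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
</property>
```

The more robust fix is to have the long-running reduce call reporter.progress() (or increment a counter) periodically while indexing, so the TaskTracker sees the task as alive no matter how long the work takes.]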
