Hi,

I'm trying to crawl approximately 500,000 URLs. After inject and generate I started the fetchers with 6 map tasks and 3 reduce tasks. All of the map tasks completed successfully, while all of the reduce tasks failed with an OutOfMemory exception. The exception was thrown after the append phase, during the sort phase.

As far as I have observed, during a fetch each map task writes its output to a temporary sequence file. During the reduce phase, each reducer copies all of the map outputs to its local disk and appends them into a single sequence file. The reducer then sorts this file and writes the sorted result to its local disk, and finally a record writer is opened to write the sorted file into the segment, which lives in DFS.

If this scenario is correct, then all of the reduce tasks are doing the same job: each one tries to sort the whole set of map outputs, and the winner of that race gets to write to DFS, so only one reducer actually ends up writing there. If that is the case, an OutOfMemory exception is not surprising for 500,000+ URLs, since each reducer is trying to sort a file larger than 1 GB.

Any comments on this scenario are welcome. How can I avoid these exceptions?

Thanks,
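
P.S. To make the question concrete, the kind of change I have been assuming might help is giving the child JVMs more heap and/or shrinking the in-memory sort buffer via hadoop-site.xml. This is only a sketch; the property names and the default heap size mentioned below are just my reading of hadoop-default.xml, so please correct me if these are the wrong knobs for this version:

<!-- Sketch only: raise the child task heap (the default seems to be -Xmx200m)
     and lower the sort buffer so it fits comfortably inside that heap. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>io.sort.mb</name>
  <value>50</value>
</property>

The idea being that the per-task sort buffer should fit well inside the child heap; whether that is enough to avoid the exception when the appended file is over 1 GB, I don't know.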
-- Hamza KAYA
