Hi all, I have started using Hadoop for a few of my Natural Language Processing applications. I ran into a problem: my programs throw an OutOfMemoryError during the map phase.
I looked into the implementation and noticed that all the intermediate key-value pairs are collected in memory for the entire duration of any single MapRunner instance. As I understand from reading the code, the MapRunner keeps calling the user-defined map() method for all the key-value pairs assigned to it by the MapTask. The MapTask checks whether it should dump the intermediate key-value pairs to disk only after MapRunner.run() has returned.

Because of the nature of my application, I ended up emitting too many intermediate key-value pairs for some of the input data allocated to a single MapRunner instance, and the JVM ran out of memory.

If my understanding of the implementation is correct, I am wondering whether there is a particular reason for taking this approach. A better approach (and I may be wrong here) would be to let the MapRunner keep track of the memory it is using and, if the allocations run too high, either:

1) dump the intermediate key-value pairs to disk itself, or
2) (the better option) call a new API provided by the MapTask that dumps the key-value pairs to disk and then passes control back to the MapRunner. The MapRunner would simply resume its task and ultimately return in the normal way.

I am suggesting this because other applications may also benefit if they are not restricted by this limitation. Please let me know your opinions on this. If this is not incorporated into the main Hadoop release, I intend to add it as a patch for my own applications. Do you see any obvious loopholes that I might have overlooked?

Thanks in advance for the help!

Regards,
Gaurav
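P.S. To make the second option more concrete, here is a minimal, self-contained sketch of a collector that tracks roughly how much it has buffered and spills to disk once a threshold is crossed. Every name in it (SpillingCollector, collect, spill, thresholdBytes) is hypothetical and is not an existing Hadoop API; the size accounting is a crude estimate (2 bytes per character), and keys and values are plain Strings only to keep the example short.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
class SpillingCollector {
    private final long thresholdBytes;                      // assumed tunable limit
    private final List<String[]> buffer = new ArrayList<>(); // in-memory key-value pairs
    private final List<File> spills = new ArrayList<>();     // spill files written so far
    private long bufferedBytes = 0;

    SpillingCollector(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Called from map() for every emitted pair. */
    void collect(String key, String value) throws IOException {
        buffer.add(new String[] {key, value});
        // Rough estimate: 2 bytes per char, ignoring object overhead.
        bufferedBytes += 2L * (key.length() + value.length());
        if (bufferedBytes >= thresholdBytes) {
            spill(); // write to disk, then control returns to the caller
        }
    }

    /** Writes the buffered pairs to a temp file and clears the buffer. */
    void spill() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        File f = File.createTempFile("map-spill-", ".tmp");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (String[] kv : buffer) {
                // writeUTF is limited to 65535 encoded bytes per string.
                out.writeUTF(kv[0]);
                out.writeUTF(kv[1]);
            }
        }
        spills.add(f);
        buffer.clear();
        bufferedBytes = 0;
    }

    int spillCount() { return spills.size(); }

    int bufferedPairs() { return buffer.size(); }
}
```

In this sketch the map loop would keep calling collect() as usual, and whoever drives the loop would call spill() once more after the last record to flush what remains. Sorting and merging the spill files is deliberately left out.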
