Why does MapRunner collect all intermediate key-value in memory?

Gaurav Agarwal Wed, 14 Mar 2007 14:37:18 -0800

Hi all,

I have started using Hadoop for a few of my Natural Language Processing
applications. I ran into a problem: my programs were throwing
OutOfMemoryError during the Map phase.


I looked into the implementation and noticed that all the intermediate
key-value pairs are collected in memory for the entire duration of any single
MapRunner instance. As I understand from reading the code, the MapRunner
keeps calling the user-defined map() method for all the key-value pairs
assigned to it by the MapTask. The MapTask checks whether it should dump
the intermediate key-value pairs to disk only after the MapRunner.run()
method has returned.

I was facing problems because, due to the nature of this application, I
ended up emitting too many intermediate key-value pairs for some of the
input data allocated to a single MapRunner instance. This caused the JVM
to run out of memory.

If my understanding of the implementation is correct, then I am wondering if
there is any particular reason to take this approach. A better approach (and
I may be wrong here) would be to let MapRunner keep track of the memory it
has been utilizing and if the allocations run too high then it should:

1) Either dump the intermediate key-value pairs to disk itself, OR
2) As a better option, call a (new) API provided by the MapTask that
would dump the key-value pairs to disk and then pass control back to
the MapRunner. The MapRunner would simply resume the task and ultimately
return in the normal way.
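
To make option 1 concrete, here is a minimal sketch of a collector that
tracks a rough estimate of the memory it holds and spills to disk once a
threshold is crossed. The class SpillingCollector, its methods, and the
tab-separated spill format are all hypothetical and only for illustration;
none of them are part of the Hadoop API, and the size estimate is
deliberately crude.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector for illustration only; not part of the Hadoop API. */
class SpillingCollector {
    private final Path spillDir;
    private final long maxBufferedBytes;
    private final List<String[]> buffer = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private long bufferedBytes = 0;

    SpillingCollector(Path spillDir, long maxBufferedBytes) {
        this.spillDir = spillDir;
        this.maxBufferedBytes = maxBufferedBytes;
    }

    /** Would be called from map() for every emitted key-value pair. */
    void collect(String key, String value) throws IOException {
        buffer.add(new String[] {key, value});
        // Rough estimate: 2 bytes per char, ignoring object overhead.
        bufferedBytes += 2L * (key.length() + value.length());
        if (bufferedBytes >= maxBufferedBytes) {
            spill();
        }
    }

    /** Writes the buffered pairs to a new spill file and frees the buffer. */
    void spill() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        Path file = Files.createTempFile(spillDir, "spill-", ".txt");
        try (BufferedWriter out =
                 Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            // Assumes keys and values contain no tabs or newlines.
            for (String[] pair : buffer) {
                out.write(pair[0]);
                out.write('\t');
                out.write(pair[1]);
                out.newLine();
            }
        }
        spillFiles.add(file);
        buffer.clear();
        bufferedBytes = 0;
    }

    int bufferedPairs() {
        return buffer.size();
    }

    List<Path> spillFiles() {
        return spillFiles;
    }
}
```

With something like this, map() would keep calling collect() as usual, and
a final spill() after the last record would write out whatever remains.
Option 2 would move the spill() logic behind an API owned by the MapTask.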

I am suggesting this approach because other applications may also
benefit if they are not restricted by this limitation.

Please let me know what your opinions are on this. If this is not
incorporated into the main Hadoop release, then I intend to add it as a
patch for my applications. Do you see any obvious loopholes that I might
have overlooked?

Thanks in advance for the help!

Regards
Gaurav 
-- 
View this message in context: 
http://www.nabble.com/Why-does-MapRunner-collect-all-intermediate-key-value-in-memory--tf3405027.html#a9484185
Sent from the Hadoop Dev mailing list archive at Nabble.com.
