Hi all,
I have a streaming job running on ~300 GB of ASCII data in 3 large
files, where the mapper and reducer are Perl scripts. The mapper does
trivial data cleanup, and the reducer builds a hash and then iterates
over it writing output. The hash key is the first field in the data,
i.e. the same as the streaming map/reduce key. However, the nodes
become bogged down to the point of being unusable -- it looks like too
much data is being read into memory. I am relatively new to Hadoop,
so it's not clear to me how to ensure that the reduce tasks don't run
out of memory...
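
In case it helps, here is roughly the shape of the reducer -- a
simplified sketch, not the actual script, and the real parsing and
output format are a bit more involved:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Accumulate every streaming input line into a hash keyed on the
    # first (tab-separated) field, then iterate over the hash to emit
    # output. The whole reduce input ends up held in memory here.
    my %by_key;
    while ( my $line = <STDIN> ) {
        chomp $line;
        my ( $key, $value ) = split /\t/, $line, 2;
        $value = '' unless defined $value;
        push @{ $by_key{$key} }, $value;
    }
    for my $key ( sort keys %by_key ) {
        print join( "\t", $key, @{ $by_key{$key} } ), "\n";
    }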
Thanks for any help!