Hi all,
I have a streaming job running on ~300 GB of ASCII data in 3 large
files, where the mapper and reducer are Perl scripts. The mapper does
trivial data cleanup, and the reducer builds a hash and then iterates
over it writing output. The hash key is the first field in the data,
i.e. the same as the streaming map/reduce key. However, the nodes
become bogged down to the point of being unusable -- it looks like too
much data is being read into memory. I am relatively new to Hadoop,
so it's not clear to me how to ensure that the reduce tasks don't run
out of memory...
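
In case it helps, here is roughly the shape of the reducer -- a
simplified sketch, not the actual script, and the real parsing and
output format are a bit more involved:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Accumulate every streaming input line into a hash keyed on the
    # first (tab-separated) field, then iterate over the hash to emit
    # output. The whole reduce input ends up held in memory here.
    my %by_key;
    while ( my $line = <STDIN> ) {
        chomp $line;
        my ( $key, $value ) = split /\t/, $line, 2;
        $value = '' unless defined $value;
        push @{ $by_key{$key} }, $value;
    }
    for my $key ( sort keys %by_key ) {
        print join( "\t", $key, @{ $by_key{$key} } ), "\n";
    }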
Thanks for any help!