Mark,

Ideally the input to a MapReduce program is a splittable file; we want to be able to parallelize our data processing so that each map task only has to deal with a chunk of the input (typically around the HDFS block size). You can feed proprietary binary data into a MapReduce program, but you'll also need to create an InputFormat and a RecordReader class to let Hadoop know how to read it. An example of this would be:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/src/TVA/Hadoop/MapReduce/Historian/

where these classes let Hadoop read, split, and process binary archives of time series data.
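For illustration, a minimal version of such a pair of classes might look like the sketch below. It assumes a made-up binary format of fixed-length 128-byte records, so the class names, record length, and key choice are placeholders for your own format, not the openPDC classes at the link above:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthBinaryInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  // Hypothetical format detail: every record is exactly 128 bytes.
  public static final int RECORD_LENGTH = 128;

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FixedLengthRecordReader();
  }

  public static class FixedLengthRecordReader
      extends RecordReader<LongWritable, BytesWritable> {

    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();
    private final byte[] buffer = new byte[RECORD_LENGTH];

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(context.getConfiguration());
      in = fs.open(file);
      // Round the split's start up to a record boundary so adjacent
      // splits never process the same record twice or skip one.
      start = split.getStart();
      long remainder = start % RECORD_LENGTH;
      if (remainder != 0) {
        start += RECORD_LENGTH - remainder;
      }
      end = split.getStart() + split.getLength();
      pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos >= end) {
        return false; // records starting past our split belong to the next one
      }
      in.readFully(pos, buffer); // positioned read of one fixed-length record
      key.set(pos);              // key = byte offset of the record in the file
      value.set(buffer, 0, RECORD_LENGTH);
      pos += RECORD_LENGTH;
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
      return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }

    @Override
    public void close() throws IOException {
      if (in != null) in.close();
    }
  }
}

The two things to get right are aligning each split on a record boundary and reusing the key/value objects rather than allocating new ones per record.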
Another tip is to not keep state in your map tasks: you want to "stream" through the data, processing each k/v pair as you see it and then moving on. When dealing with large amounts of data, keeping state tends not to scale past a certain point. If you must keep state, it's easier to do so in the reduce task, as long as you bound how much data you cache in the reducer and make sure that data fits in the reduce task's child JVM heap.
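To make both points concrete, here is a rough sketch of a stateless mapper paired with a reducer that bounds how much it caches per key. The 4-byte id field and the cache cap are made-up assumptions for illustration, not anything from a real format:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Stateless mapper: decode one record, emit one k/v pair, move on.
// No fields accumulate data across calls to map(), so memory use is
// flat no matter how big the input split is.
public class RecordIdMapper
    extends Mapper<LongWritable, BytesWritable, Text, LongWritable> {

  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, BytesWritable record, Context context)
      throws IOException, InterruptedException {
    byte[] b = record.getBytes();
    // Hypothetical format detail: the first 4 bytes are a big-endian id.
    int id = ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
           | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
    outKey.set(Integer.toString(id));
    context.write(outKey, offset);
  }
}

// A reducer that does keep state, but bounds it: it caches at most
// MAX_CACHED distinct values per key, so its memory footprint stays
// inside the child JVM heap no matter how many values a key receives.
class DistinctOffsetReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  private static final int MAX_CACHED = 100000; // size this against the child's -Xmx

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    Set<Long> distinct = new HashSet<Long>();
    for (LongWritable v : values) {
      if (distinct.size() < MAX_CACHED) { // stop caching once the cap is hit
        distinct.add(v.get());
      }
    }
    // Note this undercounts if the cap is hit; the point is the bound.
    context.write(key, new LongWritable(distinct.size()));
  }
}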
Josh Patterson
Solutions Architect
Cloudera

On Fri, May 28, 2010 at 12:33 AM, Mark Kerzner <[email protected]> wrote:

> Hi,
>
> I need to put a binary file in map and then emit that map. I do it by
> encoding it as a string using Base64 encoding, so that's fine, but I am
> dealing with pretty large files, and I am running out of memory. That is
> because I read a complete file into memory. Is there a way to pass streams?
>
> Thank you,
> Mark