If your input file is a list of paths, each one terminated by \n, then
TextInputFormat would split it into lines for you, one path per call to map.
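
To make the record shape concrete: with TextInputFormat the map key is the byte offset of the line within the file and the value is the line itself. The following is only a plain-Java imitation of that behaviour, not Hadoop code; the class name LineRecords and the /data/... paths are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: imitates the (byte offset, line) records that a
// line-oriented input format hands to map(). It does not use Hadoop.
final class LineRecords {

    /** Splits newline-terminated text into (byte offset, line) records. */
    static Map<Long, String> records(String contents) {
        Map<Long, String> result = new LinkedHashMap<>();
        if (contents.isEmpty()) {
            return result;
        }
        long offset = 0;
        for (String line : contents.split("\n")) {
            result.put(offset, line);
            // Advance past the line's bytes plus the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up path list standing in for the input file.
        String pathList = "/data/a.bin\n/data/b.bin\n";
        for (Map.Entry<Long, String> e : records(pathList).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each entry corresponds to one map call, so the offset can be ignored and only the path matters.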

I would write it something like the following (pseudocode):

Mapper {

  void map(Long offset, String path, collector) {
    Path p = new Path(path);
    FileSystem fs = p.getFileSystem(getConf());
    InputStream in = fs.open(p);  // stream over the file in HDFS
    // Copy the file to a temporary local location
    // Write out the temp file path to a CONFIG file
    try {
      // Runtime.exec is an instance method; pass the arguments as an array
      Process commandX =
          Runtime.getRuntime().exec(new String[] {"commandX", "CONFIG"});
      // Read output from commandX and send it to collector
    } finally {
      // Delete the temp file
      // Delete the temp config
    }
  }
}

This would launch a new instance of commandX for each input path.  This
assumes that commandX can process a single file at a time, and that the
startup overhead is not too big.  If you do need all the files in one place
then you need to look more deeply at what commandX is doing and see how that
can be split up into map/reduce.
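
The local part of the map call (write a one-line config, run the command, read its output, clean up) could look roughly like this in plain Java. It is a sketch under assumptions: the class and method names (CommandRunner, runWithConfig) and the temp file prefix are invented here, the input is assumed to have already been copied out of HDFS to a local file, and commandX is assumed to take the config path as its only argument.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of commandX.
final class CommandRunner {

    /**
     * Writes the path of localInput into a temporary config file, runs
     * "command <config>", and returns the lines the command printed.
     */
    static List<String> runWithConfig(String command, Path localInput)
            throws IOException, InterruptedException {
        Path config = Files.createTempFile("commandx-", ".cfg");
        try {
            // The config lists exactly one input path.
            Files.write(config,
                    Collections.singletonList(localInput.toAbsolutePath().toString()),
                    StandardCharsets.UTF_8);

            Process process = new ProcessBuilder(command, config.toString())
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();

            List<String> output = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.add(line); // in a mapper this would go to the collector
                }
            }

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException(command + " exited with status " + exitCode);
            }
            return output;
        } finally {
            Files.deleteIfExists(config); // always remove the temp config
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CommandRunner <command> <local-input-file>
        for (String line : runWithConfig(args[0], Paths.get(args[1]))) {
            System.out.println(line);
        }
    }
}
```

With cat standing in for commandX on a Unix machine, the output is just the contents of the config file, which is a quick way to check the plumbing before the real command is wired in.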

--Bobby

On 6/22/11 7:51 AM, "Hassen Riahi" <hassen.ri...@cern.ch> wrote:

Hi all,

I'm looking to parallelize a workflow using MapReduce. The workflow
can be summarized as follows:

1- Specify the list of paths of binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. This list of paths can vary from 1
file to 10000* files.
2- Process the list of files given in CONFIG: this is done by calling a
command (let's call it commandX) and giving CONFIG as an option,
something like: commandX CONFIG. commandX reads CONFIG and takes care of
opening the files, processing them and then generating the output.
3- Merging...this step can be ignored for now.

The only solutions that I'm seeing to port this workflow to mapReduce
are:

1- Write map code which takes a list of paths as input and then calls
commandX appropriately. But, AFAIK, the job will not be split and will
run as a single MapReduce job over HDFS.
2- Read the input files, then take the output of the read operation
and pass it as input to the map code. This solution implies a deeper
and more complicated modification of commandX.

Any ideas, comments or suggestions would be appreciated.

Thanks in advance for the help,
Hassen
