Thanks Bobby for the reply! Please find comments inline.

If your input file is a list of paths, each one with a \n at the end, then TextInputFormat would split them for you.

I would write it something like the following

public class CommandXMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text path, Context context)
      throws IOException, InterruptedException {
    Path p = new Path(path.toString());
    FileSystem fs = p.getFileSystem(context.getConfiguration());
    FSDataInputStream in = fs.open(p);
    //Copy the file to a temporary location on the local disk
Is this step mandatory? I mean, why can't I read the file directly from HDFS and process it from there? What is the gain in doing this step?

    //Write the path of the local temp file out to a CONFIG file
    try {
      Process commandX = Runtime.getRuntime().exec(new String[] {"commandX", "CONFIG"});
      //Read the output from commandX and write it to the context
    } finally {
      //Delete the temp file
      //Delete the temp config
    }
  }
}



This would launch a new instance of commandX each time. This assumes that commandX can process a single file at a time, and that the startup overhead is not too big.
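Just to check that I follow your sketch, the job setup on my side would then be roughly the following (an untested sketch on my part; I'm assuming the new org.apache.hadoop.mapreduce API, CommandXMapper is the mapper sketch above, and the CommandXJob name and the input/output paths taken from args are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CommandXJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "commandX over a list of HDFS paths");
    job.setJarByClass(CommandXJob.class);

    //The job input is the file listing the HDFS paths, one per line;
    //TextInputFormat hands each line (i.e. one path) to a map() call.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(CommandXMapper.class);
    job.setNumReduceTasks(0);   //map-only; the merging step is ignored for now
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}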

This solution can be applied since commandX can process a single file at a time and I will have a map task per file... and that is fine. However, I can obtain the same result without map/reduce: I can write a bash script which splits the work into one file per task (not a map task; by task here I mean a piece of work) and then submits the tasks to the cluster for parallel execution. Is that right, or am I missing something?
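By the way, to really get the "one map task per file" behaviour I mention above, I suppose I would need something like NLineInputFormat with one line per split rather than the plain TextInputFormat. This is just an assumption on my side, and only valid if the Hadoop version ships org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; the helper class below is only illustrative:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class OneMapPerPath {
  //Configure the job so that every line of the path-list file
  //(i.e. every listed binary file) becomes its own map task.
  public static void configureInput(Job job, Path pathListFile) throws IOException {
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    FileInputFormat.addInputPath(job, pathListFile);
  }
}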

I'm looking to benefit from map/reduce working in conjunction with HDFS to optimize workflow execution and cluster/storage usage (i.e. to benefit from the data locality optimization...). If I'm not wrong, using map/reduce in conjunction with HDFS means that the splitting step will result in one map task per HDFS block, and Hadoop will do its best to run each map task on a node where its input data resides in HDFS.

If you do need all the files in one place, then you need to look more deeply at what commandX is doing and see how that can be split up into map/reduce.

Frankly, I would avoid this solution and consider it only as a last resort, since I don't know how much time it would take and it might end up requiring modifications to the whole framework...

Thanks
Hassen


--Bobby

On 6/22/11 7:51 AM, "Hassen Riahi" <hassen.ri...@cern.ch> wrote:

Hi all,

I'm looking to parallelize a workflow using mapReduce. The workflow
can be summarized as follows:

1- Specify the list of paths of the binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. This list of paths can vary from 1
file to 10000* files.
2- Process the list of files given in CONFIG: this is done by calling a
command (let's call it commandX) and giving CONFIG as an option, something
like: commandX CONFIG. commandX reads CONFIG and takes care of opening
the files, processing them and then generating the output.
3- Merging...this step can be ignored for now.

The only solutions that I'm seeing to port this workflow to mapReduce
are:

1- Write map code which takes as input a list of paths and then calls
commandX appropriately. But, AFAIK, the job will not be split and will
run as a single mapReduce job over HDFS.
2- Read the input files, then take the output of the read operation
and pass it as input to the map code. This solution implies a deeper and
more complicated modification of commandX.

Any ideas, comments or suggestions would be appreciated.

Thanks in advance for the help,
Hassen

