Thanks Bobby for the reply! Please find comments inline.
If your input file is a list of paths, each one with \n at the end,
then a TextInputFormat would split them for you (one map call per line).
I would write it something like the following
Mapper {
  void map(LongWritable offset, Text path, OutputCollector collector) {
    Path p = new Path(path.toString());
    FileSystem fs = p.getFileSystem(getConf());
    InputStream in = fs.open(p);
    // Copy the file to a temporary location on local disk
Is this step mandatory? I mean, why can't I read the file directly from
HDFS and process it from there? What is the gain of copying it first?
    // Write out the temp file path to a CONFIG file
    try {
      Process commandx = Runtime.getRuntime().exec(new String[] {"commandX", "CONFIG"});
      // Read the output of commandx and send it to the collector
    } finally {
      // Delete the temp file
      // Delete the temp config
    }
  }
}
This would launch a new instance of commandX each time. This
assumes that commandX can process a single file at a time, and that
the startup overhead is not too big.
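The driver for that job would be roughly something like this (untested
sketch, old mapred API; PathMapper is just a placeholder name for the
mapper above, and NLineInputFormat is optional but guarantees exactly
one path per map task):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CommandXDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CommandXDriver.class);
    conf.setJobName("commandX-per-file");

    // Each line of the input file is one HDFS path; NLineInputFormat
    // hands every map task exactly N lines (here: one path per task).
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setMapperClass(PathMapper.class); // the mapper sketched above
    conf.setNumReduceTasks(0);             // map-only job
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // file listing the paths
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}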
This solution can be applied since commandX can process a single
file at a time and I will have a map task per file... and that is fine.
However, I can obtain the same result without map/reduce: I can write
a bash script which splits the work into one file per task (not a map
task... by task here I mean a piece of work) and then submits them to
the cluster for parallel execution. Is that right, or am I missing something?
I'm looking to benefit from map/reduce working in conjunction with
HDFS to optimize the workflow execution and the cluster/storage usage
(i.e. to benefit from the data locality optimization...). If I'm not wrong,
using map/reduce in conjunction with HDFS means that the splitting
step will result in one map task per HDFS block, and Hadoop will do its
best to run each map task on a node where its input data resides in HDFS.
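Just to check that understanding, I guess I could look at which
datanodes hold the blocks of a given input file with something like the
following sketch (untested):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path(args[0]);
    FileSystem fs = p.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(p);
    // One BlockLocation per HDFS block; getHosts() lists the datanodes
    // holding a replica, which is where Hadoop tries to schedule the map.
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
    }
  }
}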
If you do need all the files in one place, then you need to look more
deeply at what commandX is doing and see how it can be split up
into map/reduce.
Honestly, I would avoid this solution and consider it only as a last
resort, since I don't know how much time it would take and it might
result in modifying the whole framework...
Thanks
Hassen
--Bobby
On 6/22/11 7:51 AM, "Hassen Riahi" <hassen.ri...@cern.ch> wrote:
Hi all,
I'm looking to parallelize a workflow using mapReduce. The workflow
can be summarized as following:
1- Specify the list of paths of the binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. The list of paths can vary from 1
file to 10000+ files.
2- Process the list of files given in CONFIG: this is done by calling a
command (let's call it commandX) and giving CONFIG as an option, something
like: commandX CONFIG. commandX reads CONFIG, opens the files,
processes them and then generates the output.
3- Merging...this step can be ignored for now.
The only solutions that I see to port this workflow to mapReduce
are:
1- Write map code which takes as input the list of paths and then calls
commandX appropriately. But, AFAIK, the job will not be split and will
run as a single mapReduce job over HDFS.
2- Read the input files, then take the output of the read operation
and pass it as input to the map code. This solution implies a deeper and
more complicated modification of commandX.
Any ideas, comments or suggestions would be appreciated.
Thanks in advance for the help,
Hassen