Hi all,
I'm looking to parallelize a workflow using MapReduce. The workflow
can be summarized as follows:
1- Specify the list of paths of the binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. The list of paths can vary from 1
file to 10000 files.
2- Process the list of files given in CONFIG. This is done by calling a
command (let's call it commandX) and giving CONFIG as an option, something
like: commandX CONFIG. commandX reads CONFIG, opens the files, processes
them and then generates the output.
3- Merging...this step can be ignored for now.
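For concreteness, today the workflow is run by hand roughly like this
(the two paths are made-up examples, and I'm showing CONFIG as simply
one HDFS path per line):

$ cat CONFIG
hdfs:///data/input/file-0001.bin
hdfs:///data/input/file-0002.bin
$ commandX CONFIG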
The only solutions I can see to port this workflow to MapReduce
are:
1- Write map code which takes a list of paths as input and then calls
commandX appropriately (a rough sketch of what I mean follows this list).
But, AFAIK, the input will not be split, so everything would run in a
single map task and I would lose the parallelism over HDFS.
2- Read the input files ourselves, get the output of that read operation
and pass it as input to the map code. This solution implies a deeper and
more complicated modification of commandX.
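Here is a rough, untested sketch of what I have in mind for solution 1,
assuming commandX can read the HDFS paths listed in CONFIG directly (if
not, the mapper would first have to copy the files to local disk). The
class name CommandXMapper, the per-task temporary CONFIG and the use of
NLineInputFormat are just my assumptions about how this could be wired up:

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CommandXMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Paths handed to this mapper (one input line = one HDFS path).
    private final List<String> paths = new ArrayList<String>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        paths.add(line.toString().trim());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Write a per-task CONFIG containing only this mapper's share of paths.
        File taskConfig = File.createTempFile("config-", ".txt");
        PrintWriter out = new PrintWriter(taskConfig);
        for (String p : paths) {
            out.println(p);
        }
        out.close();

        // Shell out to commandX exactly as in the manual workflow.
        Process proc = new ProcessBuilder("commandX", taskConfig.getAbsolutePath())
                .inheritIO()
                .start();
        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            throw new IOException("commandX failed with exit code " + exitCode);
        }

        // Emit a trivial record so something shows up in the job output.
        context.write(new Text(context.getTaskAttemptID().toString()),
                      new Text("processed " + paths.size() + " files"));
    }
}

On the driver side I would then set
job.setInputFormatClass(NLineInputFormat.class) and
NLineInputFormat.setNumLinesPerSplit(job, N), using the original CONFIG
(the list of paths) as the job input, so that each mapper receives N
paths. The merging step (3) could later go into a reducer, or be skipped
with job.setNumReduceTasks(0).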
Any ideas, comments or suggestions would be appreciated.
Thanks in advance for the help,
Hassen