Hi all,

I'm looking to parallelize a workflow using MapReduce. The workflow can be summarized as follows:

1- Specify the list of paths of the binary files to process in a configuration file (let's call this configuration file CONFIG). These binary files are stored in HDFS. The list of paths can vary from 1 file to 10000 files.
2- Process the files listed in CONFIG. This is done by calling a command (let's call it commandX) with CONFIG as an option, something like: commandX CONFIG (see the small example after this list). commandX reads CONFIG, opens the files, processes them and generates the output.
3- Merging... this step can be ignored for now.
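
To give an idea of steps 1 and 2, think of CONFIG as simply a list of HDFS paths (the paths below are made up, and the exact format doesn't matter much here):

  hdfs://namenode/data/input/file_0001.bin
  hdfs://namenode/data/input/file_0002.bin
  ...

and the whole thing is then processed with:

  commandX CONFIG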

The only solutions I can see for porting this workflow to MapReduce are:

1- Write map code that takes a list of paths as input and then calls commandX appropriately (see the sketch after this list). But, AFAIK, the job will not be split and will effectively run as a single map task over HDFS.
2- Read the input files themselves, and pass the output of that read operation as input to the map code. This solution implies a deeper and more complicated modification of commandX.
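
To make solution 1 a bit more concrete, below is roughly the kind of map code I have in mind (a completely untested sketch; the class name is a placeholder and I'm assuming commandX is installed on every node): each map() call would receive one path from the list, write a tiny local CONFIG with just that path, and shell out to commandX on it.

// Rough, untested sketch of "map code that calls commandX".
// Each map() call gets one line of the path list (i.e. one HDFS path),
// writes it to a small local CONFIG file and runs commandX on it.
// The class name and the "commandX" binary are placeholders.
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CommandXMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String hdfsPath = value.toString().trim();
        if (hdfsPath.isEmpty()) {
            return;                      // skip blank lines in the path list
        }

        // Write a one-line CONFIG for this single path in the task's local tmp dir.
        File localConfig = File.createTempFile("commandx-config", ".txt");
        Files.write(localConfig.toPath(),
                    hdfsPath.getBytes(StandardCharsets.UTF_8));

        // Run commandX on that small CONFIG; its stdout/stderr go to the task logs.
        Process p = new ProcessBuilder("commandX", localConfig.getAbsolutePath())
                .inheritIO()
                .start();
        int exitCode = p.waitFor();

        // Emit the path and the exit code so failures show up in the job output.
        context.write(new Text(hdfsPath), new Text("exit=" + exitCode));
    }
}

In the driver I'd point the job's input at the path list and, I guess, use something like NLineInputFormat so the list itself gets split across mappers, but that's exactly the part I'm unsure about: whether the job really gets split this way, or whether the whole list ends up being handled by a single mapper.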

Any ideas, comments or suggestions would be appreciated.

Thanks in advance for the help,
Hassen
