On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> Hi all,
>
> I'm looking to parallelize a workflow using MapReduce. The workflow can be
> summarized as follows:
>
> 1- Specify the list of paths of binary files to process in a configuration
> file (let's call this configuration file CONFIG). These binary files are
> stored in HDFS. This list of paths can vary from 1 file to 10000* files.
> 2- Process the list of files given in CONFIG: this is done by calling a
> command (let's call it commandX) and passing CONFIG as an option, something
> like: commandX CONFIG. commandX reads CONFIG and takes care of opening the
> files, processing them and then generating the output.
> 3- Merging... this step can be ignored for now.
>
> The only solutions I see for porting this workflow to MapReduce are:
>
> 1- Write map code which takes a list of paths as input and then calls
> commandX appropriately. But, AFAIK, the job will not be split and will run
> as a single MapReduce job over HDFS.
> 2- Read the input files, take the output of the read operation and pass it
> as input to the map code. This solution implies a deeper and more
> complicated modification of commandX.
>
> Any ideas, comments or suggestions would be appreciated.
Hi,

If you are looking for a Hadoop-oriented solution to this, here is my suggestion:

1. Create an HDFS directory with all your input files in it. If you don't want to do this, add each input file to the job individually (for example, by reading your CONFIG and calling FileInputFormat.addInputPath for each path).
2. Override FileInputFormat and return false from the isSplitable method; this causes each input file to be processed whole by a single mapper. See the sketch below.

I hope I understood your problem properly, and that my suggestion is the kind you were looking for.

Bibek
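In case a sketch helps, here is roughly what I mean (untested, using the org.apache.hadoop.mapreduce API; the class names "ConfigListDriver" and "WholeFileTextInputFormat", the mapper placeholder, and the assumption that CONFIG lists one HDFS path per line are mine, not anything Hadoop provides):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConfigListDriver {

    // Input format that refuses to split its files: each input file becomes
    // exactly one split and is therefore processed by a single map task.
    public static class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

    // args[0] = local CONFIG file listing one HDFS path per line (assumption),
    // args[1] = HDFS output directory.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process-config-files");
        job.setJarByClass(ConfigListDriver.class);
        job.setInputFormatClass(WholeFileTextInputFormat.class);
        // job.setMapperClass(YourMapper.class);  // mapper that invokes commandX on its file

        // Add every path listed in CONFIG as its own input path.
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                FileInputFormat.addInputPath(job, new Path(line.trim()));
            }
        }
        reader.close();

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With 10000 paths you would get 10000 splits and hence 10000 map tasks, which Hadoop schedules in parallel across the cluster, so the job is no longer a single monolithic run over HDFS.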