On Wed, Jun 22, 2011 at 5:00 PM, Bibek Paudel <eternalyo...@gmail.com> wrote:
> On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
>> Hi all,
>>
>> I'm looking to parallelize a workflow using mapReduce. The workflow can be
>> summarized as follows:
>>
>> 1- Specify the list of paths of the binary files to process in a configuration
>> file (let's call this configuration file CONFIG). These binary files are
>> stored in HDFS. The list of paths can vary from 1 file to 10000* files.
>> 2- Process the list of files given in CONFIG: this is done by calling a
>> command (let's call it commandX) and giving CONFIG as an option, something like:
>> commandX CONFIG. CommandX reads CONFIG and takes care of opening the files,
>> processing them and then generating the output.
>> 3- Merging...this step can be ignored for now.
>>
>> The only solutions I see for porting this workflow to mapReduce are:
>>
>> 1- Write map code which takes a list of paths as input and then calls
>> commandX appropriately. But, AFAIK, the work will not be split and will run
>> as a single map over HDFS.
>> 2- Read the input files, then take the output of the read operation and
>> pass it as input to the map code. This solution implies a deeper and more
>> complicated modification of commandX.
>>
>> Any ideas, comments or suggestions would be appreciated.
>
> Hi,
> If you are looking for a Hadoop-oriented solution to this, here is my
> suggestion:
>
> 1. Create an HDFS directory with all your input files in it. If you
> don't want to do this, create a JobConf object and add each input file
> to it (maybe by reading your CONFIG).
>
> 2. Override FileInputFormat and return false from the isSplitable() method -
> this causes each input file to be processed by a single mapper.
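
Something like this for points 1 and 2 (an untested sketch against the old
mapred API; the class name, the paths and the mapper class are made-up
placeholders):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Point 2: never split a file, so each input file is handled by exactly one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    // Point 1: a minimal driver. /user/hassen/binaries and /user/hassen/out are
    // placeholder paths; CommandXMapper is sketched further down.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WholeFileTextInputFormat.class);
        conf.setJobName("commandx-runner");
        conf.setInputFormat(WholeFileTextInputFormat.class);
        conf.setMapperClass(CommandXMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        conf.setNumReduceTasks(0);   // merging is left out, as in your step 3
        FileInputFormat.addInputPath(conf, new Path("/user/hassen/binaries"));
        // ...or loop over the paths read from CONFIG and call addInputPath once per file.
        FileOutputFormat.setOutputPath(conf, new Path("/user/hassen/out"));
        JobClient.runJob(conf);
    }
}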
>

Of course, the third step would be running CommandX from the mapper :)
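
A rough, untested sketch of such a mapper (old mapred API again; CommandXMapper is
a made-up name, and it assumes commandX can open the HDFS path it is given, e.g.
through a FUSE mount; otherwise copy the file to the task's local directory first.
For really big binaries you would also want an input format that skips reading the
records entirely, but this shows the idea):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each map task writes a one-line CONFIG naming the file behind its split and
// shells out to commandX once the split has been consumed.
public class CommandXMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    private String inputFile;

    @Override
    public void configure(JobConf job) {
        // Old-API job property holding the path of the file this task is processing.
        inputFile = job.get("map.input.file");
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
        // Record contents are ignored; commandX re-reads the file itself.
    }

    @Override
    public void close() throws IOException {
        File config = File.createTempFile("commandx", ".cfg");
        FileWriter writer = new FileWriter(config);
        writer.write(inputFile + "\n");
        writer.close();
        Process p = Runtime.getRuntime().exec(
                new String[] { "commandX", config.getAbsolutePath() });
        try {
            if (p.waitFor() != 0) {
                throw new IOException("commandX failed on " + inputFile);
            }
        } catch (InterruptedException e) {
            throw new IOException("interrupted while waiting for commandX");
        }
    }
}

With zero reduce tasks the job stops after commandX has run on every file; the
merging from your step 3 could later be done in a reduce or in a second job.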

-b
