On Wed, Jun 22, 2011 at 5:00 PM, Bibek Paudel <eternalyo...@gmail.com> wrote:
> On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
>> Hi all,
>>
>> I'm looking to parallelize a workflow using MapReduce. The workflow can be
>> summarized as follows:
>>
>> 1- Specify the list of paths of the binary files to process in a configuration
>> file (let's call this configuration file CONFIG). These binary files are
>> stored in HDFS. This list of paths can vary from 1 file to 10000* files.
>> 2- Process the list of files given in CONFIG: this is done by calling a
>> command (let's call it commandX) and passing CONFIG as an option, something
>> like: commandX CONFIG. CommandX reads CONFIG and takes care of opening the
>> files, processing them and then generating the output.
>> 3- Merging... this step can be ignored for now.
>>
>> The only solutions I see for porting this workflow to MapReduce are:
>>
>> 1- Write map code which takes a list of paths as input and then calls
>> commandX appropriately. But, AFAIK, the input will not be split and
>> everything will run in a single map task over HDFS.
>> 2- Read the input files, then take the output of the read operation and
>> pass it as input to the map code. This solution implies a deeper and more
>> complicated modification of commandX.
>>
>> Any ideas, comments or suggestions would be appreciated.
>
> Hi,
> If you are looking for a Hadoop-oriented solution to this, here is my
> suggestion:
>
> 1. Create an HDFS directory with all your input files in it. If you
> don't want to do this, create a JobConf object and add each input file
> to it (maybe by reading your CONFIG).
>
> 2. Override FileInputFormat and return false from the isSplitable method;
> this causes each input file to be processed by a single mapper.
>
Of course, the third step would be running CommandX from the mapper :)
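
For completeness, here is a rough, untested sketch of how steps 2 and 3 could fit together with the new org.apache.hadoop.mapreduce API. The class names (WholeFileInputFormat, CommandXMapper) are just placeholders, and it assumes commandX can open HDFS paths directly (e.g. through a FUSE mount); otherwise copy each file to the task's local directory with FileSystem.copyToLocalFile first.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Step 2: one whole file per mapper. Each "record" is just the file's path,
// since commandX does the actual reading of the binary data.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split, so each input file goes to one map task
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            private final Text path = new Text();
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                path.set(((FileSplit) s).getPath().toString());
            }

            @Override
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;            // emit exactly one record per file
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public Text getCurrentValue()       { return path; }
            @Override public float getProgress()          { return done ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}

// Step 3: the mapper writes a one-line CONFIG for its file and shells out to commandX.
class CommandXMapper extends Mapper<NullWritable, Text, Text, NullWritable> {

    @Override
    protected void map(NullWritable key, Text filePath, Context context)
            throws IOException, InterruptedException {
        // Per-task CONFIG containing just this file's path; adjust to the real format.
        File config = File.createTempFile("config", ".txt");
        FileWriter w = new FileWriter(config);
        w.write(filePath.toString() + "\n");
        w.close();

        Process p = new ProcessBuilder("commandX", config.getAbsolutePath())
                .redirectErrorStream(true)
                .start();

        // Drain the child's output so it cannot block on a full pipe;
        // the lines end up in the task's stderr log.
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            System.err.println("commandX: " + line);
        }

        if (p.waitFor() != 0) {
            throw new IOException("commandX failed on " + filePath);
        }
        context.write(filePath, NullWritable.get());
    }
}

In the driver you would then point FileInputFormat.addInputPath at the directory from step 1 and call job.setInputFormatClass(WholeFileInputFormat.class) and job.setMapperClass(CommandXMapper.class); no reducer is needed until the merging step.

-b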