On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> Hi all,
>
> I'm looking to parallelize a workflow using MapReduce. The workflow can be
> summarized as follows:
>
> 1- Specify the list of paths of binary files to process in a configuration
> file (let's call this configuration file CONFIG). These binary files are
> stored in HDFS. This list of paths can vary from 1 file to 10000* files.
> 2- Process the list of files given in CONFIG: this is done by calling a
> command (let's call it commandX) and passing CONFIG as an option, something
> like: commandX CONFIG. commandX reads CONFIG and takes care of opening the
> files, processing them and then generating the output.
> 3- Merging... this step can be ignored for now.
>
> The only solutions I see for porting this workflow to MapReduce are:
>
> 1- Write map code which takes a list of paths as input and then calls
> commandX appropriately. But, AFAIK, the job will not be split and will run
> as a single MapReduce job over HDFS.
> 2- Read the input files, take the output of the read operation and pass it
> as input to the map code. This solution implies a deeper and more
> complicated modification of commandX.
>
> Any ideas, comments or suggestions would be appreciated.
Hi,

If you are looking for a Hadoop-oriented solution to this, here is my suggestion:

1. Create an HDFS directory with all your input files in it. If you don't want to do this, add each input file to the job individually (for example, by reading your CONFIG and calling FileInputFormat.addInputPath for each path).
2. Override FileInputFormat and return false from the isSplitable method; this causes each input file to be processed whole by a single mapper. See the sketch below.

I hope I understood your problem properly, and that my suggestion is the kind you were looking for.

Bibek
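In case a sketch helps, here is roughly what I mean (untested, using the org.apache.hadoop.mapreduce API; the class names "ConfigListDriver" and "WholeFileTextInputFormat", the mapper placeholder, and the assumption that CONFIG lists one HDFS path per line are mine, not anything Hadoop provides):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConfigListDriver {

    // Input format that refuses to split its files: each input file becomes
    // exactly one split and is therefore processed by a single map task.
    public static class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

    // args[0] = local CONFIG file listing one HDFS path per line (assumption),
    // args[1] = HDFS output directory.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process-config-files");
        job.setJarByClass(ConfigListDriver.class);
        job.setInputFormatClass(WholeFileTextInputFormat.class);
        // job.setMapperClass(YourMapper.class);  // mapper that invokes commandX on its file

        // Add every path listed in CONFIG as its own input path.
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                FileInputFormat.addInputPath(job, new Path(line.trim()));
            }
        }
        reader.close();

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With 10000 paths you would get 10000 splits and hence 10000 map tasks, which Hadoop schedules in parallel across the cluster, so the job is no longer a single monolithic run over HDFS.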