If your input file is a list of paths, each one terminated by \n, then TextInputFormat will split it into lines for you, so each map() call receives one path.
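
To make the record shape concrete: with line-oriented text input, the key passed to map() is the byte offset at which the line starts and the value is the line without its trailing newline. The following is only a plain-Java imitation of that behaviour, not Hadoop code; the class name LineRecords and the /data/*.bin paths are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {

    /** Splits a '\n'-terminated listing into (byte offset, line) pairs, in file order. */
    public static Map<Long, String> toRecords(String contents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : contents.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus the '\n' that terminated it.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical listing: one HDFS path per line.
        String listing = "/data/a.bin\n/data/b.bin\n/data/c.bin\n";
        toRecords(listing).forEach((off, path) -> System.out.println(off + "\t" + path));
    }
}
```

Each pair corresponds to one map() call, so the mapper only ever has to deal with a single path at a time.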

I would write the mapper something like the following:

    Mapper {
      void map(Long offset, String path, collector) {
        Path p = new Path(path);
        FileSystem fs = p.getFileSystem(getConf());
        fs.open(p);
        // Copy the file to a temporary location
        // Write out the temp file path to a CONFIG file
        try {
          Process commandx = Runtime.getRuntime().exec(new String[] {"commandX", "CONFIG"});
          // Read output from commandx and send it to collector
        } finally {
          // Delete the temp file
          // Delete the temp config
        }
      }
    }

This would launch a new instance of commandX for each input path. It assumes that commandX can process a single file at a time, and that the startup overhead is not too big. If you do need all the files in one place, then you need to look more deeply at what commandX is doing and see how that can be split up into map/reduce.

--Bobby

On 6/22/11 7:51 AM, "Hassen Riahi" <hassen.ri...@cern.ch> wrote:

Hi all,

I'm looking to parallelize a workflow using MapReduce. The workflow can be summarized as follows:

1- Specify the list of paths of the binary files to process in a configuration file (let's call this configuration file CONFIG). These binary files are stored in HDFS. This list of paths can vary from 1 file to 10000* files.

2- Process the list of files given in CONFIG: this is done by calling a command (let's call it commandX) and giving CONFIG as an option, something like: commandX CONFIG. CommandX reads CONFIG and takes care of opening the files, processing them, and then generating the output.

3- Merging... this step can be ignored for now.

The only solutions that I can see to port this workflow to MapReduce are:

1- Write map code which takes a list of paths as input and then calls commandX appropriately. But, AFAIK, the job will not be split and will run as a single MapReduce job over HDFS.

2- Read the input files, then take the output of the read operation and pass it as input to the map code. This solution implies a deeper and more complicated modification of commandX.

Any ideas, comments or suggestions would be appreciated.

Thanks in advance for the help,
Hassen
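
As a follow-up to the mapper outline in the reply above, here is a minimal sketch of its try/finally step in plain Java, without any Hadoop classes. It assumes the file has already been copied out of HDFS to a local path and that the command accepts the config path as its only argument, as in "commandX CONFIG". The class and method names (PerFileCommandRunner, runOnFile) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PerFileCommandRunner {

    /**
     * Writes a one-line temporary config naming localFile, runs
     * "command configPath", and returns the command's output lines.
     * The temporary config is deleted even if the command fails.
     */
    public static List<String> runOnFile(String command, Path localFile)
            throws IOException, InterruptedException {
        Path config = Files.createTempFile("CONFIG", ".txt");
        try {
            Files.write(config, (localFile + "\n").getBytes(StandardCharsets.UTF_8));

            ProcessBuilder pb = new ProcessBuilder(command, config.toString());
            pb.redirectErrorStream(true); // merge stderr into stdout so one reader drains both
            Process proc = pb.start();

            // Drain the output before waiting, so the child cannot block on a full pipe.
            List<String> output = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.add(line); // in a mapper, this is where the line would go to the collector
                }
            }

            int exit = proc.waitFor();
            if (exit != 0) {
                throw new IOException(command + " exited with status " + exit);
            }
            return output;
        } finally {
            Files.deleteIfExists(config);
        }
    }
}
```

Inside map(), the returned lines would be passed to the collector, and the local copy of the input file would be deleted in the same finally block as the config.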