Thanks Bobby for the reply! Please find comments inline.
If your input file is a list of paths, each one with \n at the end,
then a TextInputFormat would split them for you (one map call per line).
I would write it something like the following
Mapper {
  void map(LongWritable offset, Text path, OutputCollector collector) {
    Path p = new Path(path.toString());
    FileSystem fs = p.getFileSystem(getConf());
    InputStream in = fs.open(p);
    // Copy the file to a temporary location on local disk
Is this step mandatory? I mean, why can't I read the file directly from
HDFS and process it from there? What is the gain of copying it first?
    // Write out the temp file path to a CONFIG file
    try {
      Process commandx = Runtime.getRuntime().exec(new String[] {"commandX", "CONFIG"});
      // Read the output of commandx and send it to the collector
    } finally {
      // Delete the temp file
      // Delete the temp config
    }
  }
}
This would launch a new instance of commandX each time. This
assumes that commandX can process a single file at a time, and that
the startup overhead is not too big.
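The driver for that job would be roughly something like this (untested
sketch, old mapred API; PathMapper is just a placeholder name for the
mapper above, and NLineInputFormat is optional but guarantees exactly
one path per map task):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CommandXDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CommandXDriver.class);
    conf.setJobName("commandX-per-file");

    // Each line of the input file is one HDFS path; NLineInputFormat
    // hands every map task exactly N lines (here: one path per task).
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setMapperClass(PathMapper.class); // the mapper sketched above
    conf.setNumReduceTasks(0);             // map-only job
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // file listing the paths
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}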
This solution can be applied since commandX can process a single
file at a time and I will have a map task per file... and that is fine.
However, I can obtain the same result without map/reduce: I can write
a bash script which splits the work into one file per task (not a map
task... by task here I mean a piece of work) and then submits them to
the cluster for parallel execution. Is that right, or am I missing something?
I'm looking to benefit from map/reduce working in conjunction with
HDFS to optimize the workflow execution and the cluster/storage usage
(i.e. to benefit from the data locality optimization...). If I'm not wrong,
using map/reduce in conjunction with HDFS means that the splitting
step will result in one map task per HDFS block, and Hadoop will do its
best to run each map task on a node where its input data resides in HDFS.
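Just to check that understanding, I guess I could look at which
datanodes hold the blocks of a given input file with something like the
following sketch (untested):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path(args[0]);
    FileSystem fs = p.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(p);
    // One BlockLocation per HDFS block; getHosts() lists the datanodes
    // holding a replica, which is where Hadoop tries to schedule the map.
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
    }
  }
}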
If you do need all the files in one place, then you need to look more
deeply at what commandX is doing and see how it can be split up
into map/reduce.
Honestly, I would avoid this solution and consider it only as a last
resort, since I don't know how much time it would take and it might
result in modifying the whole framework...
Thanks
Hassen
--Bobby
On 6/22/11 7:51 AM, "Hassen Riahi" <hassen.ri...@cern.ch> wrote:
Hi all,
I'm looking to parallelize a workflow using mapReduce. The workflow
can be summarized as following:
1- Specify the list of paths of the binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. The list of paths can vary from 1
file to 10000+ files.
2- Process the list of files given in CONFIG: this is done by calling a
command (let's call it commandX) and giving CONFIG as an option, something
like: commandX CONFIG. commandX reads CONFIG, opens the files,
processes them and then generates the output.
3- Merging...this step can be ignored for now.
The only solutions that I see to port this workflow to mapReduce
are:
1- Write map code which takes as input the list of paths and then calls
commandX appropriately. But, AFAIK, the job will not be split and will
run as a single mapReduce job over HDFS.
2- Read the input files, then take the output of the read operation
and pass it as input to the map code. This solution implies a deeper and
more complicated modification of commandX.
Any ideas, comments or suggestions would be appreciated.
Thanks in advance for the help,
Hassen