Posting again to [email protected]. Thank you,
Jaliya

=====

Dear Hadoop devs,

Please help me figure out a way to program the following problem using Hadoop.

I have a program that I need to invoke in parallel using Hadoop. The program takes a binary input file and produces a binary output file:

    Input.bin -> prog.exe -> output.bin

The input data set is about 1 TB in size. Each input file is about 33 MB, so I have about 31,000 files. Each output file is about 9 KB.

I have implemented this with Hadoop in the following way. I keep the input data on a shared parallel file system (Lustre). I then collect the input file names and write them to a collection of files in HDFS (say hdfs_input_0.txt, ...), each containing roughly an equal number of URIs pointing to the original input files. The map task simply takes a string value, which is a URI to one of the original input data files, and executes the program as an external process (see sketch 1 at the end of this message). The output of the program is also written to the shared file system (Lustre).

The problem with this approach is that I am not exploiting the true benefit of MapReduce: the use of local disks. Could you please suggest a way to use local disks for the above problem? I thought of the following way, but would like to check with you whether there is a better one (see sketch 2 at the end of this message):

1. Upload the original data files to HDFS.
2. In the map task, read the data file as a binary object.
3. Save it to the local file system.
4. Call the executable.
5. Push the output from the local file system to HDFS.

Any suggestion is greatly appreciated.

Thank you,
Jaliya
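
Sketch 1 -- roughly what my current map task looks like, using the org.apache.hadoop.mapred API. This is only an illustrative sketch, not my real code: the prog.exe path, the Lustre paths, and the output naming scheme are all placeholders.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is one line of an hdfs_input_N.txt file, i.e. the
// Lustre path of one 33 MB input file.
public class LustreExecMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String in = value.toString();                        // e.g. /lustre/data/x_00042.bin
    String out = in.replaceFirst("\\.bin$", ".out.bin"); // placeholder naming scheme

    // Run the external program; input and output both live on the shared
    // file system, so the task's local disk is never touched.
    ProcessBuilder pb = new ProcessBuilder("/lustre/bin/prog.exe", in, out);
    pb.redirectErrorStream(true);
    Process p = pb.start();

    // Drain merged stdout/stderr so the child cannot block on a full pipe.
    InputStream is = p.getInputStream();
    byte[] buf = new byte[4096];
    while (is.read(buf) != -1) { }

    try {
      int rc = p.waitFor();
      if (rc != 0) {
        throw new IOException("prog.exe exited with " + rc + " for " + in);
      }
    } catch (InterruptedException e) {
      throw new IOException("interrupted waiting for prog.exe: " + in);
    }
    output.collect(value, new Text(out));                // record where the result went
  }
}

As you can see, the map call does no HDFS I/O at all; everything stays on Lustre.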

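Sketch 2 -- the local-disk approach from steps 1-5 above. Again only a sketch: the /tmp scratch directory, the /results output directory in HDFS, and the assumption that prog.exe is installed at the same path on every node are all placeholders.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is the HDFS path of one 33 MB input file.
public class LocalDiskExecMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem hdfs;

  public void configure(JobConf job) {
    try {
      hdfs = FileSystem.get(job);  // the default (HDFS) file system
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path src = new Path(value.toString());
    Path localIn = new Path("/tmp/" + src.getName());          // placeholder scratch dir
    Path localOut = new Path("/tmp/" + src.getName() + ".out");

    // Steps 2-3: pull the binary out of HDFS onto this node's local disk.
    hdfs.copyToLocalFile(src, localIn);

    // Step 4: run the executable on the local copies.
    ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/prog.exe",
        localIn.toString(), localOut.toString());
    pb.redirectErrorStream(true);
    Process p = pb.start();
    InputStream is = p.getInputStream();
    byte[] buf = new byte[4096];
    while (is.read(buf) != -1) { }       // drain so the child cannot block
    try {
      if (p.waitFor() != 0) {
        throw new IOException("prog.exe failed for " + src);
      }
    } catch (InterruptedException e) {
      throw new IOException("interrupted waiting for prog.exe: " + src);
    }

    // Step 5: push the ~9 KB result back into HDFS.
    Path dst = new Path("/results/" + src.getName() + ".out"); // placeholder output dir
    hdfs.copyFromLocalFile(localOut, dst);

    // Clean up the local scratch files.
    new File(localIn.toString()).delete();
    new File(localOut.toString()).delete();

    output.collect(value, new Text(dst.toString()));
  }
}

Here copyToLocalFile covers steps 2 and 3 in one call, and only the small 9 KB result travels back into HDFS.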