Hi all,

I'm trying to use Hadoop MapReduce (new API) in a particular way. What I would like to do is make it work with an external executable that was not written for MapReduce (but is able to read from HDFS), and with binary input.

The idea is to store the binary input files on HDFS and then run a MapReduce job specifying these files as input. Once a map task has landed on a node, I would like to prevent it from reading its input data; instead, it should spawn the precompiled executable, which loads the input data from HDFS by itself. This way the MapReduce framework will already have taken care of placing the mapper, and consequently the spawned binary, as close as possible to the data.

I do not want to run a reduce phase; the aggregation (very fast for my computational problem) will be done by a simple script that downloads the outputs from HDFS (after they have been uploaded there by the spawned binary).

I made some tests:

I can obtain the file currently being "analyzed" by the mapper (in order to pass it to the spawned binary) using:

  // Inside map(): recover the HDFS path of the file backing this split,
  // so it can be handed to the spawned executable.
  FileSplit fileSplit = (FileSplit) context.getInputSplit();
  String sFileName = fileSplit.getPath().toString();
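
Then, to actually hand that path to the executable, I was thinking of something like this inside map() (untested; the executable path "/usr/local/bin/analyzer" and the convention that it takes the HDFS path as its first argument are just placeholders I made up):

  ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/analyzer", sFileName);
  pb.redirectErrorStream(true);           // merge stderr into stdout
  Process proc = pb.start();
  // Drain the child's output so it cannot block on a full pipe buffer.
  BufferedReader out = new BufferedReader(
      new InputStreamReader(proc.getInputStream()));
  for (String line; (line = out.readLine()) != null; ) {
    System.err.println(line);             // ends up in the task's logs
  }
  if (proc.waitFor() != 0) {
    throw new IOException("analyzer exited with non-zero status");
  }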

I can prevent the binary input files from being split by overriding the "isSplitable" method in a custom InputFormat (performance-wise this should be fine: the files will usually be smaller than the block size). A sketch follows.
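
This is roughly what I have in mind (untested; "WholeFileInputFormat" and "NoReadRecordReader" are names I made up, and the reader is sketched further down):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  // One unsplit file per mapper; the mapper never reads the contents.
  public class WholeFileInputFormat
      extends FileInputFormat<NullWritable, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false;  // never split, even if a file exceeds the block size
    }

    @Override
    public RecordReader<NullWritable, NullWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) {
      return new NoReadRecordReader();
    }
  }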

But I don't know how to prevent the map task from reading its input file:

I was thinking about something like defining a new RecordReader whose single record is delimited by the end of file, so that in the map() function of the mapper I can spawn the binary. But will this cause the entire file to be loaded into memory?
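
Or, maybe better: a RecordReader that emits a single dummy record and never opens the file at all, so nothing should end up in memory. An untested sketch (again, the class name is mine):

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;

  // Emits exactly one empty record per split and never opens the file,
  // so map() runs once per file without the framework reading any data.
  public class NoReadRecordReader
      extends RecordReader<NullWritable, NullWritable> {

    private boolean emitted = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      // Deliberately do NOT open the underlying file here.
    }

    @Override
    public boolean nextKeyValue() {
      if (emitted) return false;
      emitted = true;  // single dummy record -> map() is called exactly once
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public NullWritable getCurrentValue() { return NullWritable.get(); }

    @Override
    public float getProgress() { return emitted ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }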

Or, is there a way to tell the MapReduce framework not to automatically push the input to the map task, but instead to let the map task pull it (and then simply never pull)?

Any help is appreciated!

Thank you,
Stefano.

