Hello all
I'm new to Hadoop and I'm looking at using it for a distributed
machine learning application.
For my application (and probably for many machine learning applications), one
would want to do something like the following:
1. Upload a bunch of images/audio/whatever to the DFS
2. Run a map operation to do something like:
2.1 Perform some transformation on each image, creating N new images
2.2 Convert the audio into feature vectors, storing all the feature vectors
from a single audio file in a new file
3. Store the output of these map operations in the DFS
In general, one wants to take a dataset of N discrete items and map them
to N other items. Each item can typically be mapped independently of the
others, so this distributes nicely. However, each item must be handed to
the map operation as a single, complete unit.
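To make that concrete: if I were writing the map in Java (rather than through
streaming), I imagine each map call would receive one whole file, e.g. as a
(file name, raw bytes) pair, and emit one transformed item. A rough, untested
sketch of what I have in mind (all class and method names here are just my own
placeholders):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical per-item mapper: called once per input file, with the
// file name as the key and the file's raw bytes as the value.
public class FeatureExtractMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, BytesWritable> {

    public void map(Text fileName, BytesWritable fileContents,
                    OutputCollector<Text, BytesWritable> output,
                    Reporter reporter) throws IOException {
        // extractFeatures() stands in for whatever per-item work I need
        // (image transformation, audio -> feature vectors, ...).
        byte[] features = extractFeatures(fileContents.getBytes(),
                                          fileContents.getLength());
        // One output record per input item, keyed by the original file name.
        output.collect(fileName, new BytesWritable(features));
    }

    private byte[] extractFeatures(byte[] raw, int length) {
        // Placeholder: the real work would live here (or in an external
        // C++/Python program in the streaming case).
        return raw;
    }
}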
I've looked through the Hadoop wiki and the code and so far I've come up
with the following:
- HadoopStreaming will be useful, since my algorithms can be implemented as
C++ or Python programs
- I probably want to use an IdentityReducer to achieve what I outlined above
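For the IdentityReducer part, here is roughly how I picture the Java driver,
mainly to check my understanding of the moving parts (untested, and I realise
that with HadoopStreaming most of this would be given on the command line
instead):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PerFileJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PerFileJob.class);
        conf.setJobName("per-file feature extraction");

        // The mapper sketched above (or an external program via streaming).
        conf.setMapperClass(FeatureExtractMapper.class);

        // The reduce phase just passes the map output straight through.
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BytesWritable.class);

        // This is the piece I can't figure out: which InputFormat to set so
        // that each input file is handed to a map operation whole?
        // conf.setInputFormat(???);  -- see my question below

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}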
From what I understood from running the sample programs, Hadoop splits up the
input files and passes the pieces to the map operations. However, I can't
quite figure out how one would create a job configuration that hands each
file to a map operation as a single unit instead of splitting it (which isn't
what one wants when dealing with images or audio).
Does anybody have ideas on how to accomplish this? I'm guessing some
new code might have to be written, so any pointers on where to start would
be much appreciated.
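In case it helps to see where my thinking is: my current guess is that I would
need a custom FileInputFormat that refuses to split its files, plus a
RecordReader that hands each file over as one record, something like the rough,
untested sketch below (all names are my own, and I may well be hooking into the
wrong classes). If I'm reading the HadoopStreaming docs right, I could then
point the streaming -inputformat option at such a class.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical "whole file" input format: never split a file, and deliver
// each file as a single (file name, file bytes) record.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    protected boolean isSplitable(FileSystem fs, Path filename) {
        // Never split an input file, so each map gets exactly one file.
        return false;
    }

    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    // Produces exactly one record: the whole content of the file.
    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean done = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        public boolean next(Text key, BytesWritable value) throws IOException {
            if (done) {
                return false;
            }
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(job);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                // Read the entire file into memory as one record.
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            done = true;
            return true;
        }

        public Text createKey() { return new Text(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return done ? split.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close() { }
    }
}

If there is already an InputFormat in the tree that does something like this,
a pointer to it would save me reinventing it.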
Thanks for your time.
Regards,
Albert