Hello all
I'm new to Hadoop and I'm looking at using it for a distributed
machine learning application.
For my application (and probably for many machine learning applications), one
would want to do something like the following:
1. Upload a bunch of images/audio/whatever to the DFS
2. Run a map operation to do something like:
2.1 Perform some transformation on each image, creating N new images
2.2 Convert the audio into feature vectors, storing all the feature vectors
from a single audio file in a new file
3. Store the output of these map operations in the DFS
In general, one wants to take a dataset of N discrete items and map them
to N other items. Each item can typically be mapped independently of the
others, so this distributes nicely. However, each item must be handed to
the map operation as a single, complete unit.
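To make that concrete: if I were writing the map in Java (rather than through
streaming), I imagine each map call would receive one whole file, e.g. as a
(file name, raw bytes) pair, and emit one transformed item. A rough, untested
sketch of what I have in mind (all class and method names here are just my own
placeholders):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical per-item mapper: called once per input file, with the
// file name as the key and the file's raw bytes as the value.
public class FeatureExtractMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, BytesWritable> {

    public void map(Text fileName, BytesWritable fileContents,
                    OutputCollector<Text, BytesWritable> output,
                    Reporter reporter) throws IOException {
        // extractFeatures() stands in for whatever per-item work I need
        // (image transformation, audio -> feature vectors, ...).
        byte[] features = extractFeatures(fileContents.getBytes(),
                                          fileContents.getLength());
        // One output record per input item, keyed by the original file name.
        output.collect(fileName, new BytesWritable(features));
    }

    private byte[] extractFeatures(byte[] raw, int length) {
        // Placeholder: the real work would live here (or in an external
        // C++/Python program in the streaming case).
        return raw;
    }
}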
I've looked through the Hadoop wiki and the code and so far I've come up
with the following:
- HadoopStreaming will be useful, since my algorithms can be implemented as
C++ or Python programs
- I probably want to use an IdentityReducer to achieve what I outlined above
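For the IdentityReducer part, here is roughly how I picture the Java driver,
mainly to check my understanding of the moving parts (untested, and I realise
that with HadoopStreaming most of this would be given on the command line
instead):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PerFileJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PerFileJob.class);
        conf.setJobName("per-file feature extraction");

        // The mapper sketched above (or an external program via streaming).
        conf.setMapperClass(FeatureExtractMapper.class);

        // The reduce phase just passes the map output straight through.
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BytesWritable.class);

        // This is the piece I can't figure out: which InputFormat to set so
        // that each input file is handed to a map operation whole?
        // conf.setInputFormat(???);  -- see my question below

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}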
From what I understood from running the sample programs, Hadoop splits up the
input files and passes the pieces to the map operations. However, I can't
quite figure out how one would create a job configuration that hands each
file to a map operation as a single unit instead of splitting it (which isn't
what one wants when dealing with images or audio).
Does anybody have ideas on how to accomplish this? I'm guessing some
new code might have to be written, so any pointers on where to start would
be much appreciated.
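In case it helps to see where my thinking is: my current guess is that I would
need a custom FileInputFormat that refuses to split its files, plus a
RecordReader that hands each file over as one record, something like the rough,
untested sketch below (all names are my own, and I may well be hooking into the
wrong classes). If I'm reading the HadoopStreaming docs right, I could then
point the streaming -inputformat option at such a class.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical "whole file" input format: never split a file, and deliver
// each file as a single (file name, file bytes) record.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    protected boolean isSplitable(FileSystem fs, Path filename) {
        // Never split an input file, so each map gets exactly one file.
        return false;
    }

    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    // Produces exactly one record: the whole content of the file.
    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean done = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        public boolean next(Text key, BytesWritable value) throws IOException {
            if (done) {
                return false;
            }
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(job);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                // Read the entire file into memory as one record.
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            done = true;
            return true;
        }

        public Text createKey() { return new Text(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return done ? split.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close() { }
    }
}

If there is already an InputFormat in the tree that does something like this,
a pointer to it would save me reinventing it.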
Thanks for your time.
Regards,
Albert