Hello all

I'm a new Hadoop user and I'm looking at using Hadoop for a distributed machine learning application.

For my application (and probably for many machine learning applications), one would want to do something like the following:

1. Upload a bunch of images/audio/whatever to the DFS
2. Run a map operation to do something like:
2.1 perform some transformation on each image, creating N new images
2.2 convert the audio into feature vectors, storing all the feature vectors from a single audio file in a new file
3. Store the output of these map operations in the DFS

In general, one wants to take a dataset with N discrete items, and map them to N other items. Each item can typically be mapped independently of the other items, so this distributes nicely. However, each item must be sent to the map operation as a unit.
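
To make step 2 more concrete, here is the rough shape of the mapper I have in mind as a HadoopStreaming script. This is only a sketch: transform_item() is a stand-in for whatever per-item computation I actually need, and it assumes each input record arrives as a single line on stdin.

#!/usr/bin/env python
# Sketch of a streaming mapper: read one record per line from stdin,
# apply some per-item transformation, and write one key<TAB>value
# record per line to stdout. transform_item() is just a placeholder.
import sys

def transform_item(item):
    # Stand-in for the real work (e.g. computing a feature vector).
    return item.upper()

for line in sys.stdin:
    item = line.rstrip("\n")
    if not item:
        continue
    sys.stdout.write("%s\t%s\n" % (item, transform_item(item)))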

I've looked through the Hadoop wiki and the code and so far I've come up with the following:

- HadoopStreaming will be useful, since my algorithms can be implemented as C++ or Python programs
- I probably want to use an IdentityReducer to achieve what I outlined above
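
If I understand the streaming docs correctly, the "identity reduce" step can be as trivial as a script that copies stdin to stdout (or presumably the Java org.apache.hadoop.mapred.lib.IdentityReducer, or even cat). Something like:

#!/usr/bin/env python
# Trivial identity reducer for HadoopStreaming: pass every key/value
# line through to the output unchanged.
import sys

for line in sys.stdin:
    sys.stdout.write(line)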

From what I understood from running the sample programs, Hadoop splits up input files and passes the pieces to the map operations. However, I can't quite figure out how one would create a job configuration that maps a single file at a time instead of splitting the file (which isn't what one wants when dealing with images or audio).

Does anybody have some ideas on how to accomplish this? I'm guessing some new code might have to be written, so any pointers on where to start would be much appreciated.
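
In case it clarifies what I'm after: the only workaround I've come up with so far is to make the job input a plain text file listing HDFS paths (one per line), so that Hadoop only ever splits the list of paths, and then have each mapper pull in and process the whole file itself. A rough sketch of that idea is below (it assumes "hadoop fs -cat" is a reasonable way to fetch the bytes, and extract_features() is again a placeholder). This feels clumsy, though, so a cleaner approach (a custom InputFormat that never splits its files, maybe?) would be very welcome.

#!/usr/bin/env python
# Sketch of a "whole file" streaming mapper: each input line is assumed
# to be an HDFS path, so splitting only ever cuts up the list of paths,
# never the image/audio files themselves.
import subprocess
import sys

def extract_features(data):
    # Placeholder for the real computation (e.g. feature vectors from audio bytes).
    return str(len(data))

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # Fetch the whole file from the DFS (assumes the hadoop command is on PATH).
    data = subprocess.check_output(["hadoop", "fs", "-cat", path])
    sys.stdout.write("%s\t%s\n" % (path, extract_features(data)))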

Thanks for your time.

Regards,

Albert

