Hi Albert,
On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
>Hello all
>
>I'm a new Hadoop user and I'm looking at using Hadoop for a distributed
>machine learning application.
Welcome to Hadoop!
Here is a broad outline of how Hadoop's map-reduce framework handles user
inputs, input formats etc.:
a) User specifies the input directory via JobConf.setInputPath or the
mapred.input.dir property in the job's .xml configuration file.
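For example, the .xml route looks like this (the path below is just a
placeholder for wherever your data lives in the DFS):

```xml
<!-- Hypothetical job-configuration fragment; the value is a placeholder. -->
<property>
  <name>mapred.input.dir</name>
  <value>/user/albert/audio-input</value>
</property>
```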
b) User specifies the format of the input files so that the framework can
decide how to break the data into 'records', i.e. key/value pairs, which are
then sent to the user-defined map/reduce APIs. I suspect you will have to come
up with your own InputFormat class (depending on audio/image/video files etc.)
by subclassing org.apache.hadoop.mapred.InputFormatBase, and also implement an
org.apache.hadoop.mapred.RecordReader (which actually reads the individual
key/value pairs). There are examples of both in the org.apache.hadoop.mapred
package: TextInputFormat/LineRecordReader and
SequenceFileInputFormat/SequenceFileRecordReader; they usually come in pairs.
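As a rough skeleton of the InputFormat half of such a pair (AudioInputFormat
and AudioRecordReader are made-up names, and the exact RecordReader methods
have varied between Hadoop releases, so compare against LineRecordReader in
your tree before relying on this):

```java
import java.io.IOException;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormatBase;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical InputFormat: it only decides how input files map to
// splits and which RecordReader the framework should hand back.
public class AudioInputFormat extends InputFormatBase {
  public RecordReader getRecordReader(InputSplit split, JobConf job,
                                      Reporter reporter) throws IOException {
    // AudioRecordReader (not shown) would emit, say, key = file name,
    // value = the file's raw bytes as a single record.
    return new AudioRecordReader((FileSplit) split, job);
  }
}
```

The corresponding AudioRecordReader would then do the actual byte-level
reading, exactly as LineRecordReader does for text.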
>
>From what I understood from running the sample programs, Hadoop splits up
>input files and passes the pieces to the map operations. However, I can't
>quite figure out how one would create a job configuration that maps a
>single file at a time instead of splitting the file (which isn't what one
>wants when dealing with images or audio).
>
The InputFormatBase defines an 'isSplitable' method which the framework uses
to decide whether it may split up an input file. You can trivially turn
splitting off by returning 'false' from your
{Audio|Video|Image}InputFormat classes.
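Concretely, the override is a one-liner inside your (hypothetical) subclass of
InputFormatBase; do check the isSplitable signature against your release:

```java
// Returning false keeps each file in a single split, so no map task
// ever sees half an image or half an audio clip.
protected boolean isSplitable(FileSystem fs, Path file) {
  return false;
}
```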
>- HadoopStreaming will be useful, since my algorithms can be implemented as
>C++ or Python programs
The C++ map-reduce API that Owen has been working on might also interest you:
http://issues.apache.org/jira/browse/HADOOP-234.
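With streaming, your Python (or C++) executable reads records from stdin and
writes key/value output to stdout; a typical invocation looks roughly like the
following (the jar location, paths and script name are placeholders that
depend on your installation and release):

```shell
# Hypothetical streaming run: the framework pipes each record to the
# mapper's stdin and collects its stdout as map output.
bin/hadoop jar contrib/hadoop-streaming.jar \
    -input  /user/albert/audio-input \
    -output /user/albert/audio-output \
    -mapper extract_features.py \
    -reducer NONE \
    -file   extract_features.py   # ship the script to the task nodes
```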
hth,
Arun