Hi Albert,

On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
>Hello all
>
>I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
>machine learning application.

Welcome to Hadoop!

Here is a broad outline of how Hadoop's map-reduce framework handles user inputs, formats, etc.:
a) User specifies the input directory via JobConf.setInputPath or 
mapred.input.dir in the .xml file.
b) User specifies the format of the input files so that the framework can 
decide how to break the data into 'records', i.e. key/value pairs, which are 
then fed to the user-defined map/reduce APIs. I suspect you will have to come 
up with your own InputFormat class (depending on audio/image/video files etc.) 
by subclassing org.apache.hadoop.mapred.InputFormatBase, and also your own 
org.apache.hadoop.mapred.RecordReader (which actually reads the individual 
key/value pairs). There are examples of both in the org.apache.hadoop.mapred 
package: TextInputFormat/LineRecordReader and 
SequenceFileInputFormat/SequenceFileRecordReader; they usually come in pairs.
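To make the record-reader side concrete, here is a rough Python sketch of what a "whole file as one record" reader boils down to. The class name is made up and the code deliberately avoids Hadoop's actual Java interfaces (Hadoop's RecordReader.next() fills caller-supplied Writable objects instead of returning tuples); it only illustrates the idea:

```python
class WholeFileRecordReader:
    """Illustrative stand-in for a custom RecordReader: it turns one input
    file into a single key/value record (filename -> raw bytes), which is
    roughly what you would want for audio/image/video inputs."""

    def __init__(self, path):
        self.path = path
        self.done = False

    def next(self):
        """Return the next (key, value) record, or None when exhausted."""
        if self.done:
            return None
        self.done = True
        with open(self.path, "rb") as f:
            # Key: the file name; value: the entire file contents.
            return self.path, f.read()

# A line-oriented reader, by contrast, would emit one (offset, line) record
# per line of text; that is what TextInputFormat/LineRecordReader do.
```

The matching InputFormat's job is then just to enumerate splits and hand each split to a reader like this.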

>
>From what I understood from running the sample programs, Hadoop splits up 
>input files and passes the pieces to the map operations. However, I can't 
>quite figure out how one would create a job configuration that maps a 
>single file at a time instead of splitting the file (which isn't what one 
>wants when dealing with images or audio).
>

InputFormatBase defines an 'isSplitable' API which the framework uses to 
decide whether to split up the input files. You can trivially turn splitting 
off by returning 'false' from your {Audio|Video|Image}InputFormat classes.

>- HadoopStreaming will be useful, since my algorithms can be implemented as 
>C++ or Python programs

The C++ map-reduce api that Owen has been working on might interest you: 
http://issues.apache.org/jira/browse/HADOOP-234.
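If you do go the streaming route, a streaming mapper is just an executable that reads input records on stdin and writes tab-separated key/value lines on stdout. A minimal Python mapper might look like this (the word-count body is only an example; yours would do the actual learning step):

```python
import sys

def map_line(line):
    """Example mapper body: emit a (word, 1) pair for each word."""
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    # HadoopStreaming feeds records on stdin and expects "key\tvalue"
    # lines on stdout; the framework handles the sort/shuffle in between.
    for line in sys.stdin:
        for key, value in map_line(line):
            print("%s\t%s" % (key, value))
```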

hth,
Arun
