Hello all

On Sun, 08 Apr 2007, Arun C Murthy wrote:

Hi Albert,

On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
>Hello all
>
>I'm a new Hadoop user and I'm looking at using Hadoop for a distributed
>machine learning application.

Welcome to Hadoop!

Here is a broad outline of how hadoop's map-reduce framework works,
specifically for user inputs/formats etc.:
a) User specifies the input directory via JobConf.setInputPath or
mapred.input.dir in the .xml file.
b) User specifies the format of the input files so that the framework
can then decide how to break the data into 'records', i.e. key/value
pairs which are then sent to the user-defined map/reduce apis. I
suspect you will have to write your own input format (depending on
audio/image/video files etc.) by subclassing from
org.apache.hadoop.mapred.InputFormatBase and also a
org.apache.hadoop.mapred.RecordReader (which actually reads individual
key/value pairs). There are some examples in the
org.apache.hadoop.mapred package for both of the above:
TextInputFormat/LineRecordReader and
SequenceFileInputFormat/SequenceFileRecordReader; usually they come in
pairs.

Thanks for the pointers. I'm on my way to coding up a
SingleFileInputFormat and a SingleFileRecordReader (for lack of better
names at present).

It seems my RecordReader has to specify the types of the keys and
values. From looking at the other record readers, it seems like I want
a BytesWritable value. However, I'm not sure what to do about the key.
One probably wants some kind of string based on the full path to
the input file...
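
To make that concrete, here is a rough, untested sketch of what I have in
mind for SingleFileRecordReader (a working name): the file's path as the
key and the whole file as a BytesWritable value. I've used Text for the key
and a generic RecordReader<K,V> interface; if RecordReader isn't generic in
the tree I'm running, the exact signatures would change slightly, but the
structure should stay the same:

// Sketch only: one record per file, key = file path, value = file contents.
// NOTE: assumes a generic RecordReader<K,V>; older releases have a
// non-generic interface, so treat the signatures as approximate.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class SingleFileRecordReader implements RecordReader<Text, BytesWritable> {

  private final FileSplit split;
  private final Configuration conf;
  private boolean done = false;

  public SingleFileRecordReader(FileSplit split, Configuration conf) {
    this.split = split;
    this.conf = conf;
  }

  // Emits exactly one (path, bytes) record, then reports end-of-input.
  public boolean next(Text key, BytesWritable value) throws IOException {
    if (done) {
      return false;
    }
    Path file = split.getPath();
    byte[] contents = new byte[(int) split.getLength()];
    FSDataInputStream in = file.getFileSystem(conf).open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    key.set(file.toString());
    value.set(contents, 0, contents.length);
    done = true;
    return true;
  }

  public Text createKey() { return new Text(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return done ? split.getLength() : 0; }
  public float getProgress() { return done ? 1.0f : 0.0f; }
  public void close() { }
}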

Assuming that gets sorted out, the job configuration would look
something like this:

mapred.input.format.class: SingleFileInputFormat
mapred.output.format.class: SingleFileOutputFormat
mapred.input.key.class: UTF8 (maybe?)
mapred.input.value.class: BytesWritable
mapred.output.key.class: UTF8 (maybe?)
mapred.output.value.class: BytesWritable
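
I assume the programmatic equivalent would look roughly like this
(SingleFileInputFormat and SingleFileOutputFormat are still just working
names that would have to be written, Text stands in for UTF8, and I'm not
yet sure whether the input key/value classes need to be set explicitly or
are simply implied by the RecordReader):

// Hypothetical driver for the job described above; the setter names are
// from the JobConf class Arun mentions.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PerFileJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PerFileJob.class);
    conf.setJobName("per-file processing");

    conf.setInputPath(new Path("origimgs"));      // or mapred.input.dir in the .xml
    conf.setOutputPath(new Path("croppedimgs"));  // or mapred.output.dir

    conf.setInputFormat(SingleFileInputFormat.class);
    conf.setOutputFormat(SingleFileOutputFormat.class);

    conf.setOutputKeyClass(Text.class);           // UTF8 (maybe?) as above
    conf.setOutputValueClass(BytesWritable.class);

    JobClient.runJob(conf);
  }
}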

At this point, I'm unsure about how one would convince Hadoop to make
an output file for each input file, and how the names for the output
files are determined.

From the HadoopStreaming wiki page it seems that the number of output
files depends on the number of reduce tasks, which probably isn't what
one wants for this application. Any thoughts on what I can do here to
get a one-to-one mapping? For example, I'd like to do something like:

bin/hadoop -mapper crop.py -input origimgs/ -output croppedimgs/

so that if origimgs/ contains foo.jpg and bar.jpg, I end up with cropped
versions of foo.jpg and bar.jpg in croppedimgs/.
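
If I understand the framework correctly, one way to get close to this is a
map-only job: with splitting turned off, each file becomes its own map
task, and with zero reduces each map's output goes straight to the output
directory as its own part file. That gives a one-to-one correspondence,
although the outputs would be named part-00000, part-00001, etc. rather
than foo.jpg and bar.jpg, so a custom OutputFormat would presumably still
be needed to control the file names:

conf.setNumReduceTasks(0);  // map-only: one output file per map task, i.e. per input file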

I hope this isn't a case of square peg, round hole. Hadoop's DFS and
job scheduling look perfectly suited to this kind of application, if I
can figure out how to make Hadoop divide the "work" in a way that makes
sense in this case.

>From what I understood from running the sample programs, Hadoop splits up
>input files and passes the pieces to the map operations. However, I can't
>quite figure out how one would create a job configuration that maps a
>single file at a time instead of splitting the file (which isn't what one
>wants when dealing with images or audio).

 The InputFormatBase defines an 'isSplitable' api which the framework
uses to decide whether to split up the input files. You could trivially
turn this off by returning 'false' for your
{Audio|Video|Image}InputFormat classes.

Thanks, I'll try this.
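
Concretely, I imagine the input format itself ends up tiny: just turn off
splitting and hand each whole file to the SingleFileRecordReader sketched
earlier. An untested sketch (working names again; I've written it against a
FileInputFormat base class with generics, so if the tree I'm running only
has InputFormatBase and a non-generic RecordReader, the base class name and
exact signatures will differ slightly, but the idea is the same):

// Sketch only: an input format that never splits files.
// NOTE: base class and generics are assumptions; substitute
// InputFormatBase and non-generic signatures as appropriate.
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class SingleFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  // Never split an image/audio file: each file goes whole to one map task.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  // Hand each (unsplit) file to the record reader that emits one
  // (path, bytes) record per file.
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new SingleFileRecordReader((FileSplit) split, job);
  }
}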

>- HadoopStreaming will be useful, since my algorithms can be implemented as
>C++ or Python programs

The C++ map-reduce api that Owen has been working on might interest
you: http://issues.apache.org/jira/browse/HADOOP-234.

I'll definitely take a closer look at this.

Regards,

Albert
