Re: Performing exactly one map operation per file

Arun C Murthy Sun, 08 Apr 2007 11:33:56 -0700

On Sun, Apr 08, 2007 at 07:23:52PM +0200, Albert Strasheim wrote:
>Hello all
>
>It seems my RecordReader has to specify the types of the keys and
>values. From looking at the other record readers, it seems like I want
>a BytesWritable value. However, I'm not sure what to do about the key.
>One probably wants some kind of string value based on the full path to
>the input file...
>


The 'Mapper' interface, which you have to implement for your mapper class, also
extends the JobConfigurable interface, which means you can provide your own
'configure' method, to which the framework passes the jobconf.xml (as a JobConf
object). There (see
org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.Map.configure in
src/test) you can save the name of the actual input file, which is available as
'map.input.file', and then use that as the 'key'. (PS: Use the 'Text' class
instead of 'UTF8', which is deprecated.)

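
For a streaming mapper (such as the crop.py mentioned in the quoted text), the
same idea might look roughly like the sketch below. It assumes that Hadoop
Streaming exports job configuration values into the mapper's environment with
'.' replaced by '_', so that 'map.input.file' shows up as 'map_input_file';
please check that against your Hadoop version. The helper name input_file_key
is made up for illustration.

```python
import os


def input_file_key(environ=None):
    """Return the current input file path, to be used as the map key.

    Assumption: Hadoop Streaming exposes the job property 'map.input.file'
    as the environment variable 'map_input_file'.
    """
    env = os.environ if environ is None else environ
    path = env.get("map_input_file")
    if not path:
        # Fail loudly rather than emit records with an empty key.
        raise KeyError("map_input_file is not set")
    return path
```

Passing the environment in as a parameter is only there so the function can be
tried outside a running job.
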
>Assuming that gets sorted out, the job configuration would look
>something like this:
>
>mapred.input.format.class: SingleFileInputFormat
>mapred.output.format.class: SingleFileOutputFormat
>mapred.input.key.class: UTF8 (maybe?)
>mapred.input.value.class: BytesWritable
>mapred.output.key.class: UTF8 (maybe?)
>mapred.output.value.class: BytesWritable
>
>At this point, I'm unsure about how one would convince Hadoop to make
>an output file for each input file, and how the names for the output
>files are determined.
>

Take a look at org.apache.hadoop.examples.RandomWriter.Map.map() in
src/examples. There, each map opens and writes to a file in HDFS in a specific
directory, and no output is sent to the reducer at all, i.e. RandomWriter uses
'NullOutputFormat'. Thus, assuming each input file is one of your
audio/video/image files and so yields only 1 key/value pair, you can map one
output file to one input file. This should meet your needs: you get the
framework to spawn one map per input file and get away with only 1 reducer,
which is a no-op.
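
The one-to-one naming itself can be sketched independently of Hadoop. The
helper below is hypothetical (output_path_for is not a Hadoop API) and assumes
the output file simply keeps the input file's base name under the output
directory, as in the origimgs/ and croppedimgs/ example:

```python
import posixpath
from urllib.parse import urlparse


def output_path_for(input_file, output_dir):
    """Map e.g. 'hdfs://host/origimgs/foo.jpg' to 'croppedimgs/foo.jpg'."""
    # Drop any scheme/host part and keep only the file's base name.
    name = posixpath.basename(urlparse(input_file).path)
    if not name:
        raise ValueError("no file name in %r" % (input_file,))
    return posixpath.join(output_dir, name)
```

Note that two input files with the same base name in different directories
would collide under this scheme, so it only suits a flat input directory.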

hth,
Arun

>From the HadoopStreaming wiki page it seems that the number of output
>files depends on the number of reduce tasks, which probably isn't what
>one wants for this application. Any thoughts on what I can do here to
>get a one-to-one mapping? For example, I'd like to do something like:
>
>bin/hadoop -mapper crop.py -input origimgs/ -output croppedimgs/
>
>so that if origimgs/ contains foo.jpg and bar.jpg, I end up with cropped
>versions of foo.jpg and bar.jpg in croppedimgs/.
>
>I hope this isn't a case of square peg, round hole. Hadoop's DFS and
>job scheduling look perfectly suited to this kind of application, if I
>can figure out how to make Hadoop divide the "work" in a way that makes
>sense in this case.
>
>>>From what I understood from running the sample programs, Hadoop splits up
>>>input files and passes the pieces to the map operations. However, I can't
>>>quite figure out how one would create a job configuration that maps a
>>>single file at a time instead of splitting the file (which isn't what one
>>>wants when dealing with images or audio).
>>
>> The InputFormatBase defines an 'isSplitable' api which is used by
>>the framework to deduce whether the mapred framework splits up the
>>input files. You could trivially turn this off by returning 'false'
>>for your {Audio|Video|Image}InputFormat classes.
>
>Thanks, I'll try this.
>
>>>- HadoopStreaming will be useful, since my algorithms can be implemented
>>>as C++ or Python programs
>>
>>The C++ map-reduce api that Owen has been working on might interest
>>you: http://issues.apache.org/jira/browse/HADOOP-234.
>
>I'll definitely take a closer look at this.
>
>Regards,
>
>Albert 
>
