Hi Liu,

You have a few choices: you can either a) use no OutputFormat at all, or b) create your own custom one that handles what you need - see the rough OutputFormat skeleton at the very end of this mail.

I have MapReduce jobs that scan an HBase table and compute a specific value that I then store in memcached. For that I do the work directly in a custom TableMapper and set the output format to

  job.setOutputFormatClass(NullOutputFormat.class);

I often also set the number of reducers to 0, since I can do all the work in the Mapper. Row keys are sorted and unique, so there is no need for a Reducer - there is nothing to reduce. So I also do

  job.setNumReduceTasks(0);
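To make that more concrete, here is a rough sketch of such a map-only job. The table name, the column to read, the memcached address and the use of the spymemcached client are all assumptions for the example - adapt them to whatever you actually compute and store:

  import java.io.IOException;
  import java.net.InetSocketAddress;

  import net.spy.memcached.MemcachedClient;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class ScanToMemcached {

    // The Mapper computes a value per row and stores it in memcached itself,
    // so the job needs no file output and no reduce phase.
    static class MemcachedMapper
        extends TableMapper<NullWritable, NullWritable> {

      private MemcachedClient client;

      @Override
      protected void setup(Context context) throws IOException {
        client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result columns,
          Context context) throws IOException {
        // Example "computation": take one cell of the row and cache it
        // under the row key.
        byte[] value = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
        if (value != null) {
          client.set(Bytes.toString(row.get()), 0, value);
        }
      }

      @Override
      protected void cleanup(Context context) {
        client.shutdown();
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "scan-to-memcached");
      job.setJarByClass(ScanToMemcached.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // larger scanner caching helps MR scans
      scan.setCacheBlocks(false);  // do not fill the region server block cache

      TableMapReduceUtil.initTableMapperJob("mytable", scan,
          MemcachedMapper.class, NullWritable.class, NullWritable.class, job);

      job.setOutputFormatClass(NullOutputFormat.class); // no file output
      job.setNumReduceTasks(0);                         // map-only job

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Because the Mapper writes to memcached itself, there is nothing left for the framework to output, which is exactly why NullOutputFormat and zero reducers are enough here.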
The new Hadoop MapReduce API has removed the ability to set the number of map tasks. It was always just a hint to the framework anyway, not a hard limit. The number of Mappers is tied to the InputFormat in use, since it is responsible for splitting the input data into chunks for processing. Our TableInputFormat, for example, splits the table at region boundaries. A FileInputFormat may split text files into blocks matching the Hadoop block size, recording one of the data nodes that holds a copy of each block so the data can be processed locally. But if the input file is in a compressed, non-splittable format such as GZip, then a single Mapper handles the whole file - even if you had asked for 10 map tasks, it would only use one, as it has no other choice.

Lars

Liu Xianglong wrote:
> Hi, everyone. Is there someone who uses map-reduce to store the reduce output
> in memory? I mean, right now the output path of the job is set and the reduce
> outputs are stored in files under this path (see the comments along with the
> following code):
>
>   job.setOutputFormatClass(MyOutputFormat.class);
>   // can I implement my OutputFormat to store these output key-value pairs
>   // in my data structures, or are there other ways to do it?
>   job.setOutputKeyClass(ImmutableBytesWritable.class);
>   job.setOutputValueClass(Result.class);
>   FileOutputFormat.setOutputPath(job, outputDir);
>
> Is there any way to store them in some variables or data structures? If so,
> how can I implement my OutputFormat? Any suggestions and code are welcome.
>
> Another question: is there some way to set the number of map tasks? It seems
> there is no API to do this in the new Hadoop job APIs. I am not sure how to
> set this number.
>
> Thanks!
>
> Best Wishes!
> _____________________________________________________________
>
> 刘祥龙 Liu Xianglong
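P.S.: Since you asked how to implement your own OutputFormat: keep in mind that each task runs in its own JVM, so an in-memory data structure filled by a RecordWriter only lives inside that task and is never visible to the program that submitted the job - you need some shared store such as memcached anyway. With that in mind, here is a rough, untested skeleton of an OutputFormat that pushes every key/value pair to memcached instead of writing files; again, the memcached address and the spymemcached client are assumptions:

  import java.io.IOException;
  import java.net.InetSocketAddress;

  import net.spy.memcached.MemcachedClient;

  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.OutputCommitter;
  import org.apache.hadoop.mapreduce.OutputFormat;
  import org.apache.hadoop.mapreduce.RecordWriter;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  // Skeleton of an OutputFormat that sends every pair to memcached instead of HDFS.
  public class MemcachedOutputFormat
      extends OutputFormat<ImmutableBytesWritable, Result> {

    @Override
    public RecordWriter<ImmutableBytesWritable, Result> getRecordWriter(
        TaskAttemptContext context) throws IOException {
      // One client per task attempt; the address is hard coded for the example only.
      final MemcachedClient client =
          new MemcachedClient(new InetSocketAddress("localhost", 11211));
      return new RecordWriter<ImmutableBytesWritable, Result>() {
        @Override
        public void write(ImmutableBytesWritable key, Result value)
            throws IOException {
          // Store whatever representation of the Result you need; here just
          // the value of the first cell, keyed by the row key.
          client.set(Bytes.toString(key.get()), 0, value.value());
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
          client.shutdown();
        }
      };
    }

    @Override
    public void checkOutputSpecs(JobContext context) {
      // Nothing to check - there is no output directory.
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException {
      // Reuse NullOutputFormat's do-nothing committer.
      return new NullOutputFormat<ImmutableBytesWritable, Result>()
          .getOutputCommitter(context);
    }
  }

You would then register it with job.setOutputFormatClass(MemcachedOutputFormat.class); and drop the FileOutputFormat.setOutputPath() call, since there is no output directory anymore.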
