On Tue, Jan 12, 2010 at 7:14 PM, Eric Sammer <e...@lifeless.net> wrote:
> On 1/12/10 6:53 PM, Wilkes, Chris wrote:
>> I created my own Writable class to store 3 pieces of information. In my
>> mapreduce.Reducer class I collect all of them and then process them as a
>> group, i.e.:
>>
>> reduce(key, values, context) {
>>   List<Foo> myFoos = new ArrayList<Foo>();
>>   for (Foo value : values) {
>>     myFoos.add(value);
>>   }
>> }
>
> snip
>
>> Am I doing something wrong? Should I expect this VALUEIN object to
>> change from underneath me? I'm using Hadoop 0.20.1 (from a Cloudera
>> tarball).
>
> That's the documented behavior. Hadoop reuses the same Writable instance
> and replaces the *members* in the readFields() method in most cases (all
> cases?). The instance of Foo in your example will be the same object and
> will simply have its members overwritten after each call to readFields().
> Currently, you're building a list of references to one object: at the end
> of your for loop, you'll have a list of N entries all containing the same
> data. This is one of those "gotchas." If you really need to build a list
> like this, you'd have to resort to doing a deep copy, but you're better
> off avoiding it if you can, as it will drastically impact performance and
> add the requirement that all values for a given key fit in memory.
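The aliasing problem Eric describes can be shown in plain Java, without any Hadoop dependencies. Here `Foo` is a stand-in for the user's Writable, and overwriting `reused.value` in the loop mimics what readFields() does to the single reused instance; the `copy()` method is the hypothetical deep-copy workaround:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the reuse gotcha: the framework hands the loop the
// SAME mutable object each iteration, only its fields change.
class Foo {
    int value;
    Foo copy() {            // the deep-copy workaround
        Foo f = new Foo();
        f.value = this.value;
        return f;
    }
}

public class ReuseGotcha {
    public static void main(String[] args) {
        Foo reused = new Foo();               // stands in for the reused Writable
        List<Foo> aliased = new ArrayList<Foo>();
        List<Foo> copied = new ArrayList<Foo>();
        for (int v = 1; v <= 3; v++) {
            reused.value = v;                 // readFields() overwrites members like this
            aliased.add(reused);              // buggy: stores the same reference 3 times
            copied.add(reused.copy());        // correct: snapshot of the current state
        }
        // Every aliased entry shows the LAST value written; copies keep theirs.
        System.out.println(aliased.get(0).value + " " + aliased.get(2).value); // 3 3
        System.out.println(copied.get(0).value + " " + copied.get(2).value);   // 1 3
    }
}
```

In an actual reducer, the copy can be made with a copy constructor on your Writable, or with Hadoop's WritableUtils.clone(writable, conf), which round-trips the object through its own serialization. Either way the memory caveat above still applies.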
What is the preferred method of avoiding value buffering? For example, if
you're building a basic inverted index, you have one key (the term)
associated with many values (doc ids) in your reducer. If you want an
output pair of something like <Text, IntArrayWritable>, is there a way to
build and output the id array without buffering the values? The only
alternative I see is to use <Text, IntWritable> instead and repeat the
term for every doc id, but this seems wasteful.

Ed
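The repeated-key alternative Ed mentions can be sketched in plain Java. The `emit()` helper below is hypothetical, standing in for Hadoop's context.write(); the point is that the reducer touches each value once and never builds a list, so memory stays constant regardless of how many documents contain the term:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the non-buffering alternative: emit one (term, docId) pair
// per incoming value instead of collecting all doc ids into one array
// value. The emit() helper is a stand-in for context.write().
public class StreamingIndex {
    static List<String> output = new ArrayList<String>();

    static void emit(String term, int docId) {
        output.add(term + "\t" + docId);      // one output record per value
    }

    static void reduce(String term, Iterator<Integer> docIds) {
        while (docIds.hasNext()) {
            emit(term, docIds.next());        // no list is ever built
        }
    }

    public static void main(String[] args) {
        reduce("hadoop", Arrays.asList(7, 42, 99).iterator());
        System.out.println(output);           // [hadoop\t7, hadoop\t42, hadoop\t99]
    }
}
```

The trade-off is exactly the one raised in the question: the term is serialized once per doc id, which costs output size, but the reducer runs in constant memory and needs no deep copies.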