On Tue, Jan 12, 2010 at 7:14 PM, Eric Sammer <e...@lifeless.net> wrote:
> On 1/12/10 6:53 PM, Wilkes, Chris wrote:
>> I created my own Writable class to store 3 pieces of information. In my
>> mapreduce.Reducer class I collect all of them and then process them as a
>> group, i.e.:
>>
>> reduce(key, values, context) {
>>   List<Foo> myFoos = new ArrayList<Foo>();
>>   for (Foo value : values) {
>>     myFoos.add(value);
>>   }
>> }
>
> snip
>
>> Am I doing something wrong? Should I expect this VALUEIN object to
>> change from underneath me? I'm using Hadoop 0.20.1 (from a Cloudera
>> tarball).
>
> That's the documented behavior. Hadoop reuses the same Writable instance
> and replaces the *members* in the readFields() method in most cases (all
> cases?). The instance of Foo in your example will be the same object and
> will simply have its members overwritten after each call to readFields().
> Currently, you're building a list of references to one object: at the end
> of your for loop, you'll have a list of N entries all containing the same
> data. This is one of those "gotchas." If you really need to build a list
> like this, you'd have to resort to doing a deep copy, but you're better
> off avoiding it if you can, as it will drastically impact performance and
> add the requirement that all values for a given key fit in memory.
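The aliasing problem Eric describes can be shown in plain Java, without any Hadoop dependencies. Here `Foo` is a stand-in for the user's Writable, and overwriting `reused.value` in the loop mimics what readFields() does to the single reused instance; the `copy()` method is the hypothetical deep-copy workaround:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the reuse gotcha: the framework hands the loop the
// SAME mutable object each iteration, only its fields change.
class Foo {
    int value;
    Foo copy() {            // the deep-copy workaround
        Foo f = new Foo();
        f.value = this.value;
        return f;
    }
}

public class ReuseGotcha {
    public static void main(String[] args) {
        Foo reused = new Foo();               // stands in for the reused Writable
        List<Foo> aliased = new ArrayList<Foo>();
        List<Foo> copied = new ArrayList<Foo>();
        for (int v = 1; v <= 3; v++) {
            reused.value = v;                 // readFields() overwrites members like this
            aliased.add(reused);              // buggy: stores the same reference 3 times
            copied.add(reused.copy());        // correct: snapshot of the current state
        }
        // Every aliased entry shows the LAST value written; copies keep theirs.
        System.out.println(aliased.get(0).value + " " + aliased.get(2).value); // 3 3
        System.out.println(copied.get(0).value + " " + copied.get(2).value);   // 1 3
    }
}
```

In an actual reducer, the copy can be made with a copy constructor on your Writable, or with Hadoop's WritableUtils.clone(writable, conf), which round-trips the object through its own serialization. Either way the memory caveat above still applies.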
What is the preferred method of avoiding value buffering? For example, if
you're building a basic inverted index, you have one key (the term)
associated with many values (doc ids) in your reducer. If you want an
output pair of something like <Text, IntArrayWritable>, is there a way to
build and output the id array without buffering the values? The only
alternative I see is to use <Text, IntWritable> instead and repeat the
term for every doc id, but this seems wasteful.

Ed
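The repeated-key alternative Ed mentions can be sketched in plain Java. The `emit()` helper below is hypothetical, standing in for Hadoop's context.write(); the point is that the reducer touches each value once and never builds a list, so memory stays constant regardless of how many documents contain the term:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the non-buffering alternative: emit one (term, docId) pair
// per incoming value instead of collecting all doc ids into one array
// value. The emit() helper is a stand-in for context.write().
public class StreamingIndex {
    static List<String> output = new ArrayList<String>();

    static void emit(String term, int docId) {
        output.add(term + "\t" + docId);      // one output record per value
    }

    static void reduce(String term, Iterator<Integer> docIds) {
        while (docIds.hasNext()) {
            emit(term, docIds.next());        // no list is ever built
        }
    }

    public static void main(String[] args) {
        reduce("hadoop", Arrays.asList(7, 42, 99).iterator());
        System.out.println(output);           // [hadoop\t7, hadoop\t42, hadoop\t99]
    }
}
```

The trade-off is exactly the one raised in the question: the term is serialized once per doc id, which costs output size, but the reducer runs in constant memory and needs no deep copies.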