Ouch. MapWritable does not reset its hash table on a readFields(); the hash table just grows and grows, and the write() method dumps the entire hash out. The patch is simple: just call instance.clear() in readFields(). (But I haven't looked at the base class.)
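A rough sketch of the idea as a workaround subclass (the class name is made up here, and this is not the actual patch to MapWritable itself): clear the map before delegating to the existing readFields(), so a value object that gets reused across records only ever holds the current record's entries.

    import java.io.DataInput;
    import java.io.IOException;
    import org.apache.hadoop.io.MapWritable;

    // Hypothetical workaround: drop leftover entries before deserializing,
    // so a reused instance doesn't carry data over from previous records.
    public class ClearingMapWritable extends MapWritable {
      @Override
      public void readFields(DataInput in) throws IOException {
        clear();               // remove whatever the previous record left behind
        super.readFields(in);  // then read this record's entries as usual
      }
    }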
________________________________
From: Mike Forrest [mailto:[EMAIL PROTECTED]
Sent: Thu 1/10/2008 3:20 PM
To: hadoop-user@lucene.apache.org
Subject: Re: problem with IdentityMapper

I'm using Text for the keys and MapWritable for the values.

Joydeep Sen Sarma wrote:
> what are the key value types in the Sequencefile?
>
> seems that the maprunner calls createKey and createValue just once. so if the
> value serializes out its entire memory allocated (and not what it last read)
> - it would cause this problem.
>
> (I have periodically shot myself in the foot with this bullet).
>
> ________________________________
>
> From: Mike Forrest [mailto:[EMAIL PROTECTED]
> Sent: Thu 1/10/2008 2:51 PM
> To: hadoop-user@lucene.apache.org
> Subject: problem with IdentityMapper
>
> Hi,
> I'm running into a problem where IdentityMapper seems to produce way too
> much data. For example, I have a job that reads a sequence file using
> IdentityMapper and then uses IdentityReducer to write everything back
> out to another sequence file. My input is a ~60MB sequence file and
> after the map phase has completed, the job tracker UI reports about 10GB
> for "Map output bytes". It seems like the output collector does not get
> properly reset, and so each map output that gets emitted has the correct key
> but the value ends up being all the data you've encountered up to that
> point. I think this is a known issue but I can't seem to find any
> discussion about it right now. Has anyone else run into this, and if
> so, is there a solution? I'm using the latest code in the 0.15 branch.
> Thanks
> Mike
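For anyone hitting this later, a minimal illustration of the reuse pattern described above (the file path and setup are made up, not from this thread): the reader hands back the same key and value objects on every call, so if readFields() never clears the value, each record appears to contain everything read so far.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReuseDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "input.seq" is a placeholder path for illustration only.
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path("input.seq"), conf);

        Text key = new Text();
        MapWritable value = new MapWritable(); // created once, reused each iteration

        while (reader.next(key, value)) {
          // Without a clear() in MapWritable.readFields(), 'value' still holds
          // every entry from earlier records, so this count keeps growing.
          System.out.println(key + " -> " + value.size() + " entries");
        }
        reader.close();
      }
    }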