Ouch. MapWritable does not reset its hash table in readFields(). The hash table 
just grows and grows, and the write() method then dumps the entire hash out.
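
That matters because of the object reuse Joydeep describes below: the runner 
creates one key object and one value object and refills them for every record 
via readFields(). Roughly, the map-side record loop looks like the sketch below 
(a simplified illustration written against the generic form of the old mapred 
interfaces, with reporting, error handling and cleanup omitted):

import java.io.IOException;

import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Simplified sketch of the map-side record loop: the key and value are
// allocated once and refilled for every record via readFields(). Any state
// that readFields() fails to reset survives from one record to the next,
// which is exactly how a MapWritable value keeps growing.
public class RecordLoopSketch {
  public static <K1, V1, K2, V2> void run(RecordReader<K1, V1> input,
                                          Mapper<K1, V1, K2, V2> mapper,
                                          OutputCollector<K2, V2> output,
                                          Reporter reporter) throws IOException {
    K1 key = input.createKey();     // created once
    V1 value = input.createValue(); // created once
    while (input.next(key, value)) {
      // next() deserializes into the same two objects on every iteration.
      mapper.map(key, value, output, reporter);
    }
  }
}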
 
The patch is simple: just do an instance.clear() in readFields(). (But I 
haven't looked at the base class.)
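
To make the fix concrete, here is a toy map-valued Writable (Text keys and 
values only; this is not the real MapWritable code, and the internal field name 
"instance" is just borrowed for illustration) showing where the clear() goes:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Toy Text-to-Text map Writable. Without the clear() in readFields(), a
// reused instance keeps accumulating entries from every record it has ever
// deserialized, and write() then emits all of them.
public class SimpleMapWritable implements Writable {

  private final Map<Text, Text> instance = new HashMap<Text, Text>();

  public void put(Text key, Text value) {
    instance.put(key, value);
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(instance.size());
    for (Map.Entry<Text, Text> e : instance.entrySet()) {
      e.getKey().write(out);
      e.getValue().write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    // The fix: drop whatever the previous record left behind.
    instance.clear();

    int entries = in.readInt();
    for (int i = 0; i < entries; i++) {
      Text key = new Text();
      Text value = new Text();
      key.readFields(in);
      value.readFields(in);
      instance.put(key, value);
    }
  }
}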

________________________________

From: Mike Forrest [mailto:[EMAIL PROTECTED]
Sent: Thu 1/10/2008 3:20 PM
To: hadoop-user@lucene.apache.org
Subject: Re: problem with IdentityMapper



I'm using Text for the keys and MapWritable for the values.

Joydeep Sen Sarma wrote:
> what are the key value types in the Sequencefile?
> 
> It seems that the MapRunner calls createKey and createValue just once, so if 
> the value serializes out its entire allocated memory (and not just what it 
> last read), it would cause this problem.
> 
> (I have periodically shot myself in the foot with this bullet.)
>
> ________________________________
>
> From: Mike Forrest [mailto:[EMAIL PROTECTED]
> Sent: Thu 1/10/2008 2:51 PM
> To: hadoop-user@lucene.apache.org
> Subject: problem with IdentityMapper
>
>
>
> Hi,
> I'm running into a problem where IdentityMapper seems to produce way too
> much data.  For example, I have a job that reads a sequence file using
> IdentityMapper and then uses IdentityReducer to write everything back
> out to another sequence file.  My input is a ~60MB sequence file and
> after the map phase has completed, the job tracker UI reports about 10GB
> for "Map output bytes".  It seems like the output collector does not get
> properly reset, and so each record that gets emitted has the correct key but
> the value ends up being all the data you've encountered up to that
> point.  I think this is a known issue but I can't seem to find any
> discussion about it right now.  Has anyone else run into this, and if
> so, is there a solution?  I'm using the latest code in the 0.15 branch.
> Thanks
> Mike
>
>
>
>  
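
For anyone who wants to reproduce the symptom Mike describes above, the job can 
be wired up roughly like this against the old mapred API (the driver class name 
and the input/output paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Copies a SequenceFile of <Text, MapWritable> records through IdentityMapper
// and IdentityReducer; with the readFields() bug the reported "Map output
// bytes" balloons far beyond the input size.
public class IdentityCopy {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityCopy.class);
    conf.setJobName("identity-copy");

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(MapWritable.class);

    // 0.15-style path setup; newer releases use the
    // FileInputFormat/FileOutputFormat helpers instead.
    conf.setInputPath(new Path("/path/to/input"));
    conf.setOutputPath(new Path("/path/to/output"));

    JobClient.runJob(conf);
  }
}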


