    [ http://issues.apache.org/jira/browse/HADOOP-611?page=comments#action_12445297 ]

Devaraj Das commented on HADOOP-611:
------------------------------------

In the current merge code, 'merge-factor' keys and values are kept in memory at the same time. While implementing this, one idea was to avoid holding all 'merge-factor' values in memory simultaneously and instead fetch each value only when it is needed: when the user of the merge code calls next() on the MergeQueue to fetch a key/value pair, the system loads into memory only the value corresponding to the 'minimum' key, deferring the load until that point. I have implemented this for Compression = NONE and RECORD. For BLOCK compression, the code for not proactively loading values is already there, controlled by a boolean "lazyDecompression", so nothing extra needs to be done. The catch is that lazyDecompression is controlled via the hadoop config (defaulting to true). I was thinking about whether it makes sense to remove this configurable item and have it always be true. Any objection to this?

> SequenceFile.Sorter should have a merge method that returns an iterator
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-611
>                 URL: http://issues.apache.org/jira/browse/HADOOP-611
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.8.0
>
>
> SequenceFile.Sorter should get a new merge method that returns an iterator over the keys/values.
> The current merge method should become a simple method that gets the iterator and writes the records out to a file.
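
To make the lazy-value idea in the comment concrete, here is a minimal, self-contained sketch of a key-ordered merge queue that materializes a record's value only when the minimum entry is actually consumed. It is an illustration only, not the real SequenceFile.Sorter/MergeQueue code: the LazyMergeSketch and Segment class names, the Supplier-backed values, and the String key/value types are all assumptions made for the example.

import java.util.*;
import java.util.function.Supplier;

/**
 * Minimal sketch of the "lazy value" merge idea: the merge queue is ordered
 * on keys only, and a record's value is materialized only when the consumer
 * asks for the current minimum. Names here are illustrative, not Hadoop's.
 */
public class LazyMergeSketch {

    /** One sorted input segment; values are produced on demand. */
    static final class Segment {
        private final Iterator<Map.Entry<String, Supplier<String>>> records;
        private String key;
        private Supplier<String> lazyValue;

        Segment(List<Map.Entry<String, Supplier<String>>> sortedRecords) {
            this.records = sortedRecords.iterator();
        }

        /** Advance to the next key; the value stays unread for now. */
        boolean next() {
            if (!records.hasNext()) return false;
            Map.Entry<String, Supplier<String>> e = records.next();
            key = e.getKey();
            lazyValue = e.getValue();
            return true;
        }

        String key()   { return key; }
        String value() { return lazyValue.get(); }  // the value is loaded only here
    }

    /** Merge ordered on keys; only the segment at the head has its value read. */
    static void merge(List<Segment> segments) {
        PriorityQueue<Segment> queue =
            new PriorityQueue<>(Comparator.comparing(Segment::key));
        for (Segment s : segments) {
            if (s.next()) queue.add(s);
        }
        while (!queue.isEmpty()) {
            Segment min = queue.poll();
            // This is the only point where a value is actually materialized.
            System.out.println(min.key() + " -> " + min.value());
            if (min.next()) queue.add(min);
        }
    }

    public static void main(String[] args) {
        Segment a = new Segment(List.of(
            Map.entry("apple",  (Supplier<String>) () -> "A1"),
            Map.entry("cherry", (Supplier<String>) () -> "A2")));
        Segment b = new Segment(List.of(
            Map.entry("banana", (Supplier<String>) () -> "B1"),
            Map.entry("date",   (Supplier<String>) () -> "B2")));
        merge(List.of(a, b));
    }
}

Ordering the priority queue on keys alone is what lets each segment's value stay unread until poll() returns the minimum; this mirrors the deferral described above for Compression = NONE and RECORD, and the behavior that lazyDecompression already provides for BLOCK compression.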
In the current merge code, 'merge-factor' number of keys & values are kept in memory. While implementing this, one thought was that we can prevent all the 'merge-factor' values from being in memory at the same time and fetch them when needed. When the user of the merge code does a next() on the MergeQueue to fetch the key/value, the system loads in memory the value corresponding to the 'minimum' key and defers the loading of the value until then. Implemented this for Compression = NONE & RECORD. However, for BLOCK compression, the code for not proactively loading values is already there and controlled by a boolean "lazyDecompression" and nothing extra needs to be done. The thing is lazyDecompression is controlled via hadoop config (defaulting to true). I was thinking whether it makes good sense to remove this configurable item and have it as true always. Any objection to this? > SequenceFile.Sorter should have a merge method that returns an iterator > ----------------------------------------------------------------------- > > Key: HADOOP-611 > URL: http://issues.apache.org/jira/browse/HADOOP-611 > Project: Hadoop > Issue Type: New Feature > Components: io > Reporter: Owen O'Malley > Assigned To: Devaraj Das > Fix For: 0.8.0 > > > SequenceFile.Sorter should get a new merge method that returns an iterator > over the keys/values. > The current merge method should become a simple method that gets the iterator > and writes the records out to a file. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira