[ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]
Arun C Murthy reassigned HADOOP-54:
-----------------------------------

    Assignee: Arun C Murthy  (was: Michel Tourn)

+1 for Owen's proposal.

An unrelated issue: the 'append' method in SequenceFile.Writer is passed 2 Writables: key and value. The Writable interface doesn't have a 'getLength' method. This means one would have to write out the key/value to a temporary buffer just to figure out its length. The lengths are particularly relevant here to ensure that the key/value pair can be put into the keyBuffer/valueBuffer without violating the configured maxBufferSize.

To get around this issue: how about making the configured bufferSize the lower bound instead of the upper bound? That way we can write out the key/value first, then check the buffer size, and if need be go ahead and compress etc. This saves constructing the temporary buffer just to get the key/value lengths.

Related gain: with this scheme it's far simpler to deal with outlier/rogue keys/values which are larger than bufferSize itself.

Logical next step: make this bufferSize configurable per SequenceFile, which will let applications control it depending on the sizes of their keys/values. I propose to introduce a new constructor for SequenceFile.Writer with this as an argument. The value will then be written out as part of the file header (along with the compression details), and SequenceFile.Reader can pick it up and read accordingly. (Of course there will be a system-wide default if it is unspecified per file.)

Thoughts?

thanks,
Arun

> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.5.0
>
>
> SequenceFile will optionally compress individual values.
> But both compression and performance would be much better if sequences of keys and
> values are compressed together. Sync marks should only be placed between
> blocks. This will require some changes to MapFile too, so that all file
> positions stored there are the positions of blocks, not entries within
> blocks. Probably this can be accomplished by adding a
> getBlockStartPosition() method to SequenceFile.Writer.
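
For illustration only, the lower-bound buffering idea from the comment above could look roughly like the following Java sketch. It is not the SequenceFile.Writer code: the class LowerBoundBlockBuffer, the Record interface (a stand-in for Writable, which can serialize itself but cannot report its length), and the minBlockSize parameter are hypothetical names, and "compressing and writing a block" is reduced to recording the block's size.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: buffers records and flushes once a minimum size is reached. */
final class LowerBoundBlockBuffer {

    /** Stand-in for Writable: serializes itself, but has no getLength(). */
    public interface Record {
        void write(DataOutputStream out) throws IOException;
    }

    private final int minBlockSize; // treated as a lower bound, not an upper bound
    private final ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
    private final ByteArrayOutputStream valueBytes = new ByteArrayOutputStream();
    private final DataOutputStream keyOut = new DataOutputStream(keyBytes);
    private final DataOutputStream valueOut = new DataOutputStream(valueBytes);
    private final List<Integer> flushedBlockSizes = new ArrayList<>();

    LowerBoundBlockBuffer(int minBlockSize) {
        this.minBlockSize = minBlockSize;
    }

    /** Serialize first, check the size afterwards: no temporary buffer is needed. */
    void append(Record key, Record value) throws IOException {
        key.write(keyOut);
        value.write(valueOut);
        if (pendingBytes() >= minBlockSize) {
            flushBlock();
        }
    }

    /** A real writer would compress both buffers and write them out here. */
    void flushBlock() {
        int size = pendingBytes();
        if (size == 0) {
            return; // nothing buffered
        }
        flushedBlockSizes.add(size);
        keyBytes.reset();
        valueBytes.reset();
    }

    int pendingBytes() {
        return keyBytes.size() + valueBytes.size();
    }

    List<Integer> flushedBlockSizes() {
        return flushedBlockSizes;
    }
}
```

Because the size check happens after the record has been serialized, a block can overshoot minBlockSize by up to one record. That is also why an oversized key/value needs no special case in this sketch: it simply ends up in a block larger than the configured size.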