[ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]
Arun C Murthy reassigned HADOOP-54:
-----------------------------------
Assignee: Arun C Murthy (was: Michel Tourn)
+1 for Owen's proposal.
An unrelated issue: the 'append' method in SequenceFile.Writer is passed 2
Writables: key and value. The Writable interface doesn't have a 'getLength'
method. This means one would have to write out the key/value to a temporary
buffer to actually figure out its 'length'. The lengths are particularly
relevant here to ensure that the key/value pair can be put into the
keyBuffer/valueBuffer without violating the 'configured' maxBufferSize...
To get around this issue: how about making the 'configured' bufferSize the
'lower_bound' instead of the 'upper_bound'? This will ensure we can write out
the key/value and then check the buffer size, and if need be go ahead and
compress etc. This will save the construction of the temporary buffer for
getting the key/value lengths. Related gain: it's far simpler with this scheme
to deal with outlier/rogue keys/values which are larger than bufferSize itself.
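
A rough sketch of the lower-bound idea, to make the control flow concrete. This
is not the actual SequenceFile.Writer code: SimpleWritable is a hypothetical
stand-in for Writable, and LowerBoundBlockWriter, minBlockSize and flushBlock
are names made up for illustration. The point is only that each key/value is
serialized straight into the buffers and the size is checked afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Writable; only the write side matters here.
interface SimpleWritable {
    void write(DataOutputStream out) throws IOException;
}

// Illustrative only: treats the configured size as a lower bound per block.
class LowerBoundBlockWriter {
    private final int minBlockSize;
    private final ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
    private final ByteArrayOutputStream valBytes = new ByteArrayOutputStream();
    private final DataOutputStream keyBuffer = new DataOutputStream(keyBytes);
    private final DataOutputStream valBuffer = new DataOutputStream(valBytes);
    private int recordsInBlock = 0;
    private int blocksFlushed = 0;

    LowerBoundBlockWriter(int minBlockSize) {
        this.minBlockSize = minBlockSize;
    }

    void append(SimpleWritable key, SimpleWritable value) {
        try {
            // Serialize directly into the block buffers: no temporary
            // buffer is needed just to learn the record's length.
            key.write(keyBuffer);
            value.write(valBuffer);
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        recordsInBlock++;
        // Check the size only after writing. An oversized record simply
        // ends up in a block of its own (or closes the current one).
        if (getBufferedBytes() >= minBlockSize) {
            flushBlock();
        }
    }

    void flushBlock() {
        if (recordsInBlock == 0) {
            return;
        }
        // A real writer would compress the buffers and write them out,
        // preceded by a sync mark, at this point.
        blocksFlushed++;
        recordsInBlock = 0;
        keyBytes.reset();
        valBytes.reset();
    }

    int getBufferedBytes() {
        return keyBytes.size() + valBytes.size();
    }

    int getBlocksFlushed() {
        return blocksFlushed;
    }
}
```

Note that a key/value larger than minBlockSize needs no special case in this
sketch: it is written like any other record and triggers a flush right away.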
Logical next step: make this 'bufferSize' configurable per SequenceFile; this
will let applications control it depending on the sizes of their keys/values. I
propose to introduce a new constructor with this as an argument for
SequenceFile.Writer. This will then be written out as a part of the file-header
(along with compression details) and the SequenceFile.Reader can pick this up
and read accordingly. (Of course there will be a system-wide default if
unspecified per file).
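
To illustrate the header round trip, here is a minimal sketch. The layout below
(one compression flag followed by the block size) is an assumption made up for
this example and is not the real SequenceFile header format; SketchHeader,
DEFAULT_BLOCK_SIZE and its value are likewise hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: a toy header carrying a per-file block size.
class SketchHeader {
    // Hypothetical system-wide default, used when the caller gives no size.
    static final int DEFAULT_BLOCK_SIZE = 1000000;

    final boolean compressed;
    final int blockSize;

    // Existing-style constructor: falls back to the default block size.
    SketchHeader(boolean compressed) {
        this(compressed, DEFAULT_BLOCK_SIZE);
    }

    // Proposed-style constructor: block size chosen per file.
    SketchHeader(boolean compressed, int blockSize) {
        this.compressed = compressed;
        this.blockSize = blockSize;
    }

    // Writer side: record the compression flag and block size in the header.
    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(compressed);
            out.writeInt(blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reader side: pick the block size up from the header.
    static SketchHeader fromBytes(byte[] data) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            boolean compressed = in.readBoolean();
            int blockSize = in.readInt();
            return new SketchHeader(compressed, blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something along these lines the reader never needs to consult its own
configuration for the block size; it uses whatever the writer recorded.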
Thoughts?
thanks,
Arun
> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
> Key: HADOOP-54
> URL: http://issues.apache.org/jira/browse/HADOOP-54
> Project: Hadoop
> Issue Type: Improvement
> Components: io
> Affects Versions: 0.2.0
> Reporter: Doug Cutting
> Assigned To: Arun C Murthy
> Fix For: 0.5.0
>
>
> SequenceFile will optionally compress individual values. But both
> compression and performance would be much better if sequences of keys and
> values are compressed together. Sync marks should only be placed between
> blocks. This will require some changes to MapFile too, so that all file
> positions stored there are the positions of blocks, not entries within
> blocks. Probably this can be accomplished by adding a
> getBlockStartPosition() method to SequenceFile.Writer.