[
http://issues.apache.org/jira/browse/HADOOP-54?page=comments#action_12422072 ]
Arun C Murthy commented on HADOOP-54:
-------------------------------------
Actually we have an 'oops' here... afaics there isn't a way to even construct
<key1-length><key1-bytes><key2-length><key2-bytes><key3-length><key3-bytes>
<value1-length><value1-bytes><value2-length><value2-bytes><value3-length><value3-bytes>
without constructing a temporary object from each Writable key/value, since
the Writable interface gives us no way to find out a key's/value's serialized
length before actually writing it.
I feel constructing a temporary object would be a huge overhead, i.e. it
introduces an extra copy: first from the Writable to a temp buffer (just to
learn the length), and then from the temp buffer to keyBuffer/valueBuffer for
compression...
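To make that overhead concrete, this is roughly what the temp-buffer route
looks like per record (just a sketch; keyBuffer/tempBuffer are stand-in names
for growable buffers along the lines of io's DataOutputBuffer):

DataOutputBuffer tempBuffer = new DataOutputBuffer();

tempBuffer.reset();
key.write(tempBuffer);                        // copy #1: Writable -> tempBuffer
int keyLength = tempBuffer.getLength();       // only now do we know the length

keyBuffer.writeInt(keyLength);                // <key-length>
keyBuffer.write(tempBuffer.getData(), 0, keyLength); // copy #2: temp -> keyBuffer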
Any way to do this without an extra copy?
-*-*-
One alternative I can think of to avoid the extra copy is slightly convoluted,
though it still won't be able to construct Owen's proposal as-is. (It will look
very similar to my older proposal.)
Maintain 2 auxiliary arrays which keep track of the actual key/value lengths.
The way to maintain the lengths is to compute the difference in size of
keyBuffer/valueBuffer before and after the insertion of each key/value, and
store that difference.
E.g.
int oldKeyBufferLength = keyBuffer.length();
key.write(keyBuffer);
int newKeyBufferLength = keyBuffer.length();
keySizes.add(newKeyBufferLength - oldKeyBufferLength); // save the last key's size

// ... same for 'val' ...

// Note: minBufferSize is a lower bound rather than an upper bound; we flush
// once the buffers *exceed* the minimum block size.
if ((newKeyBufferLength + newValueBufferLength) > minBufferSize) {
  // Compress both keyBuffer and valueBuffer
  // Write out the keySizes array to disk in zero-compressed format
  // Write out the valueSizes array to disk in zero-compressed format
  // Write out compressedKeyBufferSize in zero-compressed format
  // Write out the compressed keyBuffer
  // Write out compressedValueBufferSize in zero-compressed format
  // Write out the compressed valueBuffer
  // Reset keyBuffer, valueBuffer, keySizes and valueSizes
} else {
  // Block not full yet; nothing to flush.
  return;
}
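Fleshing that out, here's a rough end-to-end sketch of the append/flush path.
To be clear about the assumptions: BlockWriter, minBufferSize and the flush
layout are just the illustrative names from above, not the final SequenceFile
code; java.util.zip.Deflater is a stand-in codec; and the zero-compressed
encoding shown is one possible scheme, not necessarily the exact format we'd
end up with.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

import org.apache.hadoop.io.Writable;

// Rough sketch of the proposed block-compressing writer (illustrative only).
public class BlockWriter {

  private final ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
  private final ByteArrayOutputStream valueBytes = new ByteArrayOutputStream();
  private final DataOutputStream keyBuffer = new DataOutputStream(keyBytes);
  private final DataOutputStream valueBuffer = new DataOutputStream(valueBytes);
  private final List<Integer> keySizes = new ArrayList<Integer>();
  private final List<Integer> valueSizes = new ArrayList<Integer>();
  private final DataOutputStream out;
  private final int minBufferSize;   // lower bound: flush once exceeded

  public BlockWriter(DataOutputStream out, int minBufferSize) {
    this.out = out;
    this.minBufferSize = minBufferSize;
  }

  public void append(Writable key, Writable value) throws IOException {
    int oldKeyLength = keyBytes.size();
    key.write(keyBuffer);            // serialize straight into the block buffer
    keySizes.add(keyBytes.size() - oldKeyLength);   // no temp object, no copy

    int oldValueLength = valueBytes.size();
    value.write(valueBuffer);
    valueSizes.add(valueBytes.size() - oldValueLength);

    // minBufferSize is a lower bound: flush once the block exceeds it.
    if (keyBytes.size() + valueBytes.size() > minBufferSize) {
      flushBlock();
    }
  }

  private void flushBlock() throws IOException {
    writeSizes(keySizes);                        // key lengths
    writeSizes(valueSizes);                      // value lengths
    writeCompressed(keyBytes.toByteArray());     // <size><compressed keys>
    writeCompressed(valueBytes.toByteArray());   // <size><compressed values>
    keyBytes.reset();
    valueBytes.reset();
    keySizes.clear();
    valueSizes.clear();
  }

  private void writeSizes(List<Integer> sizes) throws IOException {
    writeZeroCompressedInt(sizes.size());
    for (int i = 0; i < sizes.size(); i++) {
      writeZeroCompressedInt(sizes.get(i));
    }
  }

  private void writeCompressed(byte[] raw) throws IOException {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    while (!deflater.finished()) {
      compressed.write(chunk, 0, deflater.deflate(chunk));
    }
    deflater.end();
    writeZeroCompressedInt(compressed.size());
    compressed.writeTo(out);
  }

  // One possible zero-compressed encoding (illustrative only): a count of
  // significant bytes, followed by those bytes, most-significant first.
  private void writeZeroCompressedInt(int value) throws IOException {
    int len = 4;
    while (len > 1 && (value >>> ((len - 1) * 8)) == 0) {
      len--;                         // strip leading zero bytes
    }
    out.writeByte(len);
    for (int i = len - 1; i >= 0; i--) {
      out.writeByte((value >>> (i * 8)) & 0xFF);
    }
  }
}

A reader would invert this: read and decompress both buffers for a block,
then walk them in lock-step using the two size arrays.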
-*-*-
Appreciate any inputs/alternatives/refinements...
thanks,
Arun
> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
> Key: HADOOP-54
> URL: http://issues.apache.org/jira/browse/HADOOP-54
> Project: Hadoop
> Issue Type: Improvement
> Components: io
> Affects Versions: 0.2.0
> Reporter: Doug Cutting
> Assigned To: Arun C Murthy
> Fix For: 0.5.0
>
>
> SequenceFile will optionally compress individual values. But both
> compression and performance would be much better if sequences of keys and
> values are compressed together. Sync marks should only be placed between
> blocks. This will require some changes to MapFile too, so that all file
> positions stored there are the positions of blocks, not entries within
> blocks. Probably this can be accomplished by adding a
> getBlockStartPosition() method to SequenceFile.Writer.