[
http://issues.apache.org/jira/browse/HADOOP-54?page=comments#action_12422072 ]
Arun C Murthy commented on HADOOP-54:
-------------------------------------
Actually we have an 'oops' here... afaics there isn't a way to even construct
<key1-length><key1-bytes><key2-length><key2-bytes><key3-length><key3-bytes>
<value1-length><value1-bytes><value2-length><value2-bytes><value3-length><value3-bytes>
without constructing a temporary object from each Writable key/value, since
the Writable interface gives us no way to find out a key's/value's serialized
length before actually writing it.
I feel constructing a temporary object would be a huge overhead, i.e. it
introduces an extra copy: first from the Writable to a temp buffer (just to
learn the length), and then from the temp buffer to keyBuffer/valueBuffer for
compression...
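To make that overhead concrete, this is roughly what the temp-buffer route
looks like per record (just a sketch; keyBuffer/tempBuffer are stand-in names
for growable buffers along the lines of io's DataOutputBuffer):

DataOutputBuffer tempBuffer = new DataOutputBuffer();

tempBuffer.reset();
key.write(tempBuffer);                        // copy #1: Writable -> tempBuffer
int keyLength = tempBuffer.getLength();       // only now do we know the length

keyBuffer.writeInt(keyLength);                // <key-length>
keyBuffer.write(tempBuffer.getData(), 0, keyLength); // copy #2: temp -> keyBuffer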
Any way to do this without an extra copy?
-*-*-
One alternative I can think of to avoid the extra copy is slightly convoluted,
though it still won't be able to construct Owen's proposal as-is. (It will look
very similar to my older proposal.)
Maintain 2 auxiliary arrays which keep track of the actual key/value lengths.
The way to maintain the lengths is to compute the difference in size of
keyBuffer/valueBuffer before and after the insertion of each key/value, and
store that difference.
E.g.
int oldKeyBufferLength = keyBuffer.length();
key.write(keyBuffer);
int newKeyBufferLength = keyBuffer.length();
keySizes.add(newKeyBufferLength - oldKeyBufferLength); // save the last key's size

// ... same for 'val' ...

// Note: minBufferSize is a lower bound rather than an upper bound; we flush
// once the buffers *exceed* the minimum block size.
if ((newKeyBufferLength + newValueBufferLength) > minBufferSize) {
  // Compress both keyBuffer and valueBuffer
  // Write out the keySizes array to disk in zero-compressed format
  // Write out the valueSizes array to disk in zero-compressed format
  // Write out compressedKeyBufferSize in zero-compressed format
  // Write out the compressed keyBuffer
  // Write out compressedValueBufferSize in zero-compressed format
  // Write out the compressed valueBuffer
  // Reset keyBuffer, valueBuffer, keySizes and valueSizes
} else {
  // Block not full yet; nothing to flush.
  return;
}
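Fleshing that out, here's a rough end-to-end sketch of the append/flush path.
To be clear about the assumptions: BlockWriter, minBufferSize and the flush
layout are just the illustrative names from above, not the final SequenceFile
code; java.util.zip.Deflater is a stand-in codec; and the zero-compressed
encoding shown is one possible scheme, not necessarily the exact format we'd
end up with.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

import org.apache.hadoop.io.Writable;

// Rough sketch of the proposed block-compressing writer (illustrative only).
public class BlockWriter {

  private final ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
  private final ByteArrayOutputStream valueBytes = new ByteArrayOutputStream();
  private final DataOutputStream keyBuffer = new DataOutputStream(keyBytes);
  private final DataOutputStream valueBuffer = new DataOutputStream(valueBytes);
  private final List<Integer> keySizes = new ArrayList<Integer>();
  private final List<Integer> valueSizes = new ArrayList<Integer>();
  private final DataOutputStream out;
  private final int minBufferSize;   // lower bound: flush once exceeded

  public BlockWriter(DataOutputStream out, int minBufferSize) {
    this.out = out;
    this.minBufferSize = minBufferSize;
  }

  public void append(Writable key, Writable value) throws IOException {
    int oldKeyLength = keyBytes.size();
    key.write(keyBuffer);            // serialize straight into the block buffer
    keySizes.add(keyBytes.size() - oldKeyLength);   // no temp object, no copy

    int oldValueLength = valueBytes.size();
    value.write(valueBuffer);
    valueSizes.add(valueBytes.size() - oldValueLength);

    // minBufferSize is a lower bound: flush once the block exceeds it.
    if (keyBytes.size() + valueBytes.size() > minBufferSize) {
      flushBlock();
    }
  }

  private void flushBlock() throws IOException {
    writeSizes(keySizes);                        // key lengths
    writeSizes(valueSizes);                      // value lengths
    writeCompressed(keyBytes.toByteArray());     // <size><compressed keys>
    writeCompressed(valueBytes.toByteArray());   // <size><compressed values>
    keyBytes.reset();
    valueBytes.reset();
    keySizes.clear();
    valueSizes.clear();
  }

  private void writeSizes(List<Integer> sizes) throws IOException {
    writeZeroCompressedInt(sizes.size());
    for (int i = 0; i < sizes.size(); i++) {
      writeZeroCompressedInt(sizes.get(i));
    }
  }

  private void writeCompressed(byte[] raw) throws IOException {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    while (!deflater.finished()) {
      compressed.write(chunk, 0, deflater.deflate(chunk));
    }
    deflater.end();
    writeZeroCompressedInt(compressed.size());
    compressed.writeTo(out);
  }

  // One possible zero-compressed encoding (illustrative only): a count of
  // significant bytes, followed by those bytes, most-significant first.
  private void writeZeroCompressedInt(int value) throws IOException {
    int len = 4;
    while (len > 1 && (value >>> ((len - 1) * 8)) == 0) {
      len--;                         // strip leading zero bytes
    }
    out.writeByte(len);
    for (int i = len - 1; i >= 0; i--) {
      out.writeByte((value >>> (i * 8)) & 0xFF);
    }
  }
}

A reader would invert this: read and decompress both buffers for a block,
then walk them in lock-step using the two size arrays.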
-*-*-
Appreciate any inputs/alternatives/refinements...
thanks,
Arun
> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
> Key: HADOOP-54
> URL: http://issues.apache.org/jira/browse/HADOOP-54
> Project: Hadoop
> Issue Type: Improvement
> Components: io
> Affects Versions: 0.2.0
> Reporter: Doug Cutting
> Assigned To: Arun C Murthy
> Fix For: 0.5.0
>
>
> SequenceFile will optionally compress individual values. But both
> compression and performance would be much better if sequences of keys and
> values are compressed together. Sync marks should only be placed between
> blocks. This will require some changes to MapFile too, so that all file
> positions stored there are the positions of blocks, not entries within
> blocks. Probably this can be accomplished by adding a
> getBlockStartPosition() method to SequenceFile.Writer.