[jira] Assigned: (HADOOP-54) SequenceFile should compress blocks, not individual entries

Arun C Murthy (JIRA) Wed, 19 Jul 2006 01:09:26 -0700

     [ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]


Arun C Murthy reassigned HADOOP-54:
-----------------------------------

    Assignee: Arun C Murthy  (was: Michel Tourn)

 +1 for Owen's proposal.

 An unrelated issue: the 'append' method in SequenceFile.Writer is passed 2 
Writables: key and value. The Writable interface doesn't have a 'getLength' 
method. This means one would have to write out the key/value to a temporary 
buffer to actually figure out its 'length'. The lengths are particularly 
relevant here to ensure that the key/value pair can be put into the 
keyBuffer/valueBuffer without violating the 'configured' maxBufferSize...

 To get around this issue: how about making the 'configured' bufferSize the 
'lower_bound' instead of the 'upper_bound'? This will ensure we can write out 
the key/value and then check the buffer size, and if need be go ahead and 
compress etc. This will save the construction of the temporary buffer for 
getting the key/value lengths. Related gain: it's far simpler with this scheme 
to deal with outlier/rogue keys/values which are larger than bufferSize itself.
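
 A rough sketch of the 'lower_bound' idea, to make it concrete. This is not 
the actual SequenceFile.Writer code: BlockBufferSketch, Record (a stand-in for 
Writable), minBlockSize and flushedBlockSizes are all hypothetical names, and 
the compress-and-write step is reduced to recording the block size.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not Hadoop's SequenceFile.Writer.
class BlockBufferSketch {
    // Stand-in for Writable: it can serialize itself but exposes no length.
    interface Record {
        void write(DataOutputStream out) throws IOException;
    }

    private final int minBlockSize; // treated as a lower bound, not an upper bound
    private final ByteArrayOutputStream keyBuffer = new ByteArrayOutputStream();
    private final ByteArrayOutputStream valueBuffer = new ByteArrayOutputStream();
    private final DataOutputStream keyOut = new DataOutputStream(keyBuffer);
    private final DataOutputStream valueOut = new DataOutputStream(valueBuffer);

    // Sizes of the blocks flushed so far; a real writer would compress and
    // write the buffered bytes to the file instead.
    final List<Integer> flushedBlockSizes = new ArrayList<>();

    BlockBufferSketch(int minBlockSize) {
        this.minBlockSize = minBlockSize;
    }

    // Serialize straight into the block buffers, then check the size.
    // No temporary buffer is needed to learn the record length up front.
    void append(Record key, Record value) {
        try {
            key.write(keyOut);
            value.write(valueOut);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (keyBuffer.size() + valueBuffer.size() >= minBlockSize) {
            flushBlock();
        }
    }

    // Emit whatever is buffered as one block and start a new one.
    void flushBlock() {
        int size = keyBuffer.size() + valueBuffer.size();
        if (size == 0) {
            return;
        }
        flushedBlockSizes.add(size);
        keyBuffer.reset();
        valueBuffer.reset();
    }
}
```

 Note that an oversized key/value needs no special case here: it simply ends 
up in a block that is larger than minBlockSize.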

 Logical next step: make this 'bufferSize' configurable per SequenceFile; this 
will let applications control it depending on the sizes of their keys/values. I 
propose to introduce a new constructor with this as an argument for 
SequenceFile.Writer. This will then be written out as a part of the file-header 
(along with compression details) and the SequenceFile.Reader can pick this up 
and read accordingly. (Of course there will be a system-wide default if 
unspecified per file).
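
 Roughly what I have in mind for the header, as a sketch only. The layout 
below (a compression flag followed by the block size) is made up for 
illustration and is not the real SequenceFile header format; HeaderSketch, 
DEFAULT_BLOCK_SIZE and its value are hypothetical as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch only; the real header layout is not specified here.
class HeaderSketch {
    // Assumed system-wide default, used when the application gives no size.
    static final int DEFAULT_BLOCK_SIZE = 1_000_000;

    // Writer side: fall back to the default block size.
    static byte[] writeHeader(boolean compressed) {
        return writeHeader(compressed, DEFAULT_BLOCK_SIZE);
    }

    // Writer side: record the per-file block size next to the compression flag.
    static byte[] writeHeader(boolean compressed, int blockSize) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(compressed);
            out.writeInt(blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reader side: pick the block size up from the header.
    static int readBlockSize(byte[] header) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            in.readBoolean(); // skip the compression flag
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```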

 Thoughts?

thanks,
Arun

 

> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.5.0
>
>
> SequenceFile will optionally compress individual values.  But both 
> compression and performance would be much better if sequences of keys and 
> values are compressed together.  Sync marks should only be placed between 
> blocks.  This will require some changes to MapFile too, so that all file 
> positions stored there are the positions of blocks, not entries within 
> blocks.  Probably this can be accomplished by adding a 
> getBlockStartPosition() method to SequenceFile.Writer.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
