[jira] [Created] (HADOOP-7909) Implement Splittable Gzip based on a signature in a gzip header field

Tim Broberg (Created) (JIRA) Sat, 10 Dec 2011 11:00:08 -0800

Implement Splittable Gzip based on a signature in a gzip header field
---------------------------------------------------------------------


                 Key: HADOOP-7909
                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
             Project: Hadoop Common
          Issue Type: New Feature
          Components: io
            Reporter: Tim Broberg
            Priority: Minor


I propose to take the suggestion of PIG-42 and extend it to
 - add a more robust header such that false matches are vanishingly unlikely
 - repeat initial bytes of the header for very fast split searching
 - break down the stream into modest size chunks (~64k?) for rapid parallel 
encode and decode
 - provide length information on the blocks in advance to make block decode 
possible in hardware

An optional extra header would be added to the gzip header, adding 36 bytes.

<sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
<version> := 1 byte version field allowing us to later adjust the header 
definition
<signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where each 
letter represents a randomly generated byte
<uncompressedDataLength> := 32-bit length of the data compressed into this 
record
<compressedRecordLength> := 32-bit length of this record as compressed, 
including all headers, trailers
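
As a rough illustration of the layout above, the 32-byte record could be packed 
and parsed as follows. The big-endian byte order, the helper names, and the use 
of a freshly generated signature are assumptions made for this sketch; the 
proposal fixes only the field sizes, and a real implementation would presumably 
use one signature chosen once and published with the format.

```python
import os
import struct

VERSION = 1
# Assumed wire layout: version (1), signature (23), two 32-bit lengths.
# ">" selects big-endian with no padding, so the size is exactly 32 bytes.
SPLIT_HEADER = struct.Struct(">B23sII")


def make_signature(rng=os.urandom):
    """Build a 23-byte signature: one random byte repeated 7 times,
    followed by 16 further random bytes (the 'aaaaaaabcdefghijklmnopr' shape)."""
    return rng(1) * 7 + rng(16)


def pack_split_header(signature, uncompressed_len, compressed_record_len):
    """Serialize <version><signature><uncompressedDataLength><compressedRecordLength>."""
    return SPLIT_HEADER.pack(VERSION, signature, uncompressed_len, compressed_record_len)


def unpack_split_header(buf):
    """Parse a 32-byte split header into (version, signature, ulen, clen)."""
    return SPLIT_HEADER.unpack(buf)
```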

If multiple extra headers are present and the split header is not the first 
header, the initial implementation will not recognize the split.

Input streams would be broken down into blocks which are appended, much as 
BlockCompressorStream does. Non-split-aware decoders will ignore this header 
and decode the appended blocks without ever noticing the difference.
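
The property relied on here is that a gzip stream may consist of several 
members appended back to back. A small sketch using Python's standard gzip 
module (no split header involved, and the function name is made up for the 
example) shows a stock decoder reading such a stream as one piece of data:

```python
import gzip


def compress_in_blocks(data, block_size=64 * 1024):
    """Compress each block as an independent gzip member and append the members."""
    return b"".join(
        gzip.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    )


# A decoder that knows nothing about the block boundaries still recovers the input.
original = b"splittable gzip " * 20000
assert gzip.decompress(compress_in_blocks(original)) == original
```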

The signature has >= 132 bits of entropy which is sufficient for 80+ years of 
Moore's law before collisions become a significant concern.

The first 7 bytes are repeated for speed. When splitting, the signature search 
will look for the 32-bit value aaaa every 4 bytes until a hit is found. The 
next 4 bytes then give the alignment of the header mod 4, which identifies a 
potential header match, and the whole header is validated at that offset. So, 
there is a load, compare, branch, and increment per 4 bytes searched.
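
A minimal sketch of that search, assuming the signature's first 7 bytes are 
identical as described. Because 7 repeated bytes always cover exactly one 
aligned 4-byte word, checking aligned words is enough to find every signature. 
For simplicity this version validates each of the (at most four) candidate 
offsets around a hit instead of decoding the alignment from the following 4 
bytes; the function name and parameters are hypothetical.

```python
def find_signature(buf, signature, start=0):
    """Return the offset of the first full signature at or after `start`, or -1.

    Assumes signature[:7] is a single byte value repeated 7 times.
    """
    word = signature[:4]              # the repeated byte, four times ("aaaa")
    n = len(signature)
    pos = (start + 3) & ~3            # round up to the next 4-byte boundary
    while pos + 4 <= len(buf):
        if buf[pos:pos + 4] == word:  # one compare per 4 bytes scanned
            # The signature can begin 0..3 bytes before the aligned hit.
            for cand in range(max(pos - 3, start), pos + 1):
                if buf[cand:cand + n] == signature:
                    return cand
        pos += 4
    return -1
```

The returned offset is that of the signature; under the layout above the 
version byte would sit one byte earlier.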

The existing gzip implementations do not provide access to the optional header 
fields (nor to the comment or filename), so the entire gzip header will have to 
be reimplemented and compression will need to be done using the raw deflate 
options of the native library / built-in deflater.
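
To make the idea concrete, here is a sketch in Python (standing in for the 
Java / native code the real patch would need) that writes the gzip header and 
trailer by hand per RFC 1952 and uses raw deflate for the body. The function 
names and the subfield ID bytes "S" and "Z" are placeholders invented for the 
example, not part of this proposal.

```python
import gzip
import struct
import zlib


def extra_subfield(payload, si1=b"S", si2=b"Z"):
    """Wrap payload in an RFC 1952 extra subfield: SI1, SI2, 2-byte LEN, data.
    The ID bytes are hypothetical placeholders."""
    return si1 + si2 + struct.pack("<H", len(payload)) + payload


def gzip_member_with_extra(data, extra, level=6):
    """Build one gzip member whose header carries `extra` in the FEXTRA field."""
    # Negative wbits asks zlib for raw deflate: no zlib header or trailer.
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(data) + comp.flush()
    header = (
        b"\x1f\x8b"                        # ID1, ID2 magic bytes
        + b"\x08"                          # CM = 8 (deflate)
        + b"\x04"                          # FLG with only FEXTRA set
        + struct.pack("<I", 0)             # MTIME = 0 (no timestamp)
        + b"\x00"                          # XFL
        + b"\xff"                          # OS = 255 (unknown)
        + struct.pack("<H", len(extra))    # XLEN
        + extra
    )
    # Trailer: CRC-32 and length (mod 2^32) of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)
    return header + body + trailer


# A stock decoder skips the extra field and still decodes the data.
member = gzip_member_with_extra(b"hello " * 100, extra_subfield(b"\x00" * 32))
assert gzip.decompress(member) == b"hello " * 100
```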

There will be some degradation when using splittable gzip:
 - The gzip headers will each be 36 bytes larger (a 4-byte header for the 
extra field plus the 32-byte extra header itself).
 - There will be one gzip header per block.
 - The deflate history will have to be reset with each block to allow decoding 
to start from scratch at that offset, so some bytes will be emitted as literals 
that would otherwise have been encoded as matches against earlier strings.
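
The size cost of these three effects can be estimated empirically. The sketch 
below compares one gzip stream against the blocked form, assuming a 10-byte 
base gzip header, an 8-byte trailer, and the 36 extra bytes per block quoted 
above. The function names are made up for the example, and the actual 
difference depends entirely on the data.

```python
import zlib

BASE_HEADER = 10   # fixed part of a gzip header
TRAILER = 8        # CRC-32 plus uncompressed length
SPLIT_EXTRA = 36   # extra bytes per block, per the figures above


def raw_deflate(data, level=6):
    """Compress with raw deflate (no zlib or gzip framing)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()


def single_stream_size(data):
    """Size of an ordinary one-member gzip file."""
    return BASE_HEADER + TRAILER + len(raw_deflate(data))


def blocked_size(data, block_size=64 * 1024):
    """Size when each block is its own member with a split header.
    Compressing blocks independently models the history reset."""
    total = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        total += BASE_HEADER + SPLIT_EXTRA + TRAILER + len(raw_deflate(block))
    return total
```

Calling both functions on a representative input file gives the overhead for 
a given block size, which also bears on the block-size question.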

Issues to consider:
 - Is the searching fast enough without the repeating 7 bytes in the signature?
 - Should this be a patch to the existing gzip classes to add a switch, or 
should this be a whole new class?
 - Which level does this belong at? CompressionStream? Compressor?
 - Is it more advantageous to encode the signature into the less dense comment 
field?
 - Optimum block size? Smaller blocks split faster and may conserve memory; 
larger blocks provide a slightly better compression ratio.



