[jira] [Commented] (HADOOP-7909) Implement Splittable Gzip based on a signature in a gzip header field

Tim Broberg (Commented) (JIRA) Sat, 10 Dec 2011 19:27:06 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167031#comment-13167031
 ]


Tim Broberg commented on HADOOP-7909:
-------------------------------------

> A "normal" gzipped file won't become splittable with this codec. The way I 
> interpret what I've read so far is that this comes very close to defining a 
> new fileformat: a block compressed deflate file that due to the clever way it 
> is stored can also be read by any existing gzip decompression tool.

I'd rather say a new option than a new format, but I would agree with 
everything else here. Splittable streams / files decode normally, normal 
streams / files would not be rendered splittable under this scheme.
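
As a rough illustration of the "decodes normally" claim, the sketch below 
writes a stream as a series of gzip members, each carrying the proposed split 
header in the gzip extra field, and reads it back with Python's stock gzip 
module. The sub-field ID "SZ", the placeholder signature bytes, the 
little-endian length fields, and the function names are assumptions made for 
this example only; none of them are fixed by the proposal.

```python
import gzip
import struct
import zlib

# Placeholder values, not part of the proposal: a real signature would use
# randomly generated bytes, and the sub-field ID would have to be chosen.
SIGNATURE = bytes([0xA7] * 7) + bytes(range(0x10, 0x20))  # 7 repeated + 16 = 23 bytes
SUBFIELD_ID = b"SZ"
VERSION = 1


def make_member(chunk: bytes) -> bytes:
    """Build one gzip member whose extra field carries the split header."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = deflater.compress(chunk) + deflater.flush()

    payload_len = 1 + len(SIGNATURE) + 4 + 4         # version, signature, two lengths
    extra_len = 4 + payload_len                       # sub-field header + payload
    record_len = 10 + 2 + extra_len + len(body) + 8   # whole member incl. header/trailer
    payload = struct.pack("<B23sII", VERSION, SIGNATURE, len(chunk), record_len)
    extra = SUBFIELD_ID + struct.pack("<H", payload_len) + payload

    # ID1, ID2, CM=8 (deflate), FLG=FEXTRA, MTIME=0, XFL=0, OS=255 (unknown), XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 0x04, 0, 0, 255, extra_len)
    trailer = struct.pack("<II", zlib.crc32(chunk), len(chunk) & 0xFFFFFFFF)
    return header + extra + body + trailer


def make_stream(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Compress data as independent members appended back to back."""
    return b"".join(
        make_member(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)
    )


data = b"hello hadoop " * 20000
stream = make_stream(data)
# A decoder that knows nothing about the split header still recovers the data.
assert gzip.decompress(stream) == data
```

Because every member starts with fresh deflate history, a split-aware reader 
could begin decoding at any member boundary, while an ordinary gzip tool just 
sees concatenated members.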

> 1.how does the Hadoop job system know that it should try to create splits for 
> the input file?

This is one area where I would very much like feedback. I would propose either 
A - by selecting a dedicated splittable-gzip codec, or B - by selecting a 
splittable option in a modified version of the existing codec, much like the 
compression level option is selected today.

> 2.should the files use the .gz file extension? Or perhaps something like .sgz 
> (splittable gz) instead?

I'm torn here as well. Given that we are conforming to the existing gzip spec, 
and given that a splitting codec can decode non-splittable streams and the 
non-splitting codec can decode splittable streams, is there any harm in 
encoding splittable gzip into .gz's?

Certainly if we take option B above, such that the same codec encodes both 
formats, we would make the same codec decode both, and I can see no reason at 
all to have a separate file extension.

> 3.what will be the advantages over the existing splittable compression 
> options we have now (LZO/Snappy/Bzip2/...)? Why would I as a Hadoop 
> developer/administrator want to choose this codec?

Snappy
 - is faster to decode and *much* faster to encode
 - does not compress as thoroughly
 - is not splittable in and of itself (but can be used in Sequence Files and 
Avro which are splittable)
LZO
 - inferior to snappy in speed
 - comparable to snappy in compression ratio
 - splittable by means of an indexing scheme which creates a separate file
 - encumbered with some licensing issues
Bzip2
 - inferior to gzip/deflate in speed
 - superior to gzip/deflate in compression
 - splittable by a bit by bit signature search
   - the signature is only 48 bits in length
   - the value for the signature is not random and may appear in natural data

So, you would choose this codec when
 - you value compression efficiency over speed (compared to LZO and Snappy)
 - you want splittability, but don't want to deal with a Sequence File / Avro
 - you want more speed and/or splitting robustness than Bzip2 can provide

Note also that this scheme closely follows the Bzip2 splittability model such 
that we don't have to implement yet another set of basic splittability classes. 
The LZO scheme is a radically different beast.

Having said all this, gzip explicitly does not require the compression 
algorithm to be Deflate. This scheme is adaptable to allow splitting of any 
compression format by using a different CM value in the gzip header, although 
zlib libraries would not decode this format. Supporting other formats would 
require the gzip decoder to evaluate the CM byte and pass control to the 
appropriate decompressor rather than just running the whole record, header and 
all into zlib as we do now.
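
A minimal sketch of that dispatch, assuming a single complete gzip member is 
already in memory: the decoder parses the header itself, looks at the CM byte, 
and hands only the compressed body to the matching decompressor. The 
DECOMPRESSORS table and the function names are hypothetical; only CM = 8 
(deflate) is defined by the gzip spec, so any other entry would be a private 
extension.

```python
import gzip
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10  # gzip FLG bits


def inflate_raw(body: bytes) -> bytes:
    """Decode a headerless (raw) deflate stream."""
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(body)


# CM value -> decompressor. Hypothetical registry; other codecs would be
# added here under CM values of their own.
DECOMPRESSORS = {8: inflate_raw}


def header_length(member: bytes) -> int:
    """Return the offset of the compressed body within a gzip member."""
    flg = member[3]
    pos = 10                                   # fixed part of the header
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", member, pos)
        pos += 2 + xlen
    if flg & FNAME:
        pos = member.index(b"\x00", pos) + 1   # zero-terminated file name
    if flg & FCOMMENT:
        pos = member.index(b"\x00", pos) + 1   # zero-terminated comment
    if flg & FHCRC:
        pos += 2
    return pos


def decode_member(member: bytes) -> bytes:
    """Decode one gzip member, choosing the decompressor from the CM byte."""
    if member[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    cm = member[2]
    if cm not in DECOMPRESSORS:
        raise ValueError(f"unsupported CM value {cm}")
    data = DECOMPRESSORS[cm](member[header_length(member):-8])
    crc, isize = struct.unpack("<II", member[-8:])
    if zlib.crc32(data) != crc or (len(data) & 0xFFFFFFFF) != isize:
        raise ValueError("trailer mismatch")
    return data


payload = b"abc" * 100
assert decode_member(gzip.compress(payload)) == payload
```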

Thanks so much for your review, thoughts, and questions!

    - Tim.
                
> Implement Splittable Gzip based on a signature in a gzip header field
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
>  - add a more robust header such that false matches are vanishingly unlikely
>  - repeat initial bytes of the header for very fast split searching
>  - break down the stream into modest size chunks (~64k?) for rapid parallel 
> encode and decode
>  - provide length information on the blocks in advance to make block decode 
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header 
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where 
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this 
> record
> <compressedRecordLength> := 32-bit length of this record as compressed, 
> including all headers, trailers
> If multiple extra headers are present and the split header is not the first 
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as 
> BlockCompressorStream does. Non-split-aware decoders will ignore this header 
> and decode the appended blocks without ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of 
> Moore's law before collisions become a significant concern.
> The first 7 bytes are repeated for speed. When splitting, the signature 
> search will look for the 32-bit value aaaa every 4 bytes until a hit is 
> found, then the next 4 bytes identify the alignment of the header mod 4 to 
> identify a potential header match, then the whole header is validated at that 
> offset. So, there is a load, compare, branch, and increment per 4 bytes 
> searched.
> The existing gzip implementations do not provide access to the optional 
> header fields (nor comment nor filename), so the entire gzip header will have 
> to be reimplemented and compression will need to be done using the raw 
> deflate options of the native library / built in deflater.
> There will be some degradation when using splittable gzip:
>  - The gzip headers will each be 36 bytes larger. (4 byte extra header 
> header, 32 byte extra header)
>  - There will be one gzip header per block.
>  - History will have to be reset with each block to allow starting from 
> scratch at that offset resulting in some uncompressed bytes that would 
> otherwise have been strings.
> Issues to consider:
>  - Is the searching fast enough without the repeating 7 bytes in the 
> signature?
>  - Should this be a patch to the existing gzip classes to add a switch, or 
> should this be a whole new class?
>  - Which level does this belong at? CompressionStream? Compressor?
>  - Is it more advantageous to encode the signature into the less dense 
> comment field?
>  - Optimum block size? Smaller splits faster and may conserve memory, larger 
> provides slightly better compression ratio.
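
The word-at-a-time signature search described in the issue text can be 
sketched as follows. Since the first 7 signature bytes are identical, any 
placement of the signature covers at least one 4-byte-aligned word made up 
entirely of that byte, so the scan only has to test one word per 4 bytes and 
then check the (at most four) start offsets that word allows. The function 
name and the placeholder signature are assumptions for illustration, and 
plain byte slicing stands in for the load/compare/branch loop a native 
implementation would use.

```python
def find_signature(buf: bytes, sig: bytes, start: int = 0) -> int:
    """Return the offset of the first full signature at or after start, or -1.

    Assumes sig begins with 7 identical bytes, as in the proposed header.
    """
    word = sig[:4]
    pos = (start + 3) & ~3                 # round up to a 4-byte boundary
    while pos + 4 <= len(buf):
        if buf[pos:pos + 4] == word:
            # The signature may begin 0 to 3 bytes before this aligned word.
            for back in range(3, -1, -1):
                cand = pos - back
                if cand >= start and buf[cand:cand + len(sig)] == sig:
                    return cand
        pos += 4
    return -1


# Placeholder signature: 7 repeated bytes followed by 16 distinct bytes.
sig = bytes([0xA7] * 7) + bytes(range(0x10, 0x20))
buf = b"\x00" * 5 + sig + b"\x00" * 10
assert find_signature(buf, sig) == 5
```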

