[jira] [Commented] (HADOOP-7909) Implement Splittable Gzip based on a signature in a gzip header field

Niels Basjes (Commented) (JIRA) Mon, 12 Dec 2011 04:33:03 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167474#comment-13167474
 ]


Niels Basjes commented on HADOOP-7909:
--------------------------------------

Tim,

I think both of those questions are hard because Hadoop MapReduce performs 
roughly these steps, in this order:
1. Look at the input filename and filesize
2. Determine codec and if it is splittable
3. Create the split definitions
4. Distribute these split definitions over many tasks / nodes on your cluster
5. Each task opens a file and tries to process the indicated split.

If you discover at step 5 that you cannot split correctly, you cannot go back 
to step 3 to redefine the splits.
I also expect that falling back to the HADOOP-7076 approach afterwards is hard: 
what if the first split succeeds and the second one doesn't (they run on 
completely separate nodes)?
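
To illustrate steps 1-3: a minimal sketch, assuming a hypothetical 
extension-to-splittable table. The names (plan_splits, Split, 
SPLITTABLE_BY_EXTENSION) are made up for this example and are not Hadoop's 
API. The point is that the plan is built from the file name and size alone, 
before any task has read a single byte of the file.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    path: str
    start: int
    length: int


# Hypothetical lookup: file extension -> can the codec be split?
SPLITTABLE_BY_EXTENSION = {".gz": False, ".bz2": True}


def plan_splits(path, file_size, split_size):
    """Steps 1-3: plan splits from name and size only; no content is read."""
    extension = os.path.splitext(path)[1]
    # Assumption: files without a known codec extension are plain and splittable.
    splittable = SPLITTABLE_BY_EXTENSION.get(extension, True)
    if not splittable:
        return [Split(path, 0, file_size)]
    return [
        Split(path, start, min(split_size, file_size - start))
        for start in range(0, file_size, split_size)
    ]
```

By the time a task at step 5 finds out that its split is unusable, this 
plan has already been handed out to the other tasks.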

Niels
                
> Implement Splittable Gzip based on a signature in a gzip header field
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
>  - add a more robust header such that false matches are vanishingly unlikely
>  - repeat initial bytes of the header for very fast split searching
>  - break down the stream into modest size chunks (~64k?) for rapid parallel 
> encode and decode
>  - provide length information on the blocks in advance to make block decode 
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header 
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where 
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this 
> record
> <compressedRecordLength> := 32-bit length of this record as compressed, 
> including all headers, trailers
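
The 32-byte layout above could be packed and validated roughly as follows. 
This is a sketch under assumptions: the byte order (little-endian, as gzip 
uses for its own fields) is not stated in the proposal, the signature bytes 
are a placeholder rather than real random values, and the function names 
are hypothetical.

```python
import struct

VERSION = 1

# Placeholder signature: 7 repeated bytes followed by 16 further bytes (23 total).
# A real definition would fix these 17 byte values once, at random.
SIGNATURE = b"\xa5" * 7 + bytes(range(1, 17))

# <version:1><signature:23><uncompressedDataLength:4><compressedRecordLength:4>
# Little-endian is an assumption; "<" also disables alignment padding.
SPLIT_HEADER = struct.Struct("<B23sII")


def pack_split_header(uncompressed_len, compressed_len):
    """Build the 32-byte split header for one block."""
    return SPLIT_HEADER.pack(VERSION, SIGNATURE, uncompressed_len, compressed_len)


def unpack_split_header(data):
    """Return (uncompressed_len, compressed_len), or None if data is no split header."""
    if len(data) != SPLIT_HEADER.size:
        return None
    version, signature, uncompressed_len, compressed_len = SPLIT_HEADER.unpack(data)
    if version != VERSION or signature != SIGNATURE:
        return None
    return uncompressed_len, compressed_len
```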
> If multiple extra headers are present and the split header is not the first 
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as 
> BlockCompressorStream does. Non-split-aware decoders will ignore this header 
> and decode the appended blocks without ever noticing the difference.
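
A small sketch of the append-and-ignore behaviour, assuming Python's zlib 
and gzip modules as a stand-in for a non-split-aware decoder. Each block 
becomes its own gzip member whose header is written by hand with the FEXTRA 
flag set and whose body is raw deflate. The subfield ID "SZ" and the 
function name gzip_member are hypothetical, and the 32 extra bytes are 
zero-filled placeholders.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # gzip FLG bit: an extra field follows the fixed header


def gzip_member(payload, extra_data, subfield_id=b"SZ"):
    """Build one complete gzip member carrying extra_data in its extra field."""
    # Negative wbits selects raw deflate: no zlib or gzip wrapper is emitted.
    compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    # Subfield: 2-byte ID, 2-byte little-endian length, then the data.
    subfield = subfield_id + struct.pack("<H", len(extra_data)) + extra_data
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(subfield)) + subfield
    # Trailer: CRC-32 and length (mod 2**32) of the uncompressed payload.
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


blocks = [b"first block, ", b"second block"]
stream = b"".join(gzip_member(block, bytes(32)) for block in blocks)
# An ordinary decoder skips the extra field and reads the members back to back.
restored = gzip.decompress(stream)
```

Here restored equals the concatenation of the original blocks, even though 
the decoder knows nothing about the extra field.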
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of 
> Moore's law before collisions become a significant concern.
> The first 7 bytes are repeated for speed. When splitting, the signature 
> search will look for the 32-bit value aaaa every 4 bytes until a hit is 
> found, then the next 4 bytes identify the alignment of the header mod 4 to 
> identify a potential header match, then the whole header is validated at that 
> offset. So, there is a load, compare, branch, and increment per 4 bytes 
> searched.
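
The search described above might look like this. It is a sketch, not the 
proposed implementation: the signature is a placeholder, find_signature is 
a hypothetical name, and instead of decoding the alignment from the next 4 
bytes it simply validates the full signature at each of the four candidate 
offsets after an aligned hit.

```python
# Placeholder signature: 7 repeated bytes, then 16 further bytes (23 total).
REPEATED = b"\xa5"
SIGNATURE = REPEATED * 7 + bytes(range(1, 17))
WORD = REPEATED * 4  # the 32-bit value "aaaa"


def find_signature(buf):
    """Return the offset of the first full signature in buf, or -1 if none."""
    k = 0
    # Probe one aligned 4-byte word per step. Because the first byte repeats
    # 7 times, one aligned word always falls entirely inside that run.
    while k + 4 <= len(buf):
        if buf[k:k + 4] == WORD:
            # The signature can start at most 3 bytes before the aligned hit.
            for start in range(max(k - 3, 0), k + 1):
                if buf[start:start + len(SIGNATURE)] == SIGNATURE:
                    return start
        k += 4
    return -1
```

A run of the repeated byte in ordinary data only costs a failed validation; 
the scan then continues with the next aligned word.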
> The existing gzip implementations do not provide access to the optional 
> header fields (or to the comment or filename fields), so the entire gzip 
> header will have to be reimplemented and compression will need to be done 
> using the raw deflate options of the native library / built-in deflater.
> There will be some degradation when using splittable gzip:
>  - The gzip headers will each be 36 bytes larger. (4 byte extra header 
> header, 32 byte extra header)
>  - There will be one gzip header per block.
>  - History will have to be reset with each block to allow starting from 
> scratch at that offset, resulting in some uncompressed bytes that would 
> otherwise have been encoded as matched strings.
> Issues to consider:
>  - Is the searching fast enough without the repeating 7 bytes in the 
> signature?
>  - Should this be a patch to the existing gzip classes to add a switch, or 
> should this be a whole new class?
>  - Which level does this belong at? CompressionStream? Compressor?
>  - Is it more advantageous to encode the signature into the less dense 
> comment field?
>  - Optimum block size? Smaller blocks split faster and may conserve memory; 
> larger blocks provide a slightly better compression ratio.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
