[
https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167474#comment-13167474
]
Niels Basjes commented on HADOOP-7909:
--------------------------------------
Tim,
I think that both of those questions are hard because Hadoop
MapReduce does roughly these steps in this order:
1. Look at the input filename and filesize
2. Determine codec and if it is splittable
3. Create the split definitions
4. Distribute these split definitions over many tasks / nodes on your cluster
5. Each task opens a file and tries to process the indicated split.
If you find around step 5 that you cannot split correctly, then you cannot go
back to step 3 to redefine the splits.
Also, I expect that trying the HADOOP-7076 way afterwards is hard: what if the
first split succeeds and the second one doesn't (the two run on completely
separate nodes)?
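
To illustrate the ordering problem, here is a minimal sketch (plain Python, not
Hadoop code). The names Split, is_splittable_by_name and plan_splits are
hypothetical, and the extension check is only a stand-in for the real codec
lookup. The point is that the split plan is fixed from the filename and size
alone, before any task has read a single byte of the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    path: str
    start: int
    length: int


def is_splittable_by_name(path):
    # Hypothetical stand-in for step 2: the decision uses only the filename.
    return not path.endswith(".gz")


def plan_splits(path, file_size, split_size):
    """Steps 1-3: build the split definitions without opening the file."""
    if not is_splittable_by_name(path):
        # One split covering the whole file.
        return [Split(path, 0, file_size)]
    return [
        Split(path, offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


# Steps 4 and 5 happen later, on other nodes. A task that discovers its
# split cannot be decoded has no way to change the list returned here.
plan = plan_splits("logs.txt", 1000, 256)
```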
Niels
> Implement Splittable Gzip based on a signature in a gzip header field
> ---------------------------------------------------------------------
>
> Key: HADOOP-7909
> URL: https://issues.apache.org/jira/browse/HADOOP-7909
> Project: Hadoop Common
> Issue Type: New Feature
> Components: io
> Reporter: Tim Broberg
> Priority: Minor
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
> - add a more robust header such that false matches are vanishingly unlikely
> - repeat initial bytes of the header for very fast split searching
> - break down the stream into modest size chunks (~64k?) for rapid parallel
> encode and decode
> - provide length information on the blocks in advance to make block decode
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this
> record
> <compressedRecordLength> := 32-bit length of this record as compressed,
> including all headers, trailers
> If multiple extra headers are present and the split header is not the first
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as
> BlockCompressorStream does. Non-split-aware decoders will ignore this header
> and decode the appended blocks without ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of
> Moore's law before collisions become a significant concern.
> The first 7 bytes of the signature are one repeated byte, for speed. When splitting, the signature
> search will look for the 32-bit value aaaa every 4 bytes until a hit is
> found, then the next 4 bytes identify the alignment of the header mod 4 to
> identify a potential header match, then the whole header is validated at that
> offset. So, there is a load, compare, branch, and increment per 4 bytes
> searched.
> The existing gzip implementations do not provide access to the optional
> header fields (nor to the comment or filename), so the entire gzip header
> will have to be reimplemented and compression will need to be done using the
> raw deflate options of the native library / built-in deflater.
> There will be some degradation when using splittable gzip:
> - The gzip headers will each be 36 bytes larger. (4 byte extra header
> header, 32 byte extra header)
> - There will be one gzip header per block.
> - History will have to be reset with each block to allow starting from
> scratch at that offset, resulting in some bytes stored as literals that
> would otherwise have been encoded as string matches.
> Issues to consider:
> - Is the searching fast enough without the repeating 7 bytes in the
> signature?
> - Should this be a patch to the existing gzip classes to add a switch, or
> should this be a whole new class?
> - Which level does this belong at? CompressionStream? Compressor?
> - Is it more advantageous to encode the signature into the less dense
> comment field?
> - Optimum block size? Smaller blocks split faster and may conserve memory;
> larger blocks provide a slightly better compression ratio.
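
The header layout and the 4-byte-stride search described above could be
sketched roughly as follows. This is an illustration under assumptions, not
the proposed implementation: the signature bytes are made up, little-endian
length fields are assumed (the description does not state a byte order), and
the names pack_split_header and find_signature are hypothetical. The search is
also a simplified variant: on a probe hit it tries the four possible offsets
directly instead of decoding the alignment from the next 4 bytes.

```python
import struct

# Made-up 23-byte signature: one byte repeated 7 times, then 16 other bytes.
SIGNATURE = bytes([0xA7]) * 7 + bytes(range(0x10, 0x20))
# <version><signature><uncompressedDataLength><compressedRecordLength>
HEADER = struct.Struct("<B23sII")  # assumption: little-endian, no padding
PROBE = SIGNATURE[:4]              # the 32-bit value "aaaa"


def pack_split_header(version, uncompressed_len, compressed_len):
    """Build the 32-byte split header (without the 4-byte subfield header)."""
    return HEADER.pack(version, SIGNATURE, uncompressed_len, compressed_len)


def find_signature(buf, start=0):
    """Return the offset of the first signature at or after start, or -1."""
    i = (start + 3) & ~3  # round up to a 4-byte boundary
    while i + 4 <= len(buf):
        # One compare per 4 bytes. A run of 7 equal bytes always contains
        # exactly one aligned 4-byte window, so no signature is skipped.
        if buf[i:i + 4] == PROBE:
            for back in range(4):  # the run starts 0..3 bytes before the hit
                s = i - back
                if s >= start and buf[s:s + len(SIGNATURE)] == SIGNATURE:
                    return s
        i += 4
    return -1


header = pack_split_header(1, 65536, 1234)
# 13 filler bytes, a placeholder 4-byte subfield header, then the split header.
data = b"\x00" * 13 + b"\x00" * 4 + header
offset = find_signature(data)  # signature follows the 1-byte version field
```

A repeated run shorter than 7 bytes could fall between two aligned probes,
which appears to be why the description asks whether searching is fast enough
without the repetition.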