[
https://issues.apache.org/jira/browse/AVRO-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762726#action_12762726
]
Scott Carey commented on AVRO-134:
----------------------------------
With respect to zlib, deflate and gzip:
All three use the Deflate algorithm (RFC 1951) to compress a stream:
http://www.faqs.org/rfcs/rfc1951.html
In Java terms, a raw Deflate stream is what a DeflaterOutputStream produces
when its Deflater is created with nowrap=true ('unwrapped'). I believe certain
browsers interpret HTTP Content-Encoding: deflate to mean this.
zlib is technically capable of multiple compression types, but in practice it
is a Deflate stream wrapped with a header and an Adler32 checksum footer.
http://www.faqs.org/rfcs/rfc1950.html
In Java, this is a "wrapped" DeflaterOutputStream (the default). Some browsers
require this for 'deflate'; others examine the payload and can deal with both
types of 'deflate' in HTTP land.
The header is typically 2 bytes, and the footer 4 bytes.
gzip is a file format for compressed content, but like zlib only Deflate is
used in practice. It has a relatively large header and footer, and the footer
stores a CRC32 checksum and the uncompressed size (mod 2^32).
The header is at least 10 bytes and the footer is 8 bytes.
http://www.faqs.org/rfcs/rfc1952.html
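To make the size difference between the three wrappers concrete, here is a minimal sketch using only java.util.zip. The class and method names (WrapperOverhead, rawDeflate, zlib, gzip) are made up for illustration and are not part of Avro.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only; not part of Avro.
final class WrapperOverhead {

    /** Raw Deflate (RFC 1951): nowrap=true omits the zlib header and Adler32 footer. */
    static byte[] rawDeflate(byte[] data) {
        return deflate(data, true);
    }

    /** zlib (RFC 1950): nowrap=false, which is also the java.util.zip default. */
    static byte[] zlib(byte[] data) {
        return deflate(data, false);
    }

    /** gzip (RFC 1952): GZIPOutputStream adds the gzip header and the CRC32 + size footer. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static byte[] deflate(byte[] data, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the stream finishes the Deflate stream; the Deflater itself
        // is released in the finally block because we supplied it ourselves.
        try (OutputStream out = new DeflaterOutputStream(bytes, deflater)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("raw deflate: " + rawDeflate(data).length + " bytes");
        System.out.println("zlib:        " + zlib(data).length + " bytes");
        System.out.println("gzip:        " + gzip(data).length + " bytes");
    }
}
```

With the same compression level for all three, I would expect the zlib output to be 6 bytes longer than the raw stream and the gzip output 18 bytes longer, since only the wrapper differs.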
For Avro's purposes, zlib or raw deflate seem most appropriate to me. If we
want a minimal self-checking wrapper, zlib provides that with about 6 bytes of
overhead. gzip is a bloated wrapper for the purposes here.
Perhaps we want to store an adler32 or crc32 + length for all block types in
the metadata, thus making it redundant to store it in the block? In that case
a raw deflate stream is appropriate.
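As a rough sketch of that idea, the block below keeps the uncompressed length and an Adler32 checksum next to a raw Deflate payload instead of inside it. ChecksummedBlock and its fields are hypothetical names chosen for illustration; nothing here reflects an actual Avro file layout.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Adler32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical container, for illustration only; not the Avro block format.
final class ChecksummedBlock {
    final byte[] compressed; // raw Deflate payload, no zlib/gzip wrapper
    final int length;        // uncompressed length, stored outside the payload
    final long checksum;     // Adler32 of the uncompressed bytes, stored outside the payload

    ChecksummedBlock(byte[] compressed, int length, long checksum) {
        this.compressed = compressed;
        this.length = length;
        this.checksum = checksum;
    }

    static ChecksummedBlock write(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap = raw
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return new ChecksummedBlock(out.toByteArray(), data.length, adler32(data));
    }

    byte[] read() {
        Inflater inflater = new Inflater(true); // nowrap: expect a raw Deflate stream
        inflater.setInput(compressed);
        byte[] data = new byte[length];
        int off = 0;
        try {
            while (off < length && !inflater.finished()) {
                int n = inflater.inflate(data, off, length - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // payload ended before producing 'length' bytes
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
        if (off != length) {
            throw new IllegalStateException("length mismatch: got " + off + ", expected " + length);
        }
        if (adler32(data) != checksum) {
            throw new IllegalStateException("checksum mismatch");
        }
        return data;
    }

    private static long adler32(byte[] data) {
        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        return adler.getValue();
    }
}
```

The reader verifies both externally stored values, so a zlib or gzip footer inside the payload would add nothing in this arrangement.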
Also, in light of future block formats, how do we control the compression
level? Is it encoded in the codec parameter, e.g. zlib:1, zlib:9, etc.? Or is
that not needed?
Deflate at compression level 1 compresses at nearly LZO speeds, and is about 10
to 30 times slower at level 9, so I'm sure we will want to have that
controllable. But that is a decision made when the stream is written and it
does not impact readers.
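A small sketch of how a "name:level" codec string could be handled on the writer side, assuming the zlib:1 / zlib:9 syntax floated above. That syntax is only a proposal in this comment, and CodecSpec is a hypothetical name. Note that the decompress method never sees the level, which is the point about readers not being affected.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical parser for a "name:level" codec string; not an Avro API.
final class CodecSpec {
    final String name;
    final int level; // Deflater.DEFAULT_COMPRESSION (-1) when no level is given

    CodecSpec(String name, int level) {
        this.name = name;
        this.level = level;
    }

    static CodecSpec parse(String spec) {
        int colon = spec.indexOf(':');
        if (colon < 0) {
            return new CodecSpec(spec.trim(), Deflater.DEFAULT_COMPRESSION);
        }
        int level = Integer.parseInt(spec.substring(colon + 1).trim());
        if (level < Deflater.NO_COMPRESSION || level > Deflater.BEST_COMPRESSION) {
            throw new IllegalArgumentException("level must be 0..9: " + level);
        }
        return new CodecSpec(spec.substring(0, colon).trim(), level);
    }

    /** Writer side: the level only influences how hard the compressor works. */
    byte[] compress(byte[] data) {
        Deflater deflater = new Deflater(level); // zlib-wrapped output
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reader side: no level parameter, the stream is self-describing. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated zlib stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

For example, CodecSpec.parse("zlib:1").compress(data) and CodecSpec.parse("zlib:9").compress(data) may produce different bytes, but CodecSpec.decompress recovers the same data from either.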
> Mismatch between the spec and implementation of metadata blocks in files
> ------------------------------------------------------------------------
>
> Key: AVRO-134
> URL: https://issues.apache.org/jira/browse/AVRO-134
> Project: Avro
> Issue Type: Bug
> Reporter: Thiruvalluvan M. G.
> Attachments: AVRO-134.patch
>
>
> The spec says there are three keys in metadata blocks - schema, count and
> _codec_. But the code in DataFileWriter adds schema, count and _sync_. The
> sync field is used by the DataFileReader. We need to do the following:
> - Add the key sync in the specification.
> - Either drop the key codec in the specification or add code to support
> codec in DataFileReader/DataFileWriter. If we decide to have codec, we need
> to also publish in the spec the list of supported codecs with their names to
> use in the metadata block.