[ 
https://issues.apache.org/jira/browse/AVRO-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762726#action_12762726
 ] 

Scott Carey commented on AVRO-134:
----------------------------------

W.r.t. zlib, deflate, and gzip:

All three use the Deflate algorithm (RFC 1951) to compress a stream:
http://www.faqs.org/rfcs/rfc1951.html
In Java terms, a raw Deflate stream is an 'unwrapped' DeflaterOutputStream, 
i.e. one whose Deflater was created with nowrap=true.  I believe that in HTTP 
certain browsers interpret Content-Encoding: deflate to mean this.  

zlib is technically capable of multiple compression methods, but in practice 
it is a Deflate stream wrapped with a header and an Adler-32 checksum footer 
(RFC 1950):
http://www.faqs.org/rfcs/rfc1950.html
In Java, this is a 'wrapped' DeflaterOutputStream, which is the default.  Some 
browsers require this for 'deflate'; others examine the payload and can deal 
with both types of 'deflate' in HTTP land.
The header is typically 2 bytes, and the footer 4 bytes.

gzip is a file format for compressed content (RFC 1952), but as with zlib, 
only Deflate is used in practice.  It has a relatively large header and 
footer: the header is at least 10 bytes, and the 8-byte footer stores a CRC32 
checksum and the uncompressed size (mod 2^32).  
http://www.faqs.org/rfcs/rfc1952.html
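
To make the framing differences concrete, here is a minimal sketch using only 
java.util.zip.  The class and method names (DeflateFraming, rawDeflate, zlib, 
gzip) are made up for illustration and are not part of Avro.  It compresses 
the same bytes three ways so the wrapper overhead can be compared directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative helper, not Avro code: same payload, three framings.
class DeflateFraming {

    // Raw Deflate (RFC 1951): nowrap=true, no header and no checksum.
    static byte[] rawDeflate(byte[] input) {
        return deflate(input, true);
    }

    // zlib (RFC 1950): nowrap=false adds a header and an Adler-32 footer.
    static byte[] zlib(byte[] input) {
        return deflate(input, false);
    }

    private static byte[] deflate(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes, deflater)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // A caller-supplied Deflater is not released by the stream.
            deflater.end();
        }
        return bytes.toByteArray();
    }

    // gzip (RFC 1952): header, Deflate data, then CRC32 and size footer.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = "avro avro avro avro avro avro avro avro"
                .getBytes(StandardCharsets.UTF_8);
        int raw = rawDeflate(sample).length;
        System.out.println("raw deflate bytes: " + raw);
        System.out.println("zlib overhead:     " + (zlib(sample).length - raw));
        System.out.println("gzip overhead:     " + (gzip(sample).length - raw));
    }
}
```

Because all three paths run the same Deflate settings, the size differences 
should come from the wrappers alone.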

For Avro's purposes, zlib or raw deflate seem most appropriate to me.  If we 
want a minimal wrapper with a built-in checksum, zlib provides that.  gzip is 
a bloated wrapper for our purposes. 
Perhaps we want to store an adler32 or crc32 + length for all block types in 
the metadata, thus making it redundant to store it in the block?  In that case 
a raw deflate stream is appropriate. 
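
As a rough sketch of that idea, the checksum and length could be computed over 
a block's uncompressed bytes and kept outside the raw deflate stream.  The 
class and method names below are hypothetical, and where such values would 
actually live in the file is exactly the open question above; only the 
java.util.zip checksum classes are real.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Illustrative helper, not Avro code: integrity values kept outside the block.
class BlockIntegritySketch {

    // Runs any java.util.zip.Checksum over the whole block.
    static long checksum(Checksum algorithm, byte[] block) {
        algorithm.reset();
        algorithm.update(block, 0, block.length);
        return algorithm.getValue();
    }

    // CRC32 is the checksum gzip stores in its footer.
    static long crc32(byte[] block) {
        return checksum(new CRC32(), block);
    }

    // Adler-32 is the checksum zlib stores in its footer.
    static long adler32(byte[] block) {
        return checksum(new Adler32(), block);
    }

    public static void main(String[] args) {
        byte[] block = "example block contents".getBytes(StandardCharsets.UTF_8);
        System.out.printf("length=%d crc32=%08x adler32=%08x%n",
                block.length, crc32(block), adler32(block));
    }
}
```

A reader would recompute the same value after inflating the block and compare 
it with the stored one.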

Also, in light of future block formats, how do we control the compression 
level?  Is it encoded in the codec parameter, e.g. zlib:1, zlib:9, etc.?  Or 
is that not needed?
Deflate at compression level 1 compresses at nearly LZO speeds and is about 10 
to 30 times slower at level 9, so I'm sure we will want that to be 
controllable.  But that decision is made when the stream is written and does 
not impact readers.
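
A small sketch of why the level is a writer-only concern: the level is passed 
to the Deflater, while the Inflater takes no level at all and decodes either 
output the same way.  The class and method names are hypothetical; raw deflate 
(nowrap=true) is assumed here only to match the discussion above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative helper, not Avro code: level chosen on write, ignored on read.
class CompressionLevelSketch {

    // Writer side: level ranges from 1 (fastest) to 9 (smallest output).
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Reader side: no level parameter exists; any valid stream decodes.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater(true);
        try {
            // The Inflater Javadoc asks for one extra dummy byte with nowrap.
            inflater.setInput(Arrays.copyOf(compressed, compressed.length + 1));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated deflate stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
    }
}
```

Calling decompress(compress(data, 1)) and decompress(compress(data, 9)) should 
both give back data, which is why the level would not need to be recorded for 
readers even if writers expose it as a setting.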

> Mismatch between the spec and implementation of metadata blocks in files
> ------------------------------------------------------------------------
>
>                 Key: AVRO-134
>                 URL: https://issues.apache.org/jira/browse/AVRO-134
>             Project: Avro
>          Issue Type: Bug
>            Reporter: Thiruvalluvan M. G.
>         Attachments: AVRO-134.patch
>
>
> The spec says there are three keys in metadata blocks - schema, count and 
> _codec_. But the code in DataFileWriter adds schema, count and _sync_. The 
> sync field is used by the DataFileReader. We need to do the following:
>    - Add the key sync in the specification.
>    - Either drop the key codec in the specification or add code to support 
> codec in DataFileReader/DataFileWriter. If we decide to have codec, we need 
> to also publish  in the spec the list of supported codecs with their names to 
> use in the metadata block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
