On Jan 22, 2010, at 11:39 AM, Doug Cutting wrote:

> Scott Carey wrote:
>> On the specific needs for compression options, I would rather have 
>> avro.codec.options as a general purpose container for codec options than
>> avro.codec.compression_level.   Some codecs have compression levels like 
>> gzip, 0 to 9.  Others have a set of flags or multiple dimensions of options. 
>>  Each codec can do what it will with avro.codec.options.   Deflate can have 
>> "level=[0-9]" for values.
>> Additionally, the Codec API can incorporate a 
>> 
>> public String getOptions();
>> public void SetOptions(String options); 
>> 
>> interface so that file appends can pick up the options that the file was 
>> created with.
> 
> Strictly speaking, we don't need to include options in the file, since 
> they don't affect the format.  They could even be misleading, since one 
> might use different compression levels in different append operations, 
> and I don't see any strong reason to prohibit that.
> 

It could be misleading for codec formats like gzip/deflate, where all 
parameters are optional.
For some codecs, however, they may not be optional.  LZO, for example, has 
several formats, and a header indicates which one is used.  This can be in the 
data block or in the metadata.  I think the Codec API and metadata namespace 
should not restrict that choice up front.

> A given application could always store its options and re-use them when 
> appending, e.g., my.gzip.level=5.  If they're included in the spec then 
> would we then prohibit one to override them?  If not, what would be the 
> purpose of putting them in the spec?

Those semantics are codec dependent.  It's simply a namespace for codecs to 
store parameters.  We do not know in advance what the semantics of these 
parameters are.

> 
> Also, rather than packing all options into a single string that must be 
> parsed, we might instead reserve avro.codec.<codecName>.* for 
> codec-specific options.  So one might specify avro.codec.deflate.level 
> as 5.  The codec name is actually redundant, since only a single codec 
> name is permitted per file.  So this could just instead perhaps be 
> avro.codec.level without much fear of confusion.
> 

Reserving a single name (avro.codec.options) or an entire namespace 
(avro.codec.*) is fine.  The former is just a simpler interface.  The latter 
would mean that the Codec API would have 
    String getOption(String optionName) 
instead of 
    String getOptions()
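
To make the two layouts concrete, here is a minimal sketch.  Everything in it 
is hypothetical: the class name, the "k1=v1,k2=v2" syntax for the single 
options string, and the use of String values for file metadata (a 
simplification) are assumptions for illustration, not an existing Avro API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro: contrasts the two metadata layouts.
final class CodecOptionsSketch {

    // Layout 1: a single reserved key (avro.codec.options) whose value the
    // codec must parse itself.  The comma/equals syntax is an assumption.
    static Map<String, String> parseOptions(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        if (options == null || options.trim().isEmpty()) {
            return result;
        }
        for (String rawPair : options.split(",")) {
            String pair = rawPair.trim();
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed option: " + pair);
            }
            result.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return result;
    }

    // Layout 2: a reserved namespace (avro.codec.*) with one metadata entry
    // per option, so no parsing is needed.
    static String getOption(Map<String, String> fileMetadata, String optionName) {
        return fileMetadata.get("avro.codec." + optionName);
    }
}
```

With the first layout, parseOptions("level=5") yields a map containing 
level -> 5.  With the second, the same setting would live under the metadata 
key avro.codec.level and be fetched with getOption(metadata, "level").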

Either way, the Codec needs a way to read and store options in a file.  
Gzip/Deflate can live without it since all streams are read-compatible.  LZF is 
the same.  Not all codecs are, however.  I've been thinking of trying an 
LZP-class algorithm (faster encode, slower decode, smaller compressed size than 
LZ types like LZO), but the size of the hash table and the hash algorithm are 
needed at decode time.
Passing options is better than exploding the number of codecs, hard-coding 
parameters needed at decompress time, or (usually) storing the parameters in 
the data portion of each block.
Exposing parameters to the Codec API means the decision on which of the above 
is the right thing to do for a given codec is up to the codec.
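
As a rough sketch of that round trip, assuming the getOptions/setOptions 
interface proposed earlier in the thread: the writer stores whatever the codec 
reports under avro.codec.options, and a reader or appender hands it back to a 
fresh codec instance.  The interface name, the LZP-like codec, its "hashBits" 
parameter, and the String-valued metadata map are all hypothetical and only 
illustrate the idea.

```java
import java.util.Map;

// Hypothetical interface mirroring the getOptions/setOptions proposal.
interface OptionCodec {
    String getOptions();
    void setOptions(String options);
}

// Hypothetical codec whose decoder must know the hash table size used at
// encode time.  No compression is implemented here.
final class LzpLikeCodec implements OptionCodec {
    private static final String PREFIX = "hashBits=";
    private int hashBits = 16; // assumed default

    @Override
    public String getOptions() {
        return PREFIX + hashBits;
    }

    @Override
    public void setOptions(String options) {
        if (options == null || !options.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Unsupported options: " + options);
        }
        hashBits = Integer.parseInt(options.substring(PREFIX.length()));
    }

    int hashTableSize() {
        return 1 << hashBits;
    }
}

// Hypothetical glue between a codec and the file's metadata map.
final class OptionRoundTrip {
    static final String KEY = "avro.codec.options";

    // Called when the file is created.
    static void writeHeader(OptionCodec codec, Map<String, String> metadata) {
        metadata.put(KEY, codec.getOptions());
    }

    // Called when the file is opened for reading or appending.
    static void readHeader(OptionCodec codec, Map<String, String> metadata) {
        String options = metadata.get(KEY);
        if (options != null) {
            codec.setOptions(options);
        }
    }
}
```

A codec like Deflate could simply return an empty or advisory string and 
ignore it on read, which is the sense in which the decision stays with the 
codec.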


None of the above matters that much at this time from the public spec 
perspective since it is all within avro.*.  

But for use internal to the avro namespace, I think it is useful to have a 
general rule that if a name is reserved for a feature, its subspaces are also 
reserved.  For example, avro.codec is used by the Codec API/Feature, and thus 
avro.codec.* is implicitly reserved for future use by that API/Feature.
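
The rule is easy to state as a check.  A minimal sketch, with a hypothetical 
class and method name (nothing like this exists in Avro as far as this thread 
is concerned):

```java
// Hypothetical helper: a key is reserved if it equals a reserved name or
// lies in that name's dot-separated subspace.
final class ReservedNames {
    static boolean isReserved(String key, String reservedName) {
        return key.equals(reservedName) || key.startsWith(reservedName + ".");
    }
}
```

Under this check, avro.codec.level falls under the reservation of avro.codec, 
while a sibling name such as avro.codecs does not.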


-Scott
