[ 
https://issues.apache.org/jira/browse/AVRO-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tiago Margalho updated AVRO-3223:
---------------------------------
    Priority: Major  (was: Minor)

> Support optional codecs in C# library
> -------------------------------------
>
>                 Key: AVRO-3223
>                 URL: https://issues.apache.org/jira/browse/AVRO-3223
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: csharp
>            Reporter: Tiago Margalho
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> I'd like to propose a change to the C# library to allow it to dynamically 
> support codecs that aren't supported out of the box. That is, instead of 
> adding more codecs to the library, the library should allow codecs to be 
> resolved by the user.  At the moment, only deflate is supported by the 
> library.
> In fact, the library already supports "custom" codecs to be used when writing 
> to an Avro file. A user can implement a Codec and pass it into the call to 
> OpenWriter: 
> {code:c#|title=DataFileWriter.cs|borderStyle=solid}
> public static IFileWriter<T> OpenWriter(DatumWriter<T> writer, string path, 
> Codec codec);
>  {code}
> However, there's no way a user can call DataFileReader.OpenRead or 
> DataFileReader.OpenAppendWriter and specify how the codec should be resolved.
> So what I'm proposing is that we allow these other methods to resolve custom 
> codecs thus enabling the library to support codecs that are currently 
> supported in other languages but not currently on the C# implementation.
> Here are some reasons why I believe this would be a good approach to extend 
> support to other codecs:
>  - The .NET base class library may not support all codecs.
>  - The library won't have to depend on other OSS projects with potentially 
> conflicting license agreements.
>  - The user is allowed to pick which implementation of zstd, snappy, etc is 
> best for their use case (eg. may prefer a version compatible to an older 
> version of .NET, or one that is more recent and optimised for more recent 
> versions of .NET/C#).
> h2. How this would work
> At present, the library uses a codec resolution method that is only 
> internally used:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         private Codec ResolveCodec()
>         {
>             return 
> Codec.CreateCodecFromString(GetMetaString(DataFileConstants.MetaDataCodec));
>         }
> {code}
>  This can be somewhat easily extended to support other "resolvers" but first 
> we would need to allow "resolvers" to be registered to be discovered by the 
> DataFileReader:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         /// <summary>
>         /// Represents a function capable of resolving a codec identifying 
> string
>         /// into a matching codec implementation a reader can use to 
> decompress data.
>         /// </summary>
>         public delegate Codec CodecResolver(string codecMetaString);
>         private static List<CodecResolver> _codecResolvers = new 
> List<CodecResolver>();
>         /// <summary>
>         /// Registers a function that will be used to resolve a codec 
> identifying string
>         /// into a matching codec implementation when reading compressed Avro 
> data.
>         /// </summary>
>         public static void RegisterCodecResolver(CodecResolver resolver)
>         {
>             _codecResolvers.Add(resolver);
>         }
>  {code}
> Thus, the implementation of `ResolveCodec` could be:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         private Codec ResolveCodec()
>         {
>             var codecString = GetMetaString(DataFileConstants.MetaDataCodec);
>             
>             foreach (var resolver in _codecResolvers)
>             {
>                 var candidateCodec = resolver(codecString);
>                 if (candidateCodec != null)
>                 {
>                     return candidateCodec;
>                 }
>             }
>             return Codec.CreateCodecFromString(codecString);
>         }
> {code}
> h2. One additional change required
> The only change left to make is related to how decompression buffers are 
> passed around. When I tried adding support for zstd using this approach, I 
> found that I couldn't decompress data because the DataFileReader reuses a 
> `DataBlock` buffer and the actual compressed data length doesn't necessarily 
> match the length of the buffer. Although some compression algos are likely to 
> have metadata in frames/headers describing the length of the encoded payload, 
> not all implementations will allow the buffer size to be different so I think 
> it would make sense to make the buffer size explicit.
>  As a consequence I had to make a change that allows us to pass the length of 
> the data in the DataBlock into the Codec.Decompress method call:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> if (HasNextBlock())
> {
>     _currentBlock = NextRawBlock(_currentBlock);
>     _currentBlock.Data = _codec.Decompress(_currentBlock.Data, 
> (int)this._blockSize);
>     ...
> {code}
> I will be raising a PR shortly with the changes I've described above. 
> Hopefully people will find this proposal reasonable but please let me know 
> what you think!
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to