[
https://issues.apache.org/jira/browse/AVRO-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated AVRO-3223:
---------------------------------
Labels: pull-request-available (was: )
> Support optional codecs in C# library
> -------------------------------------
>
> Key: AVRO-3223
> URL: https://issues.apache.org/jira/browse/AVRO-3223
> Project: Apache Avro
> Issue Type: New Feature
> Components: csharp
> Reporter: Tiago Margalho
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I'd like to propose a change to the C# library to allow it to dynamically
> support codecs that aren't supported out of the box. That is, instead of
> adding more codecs to the library, the library should allow codecs to be
> resolved by the user. At the moment, only deflate is supported by the
> library.
> In fact, the library already supports "custom" codecs to be used when writing
> to an Avro file. A user can implement a Codec and pass it into the call to
> OpenWriter:
> {code:c#|title=DataFileWriter.cs|borderStyle=solid}
> public static IFileWriter<T> OpenWriter(DatumWriter<T> writer, string path,
> Codec codec);
> {code}
> However, there's no way a user can call DataFileReader.OpenRead or
> DataFileReader.OpenAppendWriter and specify how the codec should be resolved.
> So what I'm proposing is that we allow these other methods to resolve custom
> codecs thus enabling the library to support codecs that are currently
> supported in other languages but not currently on the C# implementation.
> Here are some reasons why I believe this would be a good approach to extend
> support to other codecs:
> - The .NET base class library may not support all codecs.
> - The library won't have to depend on other OSS projects with potentially
> conflicting license agreements.
> - The user is allowed to pick which implementation of zstd, snappy, etc is
> best for their use case (eg. may prefer a version compatible to an older
> version of .NET, or one that is more recent and optimised for more recent
> versions of .NET/C#).
> h2. How this would work
> At present, the library uses a codec resolution method that is only
> internally used:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> private Codec ResolveCodec()
> {
> return
> Codec.CreateCodecFromString(GetMetaString(DataFileConstants.MetaDataCodec));
> }
> {code}
> This can be somewhat easily extended to support other "resolvers" but first
> we would need to allow "resolvers" to be registered to be discovered by the
> DataFileReader:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> /// <summary>
> /// Represents a function capable of resolving a codec identifying
> string
> /// into a matching codec implementation a reader can use to
> decompress data.
> /// </summary>
> public delegate Codec CodecResolver(string codecMetaString);
> private static List<CodecResolver> _codecResolvers = new
> List<CodecResolver>();
> /// <summary>
> /// Registers a function that will be used to resolve a codec
> identifying string
> /// into a matching codec implementation when reading compressed Avro
> data.
> /// </summary>
> public static void RegisterCodecResolver(CodecResolver resolver)
> {
> _codecResolvers.Add(resolver);
> }
> {code}
> Thus, the implementation of `ResolveCodec` could be:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> private Codec ResolveCodec()
> {
> var codecString = GetMetaString(DataFileConstants.MetaDataCodec);
>
> foreach (var resolver in _codecResolvers)
> {
> var candidateCodec = resolver(codecString);
> if (candidateCodec != null)
> {
> return candidateCodec;
> }
> }
> return Codec.CreateCodecFromString(codecString);
> }
> {code}
> h2. One additional change required
> The only change left to make is related to how decompression buffers are
> passed around. When I tried adding support for zstd using this approach, I
> found that I couldn't decompress data because the DataFileReader reuses a
> `DataBlock` buffer and the actual compressed data length doesn't necessarily
> match the length of the buffer. Although some compression algos are likely to
> have metadata in frames/headers describing the length of the encoded payload,
> not all implementations will allow the buffer size to be different so I think
> it would make sense to make the buffer size explicit.
> As a consequence I had to make a change that allows us to pass the length of
> the data in the DataBlock into the Codec.Decompress method call:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> if (HasNextBlock())
> {
> _currentBlock = NextRawBlock(_currentBlock);
> _currentBlock.Data = _codec.Decompress(_currentBlock.Data,
> (int)this._blockSize);
> ...
> {code}
> I will be raising a PR shortly with the changes I've described above.
> Hopefully people will find this proposal reasonable but please let me know
> what you think!
> Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)