Hi everyone,

The .NET/C# Apache Arrow library currently only contains managed code, but the addition of the C Data Interface implementation opens up the ability to easily add bindings to the C++ Arrow library to add more capabilities. There is currently a draft PR open to add bindings to the Acero library, for example [1], and I'm interested in adding .NET bindings to the dataset library.

The Acero bindings PR uses the Arrow GLib library, but I couldn't find any official guidance on whether this is the recommended approach for adding new native library bindings. As far as I can tell the GLib libraries are currently only used for the Ruby Arrow library, and can be used via GObject introspection by other languages like Lua. So I'd like to start a discussion to see if there's consensus on whether using the GLib libraries should be the standard way to add new native library bindings for .NET. Standardising on one way of wrapping the C++ libraries in .NET would help keep things simpler for both users and developers.

For context, I'm a member of the open-source team at G-Research and a maintainer of ParquetSharp, a .NET library that wraps the Arrow C++ Parquet library. In ParquetSharp, we build our own native library with a C ABI that uses the C++ Arrow library from vcpkg internally, and bundle pre-built native libraries inside the ParquetSharp NuGet package for each supported OS and architecture combination.

My thoughts on the advantages and disadvantages of using GLib over a custom native wrapper library are:

Pros

* We can use the existing GLib Arrow libraries rather than having to write custom C wrappers, and any improvements made there to support .NET can also benefit users of other languages, and vice versa (although this would only be Ruby and .NET initially, and anyone using the library directly via GObject introspection)
* We can take advantage of the tooling built around GLib/GObject to avoid needing to implement a lot of boilerplate binding code manually.
  For example, we could use the GapiCodegen tool from GtkSharp [2] to help generate binding code
* There's no need to distribute a native binary with NuGet packages, and NuGet packages aren't bloated by builds for architectures that aren't used

Cons

* Users need to separately install the Arrow GLib libraries in order to use some Arrow NuGet packages, and this might complicate build and deployment processes compared to just adding a NuGet package reference to a project
* GLib code can be a lot more complicated than plain C binding code that is only going to be consumed by .NET
* Automatically generating .NET bindings for GObject libraries is not as well supported as for some other languages/runtimes
  * As far as I can tell it's expected that most .NET GLib library bindings live inside one of the many forks of GtkSharp, so all of the tooling is internal to these repositories rather than being distributed as standalone tools designed to be used by other projects
  * You can manually write code to use a GLib library, as in the Acero C# PR, but for more complex APIs I think it would make sense to take advantage of the automated tooling available

I was worried about whether it's possible to use GObject to implement bindings for some of the more complex parts of the Dataset API, like providing a .NET implementation of a KmsClientFactory, which would be required for reading encrypted Parquet data. I recently added bindings for this to ParquetSharp [3], so thought it would be a good test case to try to implement something similar with GObject. Following the GTK interface docs [4] and GtkSharp interface binding docs [5], and using the GapiCodegen library, I was able to implement something like a KmsClientFactory in C# and use it from GObject code in a C library, so it doesn't look like using GObject would be too limiting.
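
For comparison, here is a rough sketch of the callback plumbing that a hand-written C ABI wrapper (the non-GLib option) might need so that native code can call a factory implemented on the managed side. It is only an illustration under assumptions: every name in it (demo_kms_client, demo_kms_client_factory, demo_wrap_with_factory and the toy_* helpers) is hypothetical and is not the ParquetSharp, Arrow C++ or Arrow GLib API. The toy_* functions stand in for what would be C# callbacks in practice, for example delegates passed via Marshal.GetFunctionPointerForDelegate or methods marked [UnmanagedCallersOnly].

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback table for a KMS client, filled in by the caller. */
typedef struct {
    void *user_data; /* opaque handle to the implementing object */
    /* Writes the wrapped key into out; returns 0 on success. */
    int (*wrap_key)(void *user_data, const char *key,
                    const char *master_key_id, char *out, size_t out_len);
    /* Called once when the native side is finished with the client. */
    void (*release)(void *user_data);
} demo_kms_client;

/* Hypothetical factory: creates a client for a given KMS instance. */
typedef struct {
    void *user_data;
    int (*create_client)(void *user_data, const char *kms_instance_id,
                         demo_kms_client *out_client);
} demo_kms_client_factory;

/* Native side: asks the factory for a client, uses it, then releases it. */
int demo_wrap_with_factory(const demo_kms_client_factory *factory,
                           const char *kms_instance_id, const char *key,
                           const char *master_key_id,
                           char *out, size_t out_len)
{
    demo_kms_client client = { NULL, NULL, NULL };
    int status;

    assert(out != NULL && out_len > 0);
    if (factory == NULL || factory->create_client == NULL) {
        return -1;
    }
    status = factory->create_client(factory->user_data, kms_instance_id,
                                    &client);
    if (status != 0) {
        return status;
    }
    status = client.wrap_key != NULL
        ? client.wrap_key(client.user_data, key, master_key_id, out, out_len)
        : -1;
    if (client.release != NULL) {
        client.release(client.user_data);
    }
    return status;
}

/* Stand-in for the managed implementation; counts calls for the demo. */
typedef struct {
    int clients_created;
    int clients_released;
} demo_toy_state;

/* Toy "wrapping": just prefixes the key with the master key id. */
static int toy_wrap_key(void *user_data, const char *key,
                        const char *master_key_id, char *out, size_t out_len)
{
    int n;
    (void)user_data;
    n = snprintf(out, out_len, "%s:%s", master_key_id, key);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1; /* -1 if truncated */
}

static void toy_release(void *user_data)
{
    ((demo_toy_state *)user_data)->clients_released++;
}

static int toy_create_client(void *user_data, const char *kms_instance_id,
                             demo_kms_client *out_client)
{
    demo_toy_state *state = user_data;
    (void)kms_instance_id;
    state->clients_created++;
    out_client->user_data = state;
    out_client->wrap_key = toy_wrap_key;
    out_client->release = toy_release;
    return 0;
}

/* Builds a factory whose callbacks are the toy functions above. */
demo_kms_client_factory demo_toy_factory(demo_toy_state *state)
{
    demo_kms_client_factory factory = { state, toy_create_client };
    return factory;
}
```

With a custom wrapper, the function-pointer tables, the user_data handle and the release callback all have to be written and kept in sync by hand on both sides of the boundary. The GObject route replaces that with an interface definition plus generated binding code, which is the trade-off discussed above.
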
It did take me a while to get this working though, and I had a few missteps along the way, like trying to get gapi-parser working before giving up and writing an API XML file manually. I do think that if we use GapiCodegen we might want to avoid publicly exposing classes that inherit from GLib.Object, in order to keep the API simple and provide more flexibility to change things in backwards-compatible ways as the library evolves.

Does anyone have any opinions or thoughts on this?

Thanks,
Adam

[1] https://github.com/apache/arrow/pull/37544
[2] https://github.com/GtkSharp/GtkSharp
[3] https://github.com/G-Research/ParquetSharp/pull/426
[4] https://docs.gtk.org/gobject/tutorial.html#how-to-define-and-implement-interfaces
[5] https://www.mono-project.com/docs/gui/gtksharp/implementing-ginterfaces/