Hi everyone,

The .NET/C# Apache Arrow library currently only contains managed code,
but the addition of the C Data Interface implementation opens up the
ability to easily add bindings to the C++ Arrow library and expose
more of its capabilities. For example, there is currently a draft PR
open to add bindings to the Acero library [1], and I'm interested in
adding .NET bindings to the dataset library.

The Acero bindings PR uses the Arrow GLib library, but I couldn't find
any official guidance on whether this is the recommended approach for
adding new native library bindings. As far as I can tell the GLib
libraries are currently only used for the Ruby Arrow library, and can
be used via GObject introspection by other languages like Lua. So I'd
like to start a discussion to see if there's consensus on whether
using the GLib libraries should be the standard way to add new native
library bindings for .NET. Standardising on one way of wrapping the
C++ libraries in .NET would help keep things simpler for both users
and developers.

For context, I'm a member of the open-source team at G-Research and a
maintainer of ParquetSharp, a .NET library that wraps the Arrow C++
Parquet library. In ParquetSharp, we build our own native library with
a C ABI that uses the C++ Arrow library from vcpkg internally, and
bundle pre-built native libraries inside the ParquetSharp NuGet
package for each OS and architecture combination supported.

My thoughts on the advantages and disadvantages of using GLib over a
custom native wrapper library are:
Pros
* We can use the existing GLib Arrow libraries rather than having to
write custom C wrappers, and any improvements made there to support
.NET can also benefit users of other languages, and vice versa
(although this would only be Ruby and .NET initially, and anyone using
the library directly via GObject introspection)
* We can take advantage of the tooling built around GLib/GObject to avoid
needing to implement a lot of boilerplate binding code manually. For
example, we could use the GapiCodegen tool from GtkSharp [2] to help
generate binding code
* There's no need to distribute a native binary with NuGet packages,
and NuGet packages aren't bloated by builds for architectures that
aren't used
Cons
* Users need to separately install the Arrow GLib libraries in order
to use some Arrow NuGet packages, and this might complicate build and
deployment processes compared to just adding a NuGet package reference
to a project
* GLib code can be a lot more complicated than plain C binding code
that is only going to be consumed by .NET
* Automatically generating .NET bindings for GObject libraries is not
as well supported as for some other languages/runtimes
    * As far as I can tell it's expected that most .NET GLib library
bindings live inside one of the many forks of GtkSharp so all of the
tooling is internal to these repositories rather than being
distributed as standalone tools designed to be used by other projects
    * You can manually write code to use a GLib library, as in the
Acero C# PR, but for more complex APIs I think it would make sense to
take advantage of the automated tooling available

I was worried about whether it's possible to use GObject to implement
bindings for some of the more complex parts of the Dataset API, like
providing a .NET implementation of a KmsClientFactory, which would be
required for reading encrypted Parquet data. I recently added bindings
for this to ParquetSharp [3], so thought it would be a good test case
to try to implement something similar with GObject. Following the GTK
interface docs [4] and GtkSharp interface binding docs [5], and using
the GapiCodegen library, I was able to implement something like a
KmsClientFactory in C# and use it from GObject code in a C library, so
it doesn't look like using GObject would be too limiting. It did take
me a while to get this working though and I had a few missteps along
the way, like trying to get gapi-parser working before giving up and
writing an API XML file manually.
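
For comparison, here is a rough sketch of how the same kind of
host-implemented factory could be exposed through a plain C ABI instead
of a GObject interface. Every name in it (example_kms_client,
example_kms_client_factory, example_wrap_with_factory and the demo_*
functions) is hypothetical and made up for illustration; it is not the
ParquetSharp, Arrow GLib or Arrow C++ API, and the "wrapping" is just
string concatenation standing in for a real KMS call.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback table. In practice the host runtime (for
 * example .NET, via function pointers) would fill this in. */
typedef struct {
    void *state; /* opaque handle owned by the host */
    /* Wraps `key` with the named master key and writes a NUL-terminated
     * result into `out`. Returns 0 on success, non-zero on failure. */
    int (*wrap_key)(void *state, const char *key,
                    const char *master_key_id,
                    char *out, size_t out_len);
} example_kms_client;

/* Hypothetical factory callback: the native side calls this whenever
 * it needs a client. Returns 0 on success. */
typedef int (*example_kms_client_factory)(void *factory_state,
                                          example_kms_client *out_client);

/* Native-side function that depends only on the C ABI above. */
int example_wrap_with_factory(example_kms_client_factory factory,
                              void *factory_state,
                              const char *key,
                              const char *master_key_id,
                              char *out, size_t out_len)
{
    example_kms_client client = {0};
    if (factory == NULL || factory(factory_state, &client) != 0)
        return -1;
    if (client.wrap_key == NULL)
        return -1;
    return client.wrap_key(client.state, key, master_key_id, out, out_len);
}

/* Stand-in for the host implementation, written in C here only so the
 * sketch is self-contained. It does no real encryption. */
static int demo_wrap_key(void *state, const char *key,
                         const char *master_key_id,
                         char *out, size_t out_len)
{
    (void)state;
    int n = snprintf(out, out_len, "%s:%s", master_key_id, key);
    /* Treat truncation as an error. */
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

static int demo_factory(void *factory_state, example_kms_client *out_client)
{
    out_client->state = factory_state;
    out_client->wrap_key = demo_wrap_key;
    return 0;
}
```

With this shape, the .NET side would only need to pass a function
pointer and an opaque state handle across the boundary, at the cost of
hand-writing the marshalling and lifetime management that the GObject
tooling would otherwise generate.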

I do think that if we use GapiCodegen we might want to avoid publicly
exposing classes that inherit from GLib.Object in order to keep the
API simple and provide more flexibility to change things in backwards
compatible ways as the library evolves.

Does anyone have any opinions or thoughts on this?

Thanks,
Adam

[1] https://github.com/apache/arrow/pull/37544
[2] https://github.com/GtkSharp/GtkSharp
[3] https://github.com/G-Research/ParquetSharp/pull/426
[4] https://docs.gtk.org/gobject/tutorial.html#how-to-define-and-implement-interfaces
[5] https://www.mono-project.com/docs/gui/gtksharp/implementing-ginterfaces/
