The only thing I don't like about it being a private module in the Go implementation is distribution. For native Go code, consumers can just run `go get` and have it work. But this interface would require both consumers of the module and any consumers of those consumers to have a locally built version of this library available when building their Go code. It's easy to statically link in for distributing binaries, but not for library builders.

Currently, the Arrow C++ source tree already has everything set up and configured to distribute the build artifacts for the various platforms, which I assume is also why the C++ code for the JNI dataset library is in the C++ source tree (correct me if I'm wrong please). The Go build and deploy scripts don't have such a deployment because there typically is no need for one with Go. So even if it's a separate private module, I'd still prefer for it to at least be in the cpp source tree (perhaps a cpp/src/cgo directory?) in order to benefit from the existing build and CI tooling for deployment and distribution. This way, as long as the necessary dependency exists (i.e. "apt install libarrow_dataset_cgo"), `go get github.com/apache/arrow/go/dataset` would work without issue, rather than requiring additional steps for developers.

Unless there's an easy way to grab the C++ code from the Go source tree in this case and add it to the libraries being deployed from the C++ build? I'm not familiar enough with that deployment configuration to know if it's actually easy to hook into for compiling and deploying a library that isn't in the C++ source tree.

-----Original Message-----
From: Antoine Pitrou <anto...@python.org>
Sent: Monday, August 23, 2021 1:24 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration

On 23/08/2021 at 19:16, Matthew Topol wrote:
> Unfortunately, Go currently can only integrate with C++ libraries through a C
> interface. There does exist SWIG which is a generator for creating interface
> code between Go and C++, but ultimately it's just automating the creation of
> a C interface and Go glue code. Personally I'm not a fan of the code that
> SWIG generates and haven't had too much luck with it.
>
> I have a working POC of using the datasets API via CGO through a C interface
> (basically just passing around a uintptr_t which is the address of a heap
> allocated shared_ptr to a DatasetFactory/Dataset/Scanner and using the C Data
> interface for passing the resulting record batches through without copying),
> but couldn't decide on the best way to go about integrating the idea and
> cleaning it up into a real PR, hence this email thread. I initially was
> porting the Dataset API to Go, but ran into the fact that it uses the compute
> expression classes to define things and perform the filtering and realized
> that it wouldn't be a good idea to try porting the entire compute library.
>
> So it just becomes a question as to what level I do the implementation and at
> what level do I make the calls to a C interface to call into the C++, and
> then whether or not the interface is a separate component from the existing
> dataset/compute libraries which can get linked into the Go, optionally as a
> separate module so that it's not creating a dependency on the C++ libraries
> for the current arrow Go implementation, only for using the Dataset API stuff
> (and potentially the compute library).

I think the dataset C interface can start as a private module in the Go implementation. If it may be useful to other people then we can consider transferring it into the Arrow C++ source tree.

Regards

Antoine.
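
For readers unfamiliar with the handle-passing approach mentioned in the quoted message (a uintptr_t that holds the address of a heap-allocated shared_ptr), here is a rough sketch of what such a C interface might look like. It is not the actual POC code: `FakeDataset` and the `fake_dataset_*` functions are hypothetical stand-ins invented for illustration, and nothing here uses the real Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a C++ class such as a Dataset or Scanner.
// It is NOT part of Arrow; it only gives the handle something to own.
class FakeDataset {
 public:
  explicit FakeDataset(std::string uri) : uri_(std::move(uri)) {}
  std::size_t uri_length() const { return uri_.size(); }

 private:
  std::string uri_;
};

using Holder = std::shared_ptr<FakeDataset>;

extern "C" {

// Creates the object and returns the address of a heap-allocated shared_ptr
// as an opaque integer that a C caller (or cgo) can carry around.
uintptr_t fake_dataset_open(const char* uri) {
  assert(uri != nullptr);
  Holder* holder = new Holder(std::make_shared<FakeDataset>(uri));
  return reinterpret_cast<uintptr_t>(holder);
}

// Converts the integer back into the shared_ptr and calls into the object.
size_t fake_dataset_uri_length(uintptr_t handle) {
  assert(handle != 0);
  return (*reinterpret_cast<Holder*>(handle))->uri_length();
}

// Deletes the heap-allocated shared_ptr, dropping its reference.
// The caller must call this exactly once per handle.
void fake_dataset_release(uintptr_t handle) {
  delete reinterpret_cast<Holder*>(handle);
}

}  // extern "C"
```

On the Go side, functions like these would presumably be declared in a cgo preamble and the handle kept as a `C.uintptr_t`, with the release call tied to an explicit Close method. The record batches themselves would go through the C Data interface as described above, which this sketch does not cover.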