Hi folks,

I wanted to share some concerns I have about our current trajectory with regard to producing shared libraries from the Arrow build system.

Currently, a comprehensive build produces many shared libraries:

* libarrow
* libarrow_dataset
* libarrow_flight
* libarrow_python
* libgandiva
* libparquet
* libplasma

There are some others. There are a number of problems with the current approach:

* Each DLL needs its own set of "visibility" macros to control the use of __declspec(dllimport/dllexport), which is necessary to direct the import or export of symbols between DLLs on Windows. See e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
* Templates instantiated in one DLL may cause a violation of the One Definition Rule during linking (we collectively lost at least a day of work time to issues around this in ARROW-6244). It is good to be able to share common template interfaces in general
* Statically-linked dependencies in one shared lib may need to be statically linked into another library. For example, libgandiva statically links parts of LLVM, but we will likely have some other code that makes use of LLVM for other purposes (it has been discussed in the context of Avro parsing)

Overall, my preferred solution to these issues is to move to an approach similar to what the LLVM project does. To help understand, first take a look at the libraries that come from the llvm-7-dev package on Ubuntu.

There we have a collection of static "module" libraries that implement different parts of the LLVM platform. Finally, a _single_ shared library libLLVM-7.so is produced.

I think we should do the same thing in Apache Arrow, so that we only ever produce a single shared library from the build. We can additionally make the "name" of this shared library configurable to suit different needs. For example, the default name could be simply "libarrow.so" or something.
But if someone wants to produce a barebones Parquet shared library, they can override the name to create a "libparquet.so" that contains only the "libarrow_core.a" and "libarrow_io.a" symbols needed for reading Parquet files.

This would have additional benefits:

* Use the same visibility macros for all exported C++ symbols, rather than having to define DLL-specific visibility
* Improved modularization of builds and linking for third party users, similar to the way that LLVM's modular linking works. See the way that Gandiva requests specific components from LLVM to use for static linking: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
* Net simpler linking and deployment. Only one shared library to deal with

There are some drawbacks, however:

* Our C++ Linux packaging approach would need to be changed to be more LLVM-like (a single .deb/.rpm package containing the C++ platform rather than many packages as now)

Interested to hear from other C++ developers.

Thanks
Wes
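
P.S. For anyone less familiar with the visibility macro issue, here is a rough sketch of what "the same visibility macros for all exported C++ symbols" could look like once everything ends up in one shared library. All of the names here (DEMO_EXPORT, DEMO_EXPORTING, DEMO_STATIC, the demo namespace and its functions) are made up for illustration and are not the macros or APIs in the Arrow codebase. The sketch assumes the build system defines DEMO_EXPORTING when compiling the library itself; it is defined inline below only so the snippet compiles on its own.

```cpp
// Hypothetical sketch: one export macro shared by every component that is
// linked into a single shared library. None of these names come from Arrow.

// Normally set by the build system when compiling the library itself;
// defined here only so this snippet is self-contained.
#define DEMO_EXPORTING

#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(DEMO_STATIC)
#    define DEMO_EXPORT  // static build: no import/export decoration needed
#  elif defined(DEMO_EXPORTING)
#    define DEMO_EXPORT __declspec(dllexport)  // building the DLL
#  else
#    define DEMO_EXPORT __declspec(dllimport)  // consuming the DLL
#  endif
#else
   // GCC/Clang: mark the symbol visible even under -fvisibility=hidden.
#  define DEMO_EXPORT __attribute__((visibility("default")))
#endif

namespace demo {

// A "core" component and a "parquet" component use the same macro, because
// both end up in the same shared library. With one DLL per component, each
// would need its own macro so that dllimport/dllexport resolve correctly
// across DLL boundaries.
namespace core {
DEMO_EXPORT int CoreVersion() { return 1; }
}  // namespace core

namespace parquet {
DEMO_EXPORT int ParquetVersion() { return core::CoreVersion() + 1; }
}  // namespace parquet

}  // namespace demo
```

With separate DLLs, the second component would need a macro of its own, since it has to import the first component's symbols while exporting its own. With a single shared library that distinction goes away.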