Hey Uwe,

Thanks for sharing this perspective. In Ray we currently build everything from source and statically link all the libraries. This has been working, but it incurs long build times for people who want to build everything from source. For the wheels we then ship the binaries, and we also include our own private version of pyarrow so that it does not clash with anything the user has installed.
-- Philipp.

On Fri, Mar 2, 2018 at 6:07 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello all,
>
> as we're intensive Arrow users in different environments and each of
> them raises its own unique set of bugs, I thought I would share some
> insights into various things and give a bit of background to the issues
> I have raised and will raise. In my setting, there is a Python wheel
> `pyarrow`, a wheel `turbodbc` that depends on that wheel, various pure
> Python packages that depend on them, as well as some C++ code that
> depends on Arrow C++ (incl. libarrow_python) and Boost.
>
> At first, we linked Parquet & Arrow as private static libraries into
> the C++ code and used the Python wheels just as they are produced in
> the official build. This led to the issue that parquet-cpp was linked
> statically in one case and dynamically in the other case. As parquet-cpp
> has global variables that are destructed on library unload / program
> exit, these destructors were crashing with a double-free in some cases.
> As it turns out, destructors for global variables are registered and
> called differently depending on whether they originate from a static or
> a shared library. Thus if you link a library in both ways, they are
> called twice. One call succeeds, the other one crashes.
>
> Due to this, I made the change that Arrow should now be dynamically
> linked everywhere and Parquet statically. The reason to link parquet-cpp
> statically was that there was no authoritative source for the Parquet
> library, i.e. we produce a pyarrow wheel, not a parquet wheel, so
> parquet should be a private thing of that library. Several people I
> asked agreed that this was the correct way to do such things. This works
> nicely until you reach the point where you embed the Python interpreter
> in a process that depends natively on parquet-cpp and load `pyarrow` as
> a Python package in there. When you end a program, all the destructors
> of all libraries are called, just fine. But when you close your embedded
> Python interpreter before the end of the whole program, all Python
> libraries are unloaded. Thus the parquet-cpp that is statically linked
> into pyarrow calls its destructor. Sadly, at the end of the whole
> program, the destructors of all libraries that were linked into the main
> program are called as well. In this case the parquet-cpp destructor was
> called a second time and we get a double free again. In contrast to this
> behaviour, if parquet-cpp was dynamically linked in both places, a
> reference counter should be held by the process and the destructor
> should only be called once on program exit:
> https://issues.apache.org/jira/browse/ARROW-2245
>
> This is just one of the funny things we had to debug and I thought it
> might be good to share some insights on one of these packaging issues.
> In the end, this also means that we might need to rethink which packages
> we actually want to statically link in the conda-forge package. My
> feeling here is that especially the larger ones like Thrift and Boost
> should be dynamically linked.
>
> My main learnings from this are:
> * having control over your build toolchain is key. The pyarrow wheels
>   will never be as good as the conda-forge packages. Things like
>   https://issues.apache.org/jira/browse/ARROW-1975 and
>   https://issues.apache.org/jira/browse/ARROW-2246
> * dynamic linking may not be the ultimate silver bullet for this problem
>   but
>
> Uwe
>
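
When chasing clashes like the ones described above, it can help to see which shared objects a process has actually mapped and whether the same library shows up from more than one location. The following is a minimal, Linux-only sketch of that check, assuming the usual `/proc/<pid>/maps` text format; the helper names `shared_objects` and `copies_by_name` are made up for illustration and are not part of Arrow, pyarrow or Ray. Note that it cannot see statically linked copies (such as a parquet-cpp linked into pyarrow), because those do not appear as separate files.

```python
import os
import re
from collections import defaultdict

# Matches a shared-object suffix with an optional version, e.g. ".so" or ".so.1.0.0".
_SO_SUFFIX = re.compile(r"\.so(\.\d+)*$")


def shared_objects(maps_text):
    """Return the set of shared-object paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if _SO_SUFFIX.search(path):
                paths.add(path)
    return paths


def copies_by_name(paths):
    """Group paths by library name (version suffix removed) and keep
    only the names that are mapped from more than one path."""
    groups = defaultdict(set)
    for path in paths:
        name = _SO_SUFFIX.sub("", os.path.basename(path))
        groups[name].add(path)
    return {name: sorted(found) for name, found in groups.items() if len(found) > 1}


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Inspect the current process; import the packages of interest first
    # (hypothetical usage) so that their libraries are mapped.
    with open("/proc/self/maps") as f:
        for name, found in copies_by_name(shared_objects(f.read())).items():
            print(name, "is mapped from several paths:", found)
```

Running this inside the process in question, after the relevant Python packages have been imported, lists every library name that is mapped from more than one path, which is one hint that two independently shipped copies are in play.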