Hello all,

as we're intensive Arrow users in different environments and each of them 
raises its own set of bugs, I thought I'd share some insights into various 
things and give a bit of background on the issues I have raised and will 
raise. In my setting, there is a Python wheel `pyarrow`, a wheel `turbodbc` 
that depends on that wheel, various pure Python packages that depend on them, 
as well as some C++ code that depends on Arrow C++ (incl. libarrow_python) 
and Boost.

At first, we linked Parquet and Arrow as private static libraries into the 
C++ code and used the Python wheels just as they are produced in the official 
build. This led to the issue that parquet-cpp was linked statically in one 
case and dynamically in the other. As parquet-cpp has global variables that 
are destructed on library unload / program exit, these destructors were 
crashing with a double free in some cases. As it turns out, destructors for 
global variables are registered and called differently depending on whether 
they originate from a static or a shared library. Thus, if you link a library 
in both ways, the destructors are called twice: one call succeeds, the other 
one crashes.
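
To make this a bit more concrete, here is a minimal sketch of the failure 
mode (this is not parquet-cpp's actual code; all names are made up): a global 
object with a non-trivial destructor frees a buffer. If the translation unit 
defining it ends up both statically linked into the executable and inside a 
shared library, and the symbol gets interposed at load time, both registered 
destructor calls can tear down the same instance:

```cpp
// double_free_sketch.cc -- hypothetical stand-in for parquet-cpp's
// global state, only meant to illustrate the mechanism.
#include <cstdio>
#include <cstdlib>

struct GlobalState {
  char* buffer = static_cast<char*>(std::malloc(64));
  ~GlobalState() {
    std::printf("~GlobalState() freeing %p\n", static_cast<void*>(buffer));
    std::free(buffer);  // a second call on the same pointer is a double free
  }
};

// A global with a non-trivial destructor. Linked statically, its
// destructor is registered on the executable's exit path; inside a
// shared library, it runs on the library's unload path. Link the same
// code both ways into one process and, with default symbol visibility,
// both registrations can act on a single interposed instance, so the
// destructor runs twice on the same storage.
GlobalState g_state;

int main() {
  std::printf("using buffer at %p\n", static_cast<void*>(g_state.buffer));
  return 0;  // destructors of globals run after main returns
}
```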

Due to this, I made the change that Arrow is now dynamically linked 
everywhere and Parquet statically. The reason to link parquet-cpp statically 
was that there was no authoritative source for the Parquet library, i.e. we 
produce a pyarrow wheel, not a parquet wheel, so parquet-cpp should be 
private to that wheel. Several people I asked agreed that this was the 
correct way to do such things. This works nicely until you reach the point 
where you embed the Python interpreter in a process that natively depends on 
parquet-cpp and load `pyarrow` as a Python package in there. When a program 
ends, the destructors of all libraries are called, which is fine. But when 
you close your embedded Python interpreter before the end of the whole 
program, all Python libraries are unloaded. Thus the copy of parquet-cpp that 
is statically linked into pyarrow runs its destructors. Sadly, at the end of 
the whole program, the destructors of the libraries that were linked into the 
main program are also called. In this case, the parquet-cpp destructor was 
called a second time and we got a double free again. In contrast to this 
behaviour, if parquet-cpp were dynamically linked in both places, the process 
would hold a reference count for the shared library and the destructor would 
only be called once, on program exit: 
https://issues.apache.org/jira/browse/ARROW-2245
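
For illustration, the embedding scenario boils down to something like the 
following sketch (hypothetical code, not our actual application; it assumes 
the host binary itself links parquet-cpp natively):

```cpp
// embed_sketch.cc -- sketch of the embedding scenario behind ARROW-2245.
#include <Python.h>
#include <cstdio>

int main() {
  Py_Initialize();

  // Importing pyarrow loads its extension modules, including the copy of
  // parquet-cpp that was statically linked into the wheel.
  PyRun_SimpleString("import pyarrow.parquet as pq");

  // Closing the embedded interpreter before the process ends unloads the
  // Python libraries, so the statically linked parquet-cpp copy already
  // runs its destructors here.
  Py_FinalizeEx();

  std::printf("interpreter closed, host process keeps running\n");

  // At process exit, the destructors of the parquet-cpp linked into the
  // host binary run as well -- with the static/dynamic mix described
  // above, the same global state is torn down twice: the double free.
  return 0;
}
```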

This is just one of the funny things we had to debug and I thought it might be 
good to share some insights on one of these packaging issues. In the end, this 
also means that we might need to rethink which packages we actually want to 
statically link in the conda-forge package. My feeling here is that especially 
the larger ones like Thrift and Boost should be dynamically linked.

My main learnings from this are:
 * having control over your build toolchain is key. The pyarrow wheels will 
never be as good as the conda-forge packages; see for example 
https://issues.apache.org/jira/browse/ARROW-1975 and 
https://issues.apache.org/jira/browse/ARROW-2246
 * dynamic linking may not be the ultimate silver bullet for this problem but 

Uwe
