Re: Arrow C++/Python real-world linking issues

2018-03-02 Thread Philipp Moritz
Hey Uwe,

Thanks for sharing this perspective. In Ray we are currently building
everything from source and statically linking all the libraries, which has
been working but incurs long build times for people who want to build
everything from source. For the wheels we then ship the prebuilt binaries, and
we also include our own private copy of pyarrow so that it does not clash with
anything the user has installed.

-- Philipp.


Arrow C++/Python real-world linking issues

2018-03-02 Thread Uwe L. Korn
Hello all,

as we're intensive Arrow users in different environments, and each of them
raises its own set of bugs, I thought I would share some insights into various
things and give a bit of background on the issues I have raised and will raise.
In my setting, there is a Python wheel `pyarrow`, a wheel `turbodbc` that
depends on that wheel, various pure Python packages that depend on them, as
well as some C++ code that depends on Arrow C++ (incl. libarrow_python) and
Boost.

At first, we linked Parquet & Arrow as private static libraries into the
C++ code and used the Python wheels just as they are produced in the official
build. This led to the issue that parquet-cpp was linked statically in one
case and dynamically in the other. As parquet-cpp has global variables that
are destructed on library unload / program exit, these destructors were
crashing with a double-free in some cases. As it turns out, destructors for
global variables are registered and called differently depending on whether
they originate from a static or a shared library. Thus, if you link a library
in both ways, they are called twice: one call succeeds, the other one crashes.
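
To make the mechanism concrete, here is a minimal sketch of how this can
happen (hypothetical file names and build commands, not the actual parquet-cpp
or Arrow build):

    // state.cc: hypothetical library code with a global that owns memory,
    // similar in spirit to the global state parquet-cpp tears down at exit.
    #include <vector>

    struct GlobalCache {
      std::vector<int>* data = new std::vector<int>{1, 2, 3};
      ~GlobalCache() { delete data; }   // registered via __cxa_atexit
    };

    GlobalCache g_cache;                // global with a non-trivial destructor

    // Illustrative build that links the same code in two ways:
    //   g++ -fPIC -shared state.cc -o libstate.so    # shared copy
    //   g++ main.cc state.cc -L. -lstate -o app      # static copy + shared copy
    //
    // With default symbol visibility the two definitions of g_cache are bound
    // to a single object at load time, but each copy of the code registers its
    // own destructor call. At process exit ~GlobalCache runs twice on the same
    // object, and the second `delete data` is a double-free.

Whether the two copies actually collide like this depends on symbol visibility
and link order, which presumably explains why the crash only appeared in some
configurations.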

Due to this, I made the change that Arrow is now dynamically linked
everywhere and Parquet statically. The reason to link parquet-cpp statically
was that there was no authoritative source for the Parquet library, i.e. we
produce a pyarrow wheel, not a parquet wheel, so parquet should be a private
detail of that library. Several people I asked agreed that this was the correct
way to do such things. This works nicely until you reach the point where you
embed the Python interpreter in a process that natively depends on parquet-cpp
and load `pyarrow` as a Python package in there. When you end a program, the
destructors of all libraries are called, which is fine. But when you close your
embedded Python interpreter before the end of the whole program, all Python
libraries are unloaded. Thus the parquet-cpp that is statically linked into
pyarrow runs its destructors. Sadly, at the end of the whole program, the
destructors of all libraries that were linked into the main program are also
called. In this case the parquet-cpp destructors were called a second time and
we got a double-free again. In contrast to this behaviour, if parquet-cpp were
dynamically linked in both places, the process would hold a reference count for
the library and the destructors would only be called once, on program exit:
https://issues.apache.org/jira/browse/ARROW-2245
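
For illustration, the host program in that scenario has roughly the following
shape (a hypothetical embedding sketch, not the actual application; the
comments mark where each round of destructors runs under the behaviour
described above):

    // host.cc: hypothetical program that embeds Python and also uses
    // parquet-cpp natively (build flags for libpython/libparquet omitted).
    #include <Python.h>

    int main() {
      // ... native work against the host's own parquet-cpp ...

      Py_Initialize();
      // Importing pyarrow loads its extension modules, which carry their own
      // statically linked copy of parquet-cpp.
      PyRun_SimpleString("import pyarrow.parquet");
      Py_Finalize();
      // In the scenario above the extension modules are unloaded here, so the
      // destructors of the statically linked parquet-cpp copy run at this point.

      // ... more native work ...
      return 0;
      // At process exit the destructors of the parquet-cpp that the host links
      // against run as well: the second call, and the double-free.
    }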

This is just one of the funny things we had to debug and I thought it might be 
good to share some insights on one of these packaging issues. In the end, this 
also means that we might need to rethink which packages we actually want to 
statically link in the conda-forge package. My feeling here is that especially 
the larger ones like Thrift and Boost should be dynamically linked.

My main learnings from this are:
 * having control over your build toolchain is key. The pyarrow wheels will 
never be as good as the conda-forge packages (see, e.g., 
https://issues.apache.org/jira/browse/ARROW-1975 and 
https://issues.apache.org/jira/browse/ARROW-2246).
 * dynamic linking may not be the ultimate silver bullet for this problem but 

Uwe