hi Soumith,

On Mon, Dec 17, 2018 at 12:32 AM soumith <soum...@gmail.com> wrote:
>
> I'm reposting my original reply below the current reply (below a dotted 
> line). It was filtered out because I wasn't subscribed to the relevant 
> mailing lists.
>
>  tl;dr: manylinux2010 looks pretty promising, because CUDA supports CentOS6 
> (for now).
>
> In the meantime, I dug into what pyarrow does, and it looks like it
> links with `-static-libstdc++` along with a linker version script [1].

We aren't passing -static-libstdc++. The static linking of certain
symbols (so that C++11 features work on older systems) is handled
automatically by devtoolset-2; we are, however, modifying the
visibility of some of those linked symbols.
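
(For anyone who wants to check a given wheel themselves: here is a rough
Python sketch of how to see which libstdc++-style symbols an extension
module actually exports. It assumes binutils' `nm` is on PATH, and the
.so path below is only a placeholder.)

    # List the dynamic symbols an extension module exports, to confirm that
    # statically linked libstdc++ internals (_ZSt*/_ZNSt*) stay hidden.
    import subprocess

    def exported_std_symbols(so_path):
        out = subprocess.check_output(
            ["nm", "-D", "--defined-only", so_path], universal_newlines=True)
        return [l for l in out.splitlines() if "_ZSt" in l or "_ZNSt" in l]

    # Placeholder path; point it at the extension module you care about.
    for sym in exported_std_symbols("pyarrow/lib.cpython-36m-x86_64-linux-gnu.so"):
        print(sym)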

>
> PyTorch did exactly that until January of this year [2], except that our
> linker version script didn't cover the subtleties of statically linking
> stdc++ as well as Arrow's did. Because we weren't covering all of those
> subtleties, we faced huge issues that amplified wheel incompatibility
> (`import X; import torch` crashing under various X). Hence, we have since
> moved to linking against the system-shipped libstdc++, doing no static
> stdc++ linking.
>

Unless you were using the devtoolset-2 toolchain, you were doing
something different :) My understanding is that passing
-static-libstdc++ with stock gcc or clang is really only appropriate
when building dependency-free binary applications.

> I'll revisit this in light of manylinux2010, and go down the path of static 
> linkage of stdc++ again, though I'm wary of the subtleties around handling of 
> weak symbols, std::string destruction across library boundaries [3] and 
> std::string's ABI incompatibility issues.
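
(Side note on the string ABI point: a quick way to tell which std::string
ABI a given shared library was built against is to look for the gcc5+
"__cxx11" namespace in its dynamic symbols. A rough Python sketch, assuming
binutils' `nm` is available; the library path is only a placeholder.)

    # True if the library defines or references the new (gcc5+) std::string
    # ABI, i.e. symbols in the std::__cxx11 inline namespace.
    import subprocess

    def uses_cxx11_string_abi(so_path):
        out = subprocess.check_output(["nm", "-D", so_path],
                                      universal_newlines=True)
        return "__cxx11" in out

    print(uses_cxx11_string_abi("some_extension.so"))  # placeholder path
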
>
> I've opened a tracking issue here: 
> https://github.com/pytorch/pytorch/issues/15294
>
> I'm looking forward to hearing from the TensorFlow devs if manylinux2010 is 
> sufficient for them, or what additional constraints they have.
>
> As a personal thought, I find multiple libraries in the same process 
> statically linking to stdc++ gross, but without a package manager like 
> Anaconda that actually is willing to deal with the C++-side dependencies, 
> there aren't many options on the table.

IIUC the idea of the devtoolset-* toolchains is that if all libraries
use the same toolchain, then there are no issues. Having multiple
projects pass -static-libstdc++ when linking would indeed be
problematic. The problem we are having is that if any library is built
with devtoolset-2, all libraries need to be in order to be compatible.

>
> References:
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/symbols.map
> [2] https://github.com/pytorch/pytorch/blob/v0.3.1/tools/pytorch.version
> [3] https://github.com/pytorch/pytorch/issues/5400#issuecomment-369428125
> ............................................................................................................................................................
> Hi Philipp,
>
> Thanks a lot for getting a discussion started. I've sunk 100+ hours over
> the last two years making PyTorch wheels play well with OpenCV, TensorFlow
> and other wheels, so I'm glad to see this discussion happening.
>
>
> On the PyTorch wheels, we have been shipping with the minimum glibc and 
> libstdc++ versions we can possibly work with, while keeping two hard 
> constraints:
>
> 1. CUDA support
> 2. C++11 support
>
>
> 1. CUDA support
>
> manylinux1 is not an option, considering CUDA doesn't work on CentOS5. I
> explored this option [1] without success.
>
> manylinux2010 is an option at the moment wrt CUDA, but it's unclear when
> NVIDIA will drop support for CentOS6 out from under us.
> Additionally, cuDNN 7.0 (if I remember correctly) was compiled against
> Ubuntu 12.04 (meaning it assumed a newer glibc than CentOS6 ships), and
> binaries linked against cuDNN refused to run on CentOS6. I requested that
> this constraint be lifted, and the next dot release fixed it.
>
> The reason PyTorch binaries are not manylinux2010 compatible at the
> moment is the next constraint: C++11.

Do we need to involve NVIDIA in this discussion? Having problematic
GPU-enabled libraries in PyPI isn't too good for them either.

>
> 2. C++11
>
> We picked C++11 as the minimum supported dialect for PyTorch, primarily to 
> serve the default compilers of older machines, i.e. Ubuntu 14.04 and CentOS7. 
> The newer options were C++14 / C++17, but we decided to polyfill what we 
> needed to support older distros better.
>
> A fully fleshed-out C++11 implementation landed in gcc in various stages,
> with gradual ABI changes [2]. Unfortunately, the libstdc++ that ships with
> CentOS6 (and hence manylinux2010) isn't sufficient to cover all of C++11.
> For example, the binaries we built with devtoolset-3 (gcc 4.9.2) on CentOS6
> didn't run with the default libstdc++ on CentOS6, either due to ABI changes
> or because the GLIBCXX versions some symbols required were unavailable.
>

Do you have a link to the paper trail about this? I had thought a
major raison d'etre of the devtoolset compilers is to support C++11 on
older Linuxes. For example, we are using C++11 in Arrow but we're
limiting ourselves at present to what's available in gcc 4.8.x; our
binaries work fine on CentOS5 and 6.
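
(If it helps with the paper trail: a rough way to check what a given
binary actually requires from libstdc++ is to list the GLIBCXX version
references in it. A Python sketch, assuming binutils' `readelf` is
installed; the .so path is a placeholder, and IIRC CentOS6's system
libstdc++ tops out at GLIBCXX_3.4.13.)

    # Collect the GLIBCXX symbol versions a shared library refers to, to see
    # whether it can load against an older system libstdc++.
    import re
    import subprocess

    def glibcxx_versions(so_path):
        out = subprocess.check_output(["readelf", "-V", so_path],
                                      universal_newlines=True)
        return sorted(set(re.findall(r"GLIBCXX_[0-9.]+", out)))

    # Placeholder path; point it at the wheel's extension or bundled library.
    print(glibcxx_versions("torch/_C.cpython-36m-x86_64-linux-gnu.so"))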

> We tried our best to keep our binaries running on CentOS6 and above with
> various static-linking hacks until 0.3.1 (January 2018), but at some point
> piling hacks on top of hacks was only getting more fragile. Hence we moved
> to a CentOS7-based image in April 2018 [3], relying only on dynamic linking
> to the system-shipped libstdc++.
>
> As Wes mentions [4], one option that would put manylinux2010 on the table
> is to host a modern C++ standard library via PyPI. There are, however,
> subtle consequences with this -- if such a package gets installed into a
> conda environment, it'll clobber the Anaconda-shipped libstdc++, possibly
> corrupting environments for thousands of Anaconda users (this is similar
> to the issues with `mkl` shipped via PyPI and conda clobbering each other).
>

More evidence that "pip" as a packaging tool may have already outlived
its usefulness to this community.

Somehow we need to arrange for the same compiler toolchain (with a
consistent minimum glibc and libstdc++ version) to be used to build all
of the binaries we are discussing here. Short of that, some system
configurations will continue to have problems.
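
(To make the failure mode concrete, a rough, Linux-only sketch of how to
check whether two extensions ended up mapping different libstdc++ builds
into the same process. The imports are just an example; the import order
can matter in practice, as the crashes above suggest.)

    # After importing the suspect extension modules, list the libstdc++
    # objects actually mapped into this process.
    def loaded_libstdcxx():
        paths = set()
        with open("/proc/self/maps") as maps:
            for line in maps:
                if "libstdc++" in line:
                    paths.add(line.split()[-1])
        return sorted(paths)

    import pyarrow
    import torch
    print(loaded_libstdcxx())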

- Wes

>
> References:
>
> [1] https://github.com/NVIDIA/nvidia-docker/issues/348
> [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
> [3] 
> https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
> [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
> ..............................................................................................................................................................................................
>
> On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Reposting since I wasn't subscribed to develop...@tensorflow.org. I
>> also didn't see Soumith's response since it didn't come through to
>> dev@arrow.apache.org
>>
>> In response to the non-conforming ABI in the TF and PyTorch wheels, we
>> have attempted to hack around the issue with some elaborate
>> workarounds [1] [2] that have ultimately proved to not work
>> universally. The bottom line is that this is burdening other projects
>> in the Python ecosystem and causing confusing application crashes.
>>
>> First, to state what should hopefully be obvious to many of you, Python
>> wheels are not a robust way to deploy complex C++ projects, even
>> setting aside the compiler toolchain issue. If a project has
>> non-trivial third party dependencies, you either have to statically
>> link them or bundle shared libraries with the wheel (we do a bit of
>> both in Apache Arrow). Neither solution is foolproof in all cases.
>> There are other downsides to wheels when it comes to numerical
>> computing -- it is difficult to utilize things like the Intel MKL
>> which may be used by multiple projects. If two projects have the same
>> third party C++ dependency (e.g. let's use gRPC or libprotobuf as a
>> straw man example), it's hard to guarantee that versions or ABI will
>> not conflict with each other.
>>
>> In packaging with conda, we pin all dependencies when building
>> projects that depend on them, then package and deploy the dependencies
>> as separate shared libraries instead of bundling. To resolve the need
>> for newer compilers or newer C++ standard library, libstdc++.so and
>> other system shared libraries are packaged and installed as
>> dependencies. In manylinux1, the RedHat devtoolset compiler toolchain
>> is used as it performs selective static linking of symbols to enable
>> C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda
>> environment functions as a sort of portable miniature Linux
>> distribution.
>>
>> Given the current state of things, where using the TensorFlow and PyTorch
>> wheels in the same process as other conforming manylinux1 wheels is
>> unsafe, it's hard to see how one can continue to recommend pip as a
>> preferred installation path until the ABI problems are resolved. For
>> example, "pip" is what is recommended for installing TensorFlow on
>> Linux [3]. It's unclear that non-compliant wheels should be allowed in
>> the package index at all (I'm aware that verifying policy compliance
>> was deemed not to be PyPI's responsibility [4]).
>>
>> A couple possible paths forward (there may be others):
>>
>> * Collaborate with the Python Packaging Authority to evolve the
>> manylinux ABI to be able to produce compliant wheels that support the
>> build and deployment requirements of these projects.
>> * Create a new ABI tag for CUDA/C++11-enabled Python wheels so that
>> projects can ship packages that can be guaranteed to work properly
>> with TF/PyTorch. This might require vendoring libstdc++ in some kind
>> of "toolchain" wheel that projects using this new ABI can depend on.
>>
>> Note that these toolchain and deployment issues are absent when
>> building and deploying with conda packages, since build- and run-time
>> dependencies can be pinned and shared across all the projects that
>> depend on them, ensuring ABI cross-compatibility. It's great to have
>> the convenience of "pip install $PROJECT", but I believe that these
>> projects have outgrown the intended use for pip and wheel
>> distributions.
>>
>> Until the ABI incompatibilities are resolved, I would encourage more
>> prominent user documentation about the non-portability and potential
>> for crashes with these Linux wheels.
>>
>> Thanks,
>> Wes
>>
>> [1]: 
>> https://github.com/apache/arrow/commit/537e7f7fd503dd920c0b9f0cef8a2de86bc69e3b
>> [2]: 
>> https://github.com/apache/arrow/commit/e7aaf7bf3d3e326b5fe58d20f8fc45b5cec01cac
>> [3]: https://www.tensorflow.org/install/
>> [4]: https://www.python.org/dev/peps/pep-0513/#id50
>> On Sat, Dec 15, 2018 at 11:25 PM Robert Nishihara
>> <robertnishih...@gmail.com> wrote:
>> >
>> > On Sat, Dec 15, 2018 at 8:43 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As some of you know, there is a standard in Python called manylinux (
>> > > https://www.python.org/dev/peps/pep-0513/) to package binary executables
>> > > and libraries into a “wheel” in a way that allows the code to be run on a
>> > > wide variety of Linux distributions. This is very convenient for Python
>> > > users, since such libraries can be easily installed via pip.
>> > >
>> > > This standard is also important for a second reason: If many different
>> > > wheels are used together in a single Python process, adhering to 
>> > > manylinux
>> > > ensures that these libraries work together well and don’t trip on each
>> > > other’s toes (this could easily happen if different versions of libstdc++
>> > > are used for example). Therefore *even if support for only a single
>> > > distribution like Ubuntu is desired*, it is important to be manylinux
>> > > compatible to make sure everybody’s wheels work together well.
>> > >
>> > > TensorFlow and PyTorch unfortunately don’t produce manylinux compatible
>> > > wheels. The challenge is due, at least in part, to the need to use
>> > > nvidia-docker to build GPU binaries [10]. This causes various levels of
>> > > pain for the rest of the Python community, see for example [1] [2] [3] 
>> > > [4]
>> > > [5] [6] [7] [8].
>> > >
>> > > The purpose of the e-mail is to get a discussion started on how we can
>> > > make TensorFlow and PyTorch manylinux compliant. There is a new standard 
>> > > in
>> > > the works [9] so hopefully we can discuss what would be necessary to make
>> > > sure TensorFlow and PyTorch can adhere to this standard in the future.
>> > >
>> > > It would make everybody’s lives just a little bit better! Any ideas are
>> > > appreciated.
>> > >
>> > > @soumith: Could you cc the relevant list? I couldn't find a pytorch dev
>> > > mailing list.
>> > >
>> > > Best,
>> > > Philipp.
>> > >
>> > > [1] https://github.com/tensorflow/tensorflow/issues/5033
>> > > [2] https://github.com/tensorflow/tensorflow/issues/8802
>> > > [3] https://github.com/primitiv/primitiv-python/issues/28
>> > > [4] https://github.com/zarr-developers/numcodecs/issues/70
>> > > [5] https://github.com/apache/arrow/pull/3177
>> > > [6] https://github.com/tensorflow/tensorflow/issues/13615
>> > > [7] https://github.com/pytorch/pytorch/issues/8358
>> > > [8] https://github.com/ray-project/ray/issues/2159
>> > > [9] https://www.python.org/dev/peps/pep-0571/
>> > > [10]
>> > > https://github.com/tensorflow/tensorflow/issues/8802#issuecomment-291935940
>> > >
