Hi Soumith,

On Mon, Dec 17, 2018 at 12:32 AM soumith <soum...@gmail.com> wrote:
>
> I'm reposting my original reply below the current reply (below a dotted line). It was filtered out because I wasn't subscribed to the relevant mailing lists.
>
> tl;dr: manylinux2010 looks pretty promising, because CUDA supports CentOS6 (for now).
>
> In the meantime, I dug into what pyarrow does, and it looks like it links with `static-libstdc++` along with a linker version script [1].

We aren't passing -static-libstdc++. The static linking of certain symbols (so that C++11 features work on older systems) is handled automatically by devtoolset-2; we are modifying the visibility of some of these linked symbols, though.

> PyTorch did exactly that until Jan this year [2], except that our linker version script didn't cover the subtleties of statically linking stdc++ as well as Arrow did. Because we weren't covering all of the stdc++ static linking subtleties, we were facing huge issues that amplified wheel incompatibility (import X; import torch crashing under various X). Hence, we have since moved to linking against the system-shipped libstdc++, doing no static stdc++ linking.

Unless you were using the devtoolset-2 toolchain, you were doing something different :) My understanding is that passing -static-libstdc++ with stock gcc or clang is really only appropriate when building dependency-free binary applications.

> I'll revisit this in light of manylinux2010, and go down the path of static linkage of stdc++ again, though I'm wary of the subtleties around handling of weak symbols, std::string destruction across library boundaries [3], and std::string's ABI incompatibility issues.
>
> I've opened a tracking issue here: https://github.com/pytorch/pytorch/issues/15294
>
> I'm looking forward to hearing from the TensorFlow devs whether manylinux2010 is sufficient for them, or what additional constraints they have.
>
> As a personal thought, I find multiple libraries in the same process statically linking to stdc++ gross, but without a package manager like Anaconda that actually is willing to deal with the C++-side dependencies, there aren't many options on the table.

IIUC, the idea of the devtoolset-* toolchains is that if all libraries use the same toolchain, then there are no issues. Having multiple projects pass -static-libstdc++ when linking would indeed be problematic. The problem we are having is that if any library is built with devtoolset-2, all libraries need to be in order to be compatible.

> References:
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/symbols.map
> [2] https://github.com/pytorch/pytorch/blob/v0.3.1/tools/pytorch.version
> [3] https://github.com/pytorch/pytorch/issues/5400#issuecomment-369428125
>
> ............................................................................
>
> Hi Philipp,
>
> Thanks a lot for getting a discussion started. I've sunk ~100+ hours over the last 2 years making PyTorch wheels play well with OpenCV, TensorFlow, and other wheels, so I'm glad to see this discussion started.
>
> On the PyTorch wheels, we have been shipping with the minimum glibc and libstdc++ versions we can possibly work with, while keeping two hard constraints:
>
> 1. CUDA support
> 2. C++11 support
>
> 1. CUDA support
>
> manylinux1 is not an option, considering CUDA doesn't work on CentOS5. I explored this option [1] to no success.
>
> manylinux2010 is an option at the moment wrt CUDA, but it's unclear when NVIDIA will pull support for CentOS6 out from under us.
>
> Additionally, cuDNN 7.0 (if I remember correctly) was compiled against Ubuntu 12.04 (meaning the glibc version is newer than CentOS6's), and binaries linked against cuDNN refused to run on CentOS6. I requested that this constraint be lifted, and the next dot release fixed it.
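
As an aside for anyone following along: when a binary "refuses to run" like that, the first thing worth checking is which versioned glibc/libstdc++ symbols it actually requires from the host. Below is a rough Python sketch of that check, not taken from any project; it assumes binutils' readelf is on PATH, and the library path passed on the command line is just a placeholder.

    # Rough sketch: list the versioned glibc/libstdc++ symbols a shared object
    # requires; these determine whether it can load on an older distro such as
    # CentOS6. Assumes binutils' readelf is available on PATH.
    import re
    import subprocess
    import sys

    def required_symbol_versions(path):
        out = subprocess.run(["readelf", "-V", path],
                             capture_output=True, text=True, check=True).stdout
        names = set(re.findall(r"\b(GLIBC_[0-9.]+|GLIBCXX_[0-9.]+|CXXABI_[0-9.]+)\b", out))
        def key(name):
            prefix, _, nums = name.partition("_")
            return (prefix, tuple(int(x) for x in nums.split(".")))
        return sorted(names, key=key)

    if __name__ == "__main__":
        # e.g.: python checkver.py /path/to/some_extension.so   (path is a placeholder)
        for name in required_symbol_versions(sys.argv[1]):
            print(name)

Comparing the newest GLIBC_* / GLIBCXX_* entries reported here against what CentOS6 ships makes this class of failure easy to spot.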
>
> The reason PyTorch binaries are not manylinux2010 compatible at the moment is because of the next constraint: C++11.

Do we need to involve NVIDIA in this discussion? Having problematic GPU-enabled libraries in PyPI isn't too good for them either.

> 2. C++11
>
> We picked C++11 as the minimum supported dialect for PyTorch, primarily to serve the default compilers of older machines, i.e. Ubuntu 14.04 and CentOS7. The newer options were C++14 / C++17, but we decided to polyfill what we needed to support older distros better.
>
> A fully fleshed out C++11 implementation landed in gcc in various stages, with gradual ABI changes [2]. Unfortunately, the libstdc++ that ships with CentOS6 (and hence manylinux2010) isn't sufficient to cover all of C++11. For example, the binaries we built with devtoolset-3 (gcc 4.9.2) on CentOS6 didn't run with the default libstdc++ on CentOS6, either due to ABI changes or because the minimum GLIBCXX version for some of the symbols was unavailable.

Do you have a link to the paper trail about this? I had thought a major raison d'être of the devtoolset compilers is to support C++11 on older Linuxes. For example, we are using C++11 in Arrow, but we're limiting ourselves at present to what's available in gcc 4.8.x; our binaries work fine on CentOS5 and 6.

> We tried our best to support our binaries running on CentOS6 and above with various static linking hacks until 0.3.1 (January 2018), but at some point hacks on top of hacks were only getting more fragile. Hence we moved to a CentOS7-based image in April 2018 [3], and relied only on dynamic linking to the system-shipped libstdc++.
>
> As Wes mentions [4], hosting a modern C++ standard library via PyPI is an option that would put manylinux2010 on the table. There are, however, subtle consequences: if this package gets installed into a conda environment, it'll clobber the anaconda-shipped libstdc++, possibly corrupting environments for thousands of Anaconda users (this is actually similar to the issues with `mkl` shipped via PyPI and Conda clobbering each other).

More evidence that "pip" as a packaging tool may have already outlived its usefulness to this community. Somehow we need to arrange that the same compiler toolchain (with a consistent minimum glibc and libstdc++ version) is used to build all of the binaries we are discussing here. Short of that, some system configurations will continue to have problems.

- Wes

> References:
>
> [1] https://github.com/NVIDIA/nvidia-docker/issues/348
> [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
> [3] https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
> [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
>
> ............................................................................
>
> On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Reposting since I wasn't subscribed to develop...@tensorflow.org. I also didn't see Soumith's response since it didn't come through to dev@arrow.apache.org.
>>
>> In response to the non-conforming ABI in the TF and PyTorch wheels, we have attempted to hack around the issue with some elaborate workarounds [1] [2] that have ultimately proved not to work universally. The bottom line is that this is burdening other projects in the Python ecosystem and causing confusing application crashes.
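
For what it's worth, these crashes usually come down to which C++ runtime the process ended up loading once several of these wheels are imported together. A small sketch of the kind of check that helps when debugging such reports; it is Linux-only, and the commented import line is a placeholder for whatever combination is misbehaving:

    # Rough sketch: after importing the suspect wheels, show which C++ runtime
    # libraries this process actually mapped. Linux-only (reads /proc/self/maps).
    # import pyarrow, tensorflow, torch   # placeholder reproduction order

    def loaded_cxx_runtimes():
        found = set()
        with open("/proc/self/maps") as maps:
            for line in maps:
                parts = line.split()
                if len(parts) < 6:
                    continue  # anonymous mapping, no backing file
                path = parts[-1]
                if "libstdc++" in path or "libgcc_s" in path:
                    found.add(path)
        return sorted(found)

    if __name__ == "__main__":
        for path in loaded_cxx_runtimes():
            print(path)

Seeing a bundled copy from one wheel win over the system (or conda) libstdc++ is usually the tell.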
>>
>> First, to state what should hopefully be obvious to many of you, Python wheels are not a robust way to deploy complex C++ projects, even setting aside the compiler toolchain issue. If a project has non-trivial third party dependencies, you either have to statically link them or bundle shared libraries with the wheel (we do a bit of both in Apache Arrow). Neither solution is foolproof in all cases. There are other downsides to wheels when it comes to numerical computing -- it is difficult to utilize things like the Intel MKL, which may be used by multiple projects. If two projects have the same third party C++ dependency (e.g. let's use gRPC or libprotobuf as a straw man example), it's hard to guarantee that their versions or ABI will not conflict with each other.
>>
>> In packaging with conda, we pin all dependencies when building projects that depend on them, then package and deploy the dependencies as separate shared libraries instead of bundling. To resolve the need for newer compilers or a newer C++ standard library, libstdc++.so and other system shared libraries are packaged and installed as dependencies. In manylinux1, the Red Hat devtoolset compiler toolchain is used, as it performs selective static linking of symbols to enable C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda environment functions as a sort of portable miniature Linux distribution.
>>
>> Given the current state of things, using the TensorFlow and PyTorch wheels in the same process as other conforming manylinux1 wheels is unsafe, so it's hard to see how one can continue to recommend pip as a preferred installation path until the ABI problems are resolved. For example, "pip" is what is recommended for installing TensorFlow on Linux [3]. It's unclear whether non-compliant wheels should be allowed in the package index at all (I'm aware it was deemed not to be PyPI's responsibility to verify policy compliance [4]).
>>
>> A couple of possible paths forward (there may be others):
>>
>> * Collaborate with the Python Packaging Authority to evolve the manylinux ABI to be able to produce compliant wheels that support the build and deployment requirements of these projects
>> * Create a new ABI tag for CUDA/C++11-enabled Python wheels so that projects can ship packages that are guaranteed to work properly with TF/PyTorch. This might require vendoring libstdc++ in some kind of "toolchain" wheel that projects using this new ABI can depend on
>>
>> Note that these toolchain and deployment issues are absent when building and deploying with conda packages, since build- and run-time dependencies can be pinned and shared across all the projects that depend on them, ensuring ABI cross-compatibility. It's great to have the convenience of "pip install $PROJECT", but I believe that these projects have outgrown the intended use for pip and wheel distributions.
>>
>> Until the ABI incompatibilities are resolved, I would encourage more prominent user documentation about the non-portability and the potential for crashes with these Linux wheels.
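
One practical note on the compliance side: auditwheel (pip install auditwheel) reports which manylinux policy a built wheel actually satisfies and which external versioned symbols it needs, so projects can at least verify what they are about to upload. A minimal sketch; the wheel filename below is only a placeholder, and auditwheel must be installed separately.

    # Rough sketch: shell out to auditwheel to see which manylinux policy a
    # built wheel satisfies. Requires `pip install auditwheel`; the default
    # filename is a placeholder.
    import subprocess
    import sys

    def audit(wheel_path):
        # Prints the platform tag the wheel is eligible for and the external
        # libraries / versioned symbols it depends on.
        subprocess.run(["auditwheel", "show", wheel_path], check=True)

    if __name__ == "__main__":
        audit(sys.argv[1] if len(sys.argv) > 1 else
              "example_pkg-0.1-cp36-cp36m-linux_x86_64.whl")

(auditwheel repair can also copy external shared libraries into the wheel and retag it, though that doesn't address the toolchain/libstdc++ question discussed above.)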
>>
>> Thanks,
>> Wes
>>
>> [1]: https://github.com/apache/arrow/commit/537e7f7fd503dd920c0b9f0cef8a2de86bc69e3b
>> [2]: https://github.com/apache/arrow/commit/e7aaf7bf3d3e326b5fe58d20f8fc45b5cec01cac
>> [3]: https://www.tensorflow.org/install/
>> [4]: https://www.python.org/dev/peps/pep-0513/#id50
>>
>> On Sat, Dec 15, 2018 at 11:25 PM Robert Nishihara <robertnishih...@gmail.com> wrote:
>> >
>> > On Sat, Dec 15, 2018 at 8:43 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As some of you know, there is a standard in Python called manylinux (https://www.python.org/dev/peps/pep-0513/) to package binary executables and libraries into a “wheel” in a way that allows the code to be run on a wide variety of Linux distributions. This is very convenient for Python users, since such libraries can be easily installed via pip.
>> > >
>> > > This standard is also important for a second reason: if many different wheels are used together in a single Python process, adhering to manylinux ensures that these libraries work together well and don’t trip on each other’s toes (this could easily happen if, for example, different versions of libstdc++ are used). Therefore *even if support for only a single distribution like Ubuntu is desired*, it is important to be manylinux compatible to make sure everybody’s wheels work together well.
>> > >
>> > > TensorFlow and PyTorch unfortunately don’t produce manylinux compatible wheels. The challenge is due, at least in part, to the need to use nvidia-docker to build GPU binaries [10]. This causes various levels of pain for the rest of the Python community; see for example [1] [2] [3] [4] [5] [6] [7] [8].
>> > >
>> > > The purpose of this e-mail is to get a discussion started on how we can make TensorFlow and PyTorch manylinux compliant. There is a new standard in the works [9], so hopefully we can discuss what would be necessary to make sure TensorFlow and PyTorch can adhere to this standard in the future.
>> > >
>> > > It would make everybody’s lives just a little bit better! Any ideas are appreciated.
>> > >
>> > > @soumith: Could you cc the relevant list? I couldn't find a pytorch dev mailing list.
>> > >
>> > > Best,
>> > > Philipp.
>> > >
>> > > [1] https://github.com/tensorflow/tensorflow/issues/5033
>> > > [2] https://github.com/tensorflow/tensorflow/issues/8802
>> > > [3] https://github.com/primitiv/primitiv-python/issues/28
>> > > [4] https://github.com/zarr-developers/numcodecs/issues/70
>> > > [5] https://github.com/apache/arrow/pull/3177
>> > > [6] https://github.com/tensorflow/tensorflow/issues/13615
>> > > [7] https://github.com/pytorch/pytorch/issues/8358
>> > > [8] https://github.com/ray-project/ray/issues/2159
>> > > [9] https://www.python.org/dev/peps/pep-0571/
>> > > [10] https://github.com/tensorflow/tensorflow/issues/8802#issuecomment-291935940
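
To make Philipp's point about libstdc++ mismatches concrete: the flip side of the readelf sketch earlier in this message is checking what the host's libstdc++ actually provides. Another rough sketch, with the caveat that the library path below is just a common RHEL-family location and will differ across distributions:

    # Rough sketch: list the GLIBCXX version strings the host libstdc++ exports.
    # An extension that requires a newer GLIBCXX than anything listed here will
    # fail to load. The default path is a common RHEL-family location and is
    # only a placeholder.
    import re

    def provided_glibcxx(path="/usr/lib64/libstdc++.so.6"):
        with open(path, "rb") as lib:
            data = lib.read()
        versions = {v.decode() for v in re.findall(rb"GLIBCXX_[0-9]+(?:\.[0-9]+)*", data)}
        # Sort numerically so that e.g. 3.4.21 sorts after 3.4.9
        return sorted(versions,
                      key=lambda v: [int(x) for x in v.split("_")[1].split(".")])

    if __name__ == "__main__":
        print(provided_glibcxx())

Comparing this list against what a wheel's extension modules require is, in effect, the compatibility question the manylinux policies are trying to settle once and for all.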