Hi Manuel,

Adding a couple more folks from Apache Arrow to the thread to make
sure they see this discussion.

On Tue, Jan 22, 2019 at 3:48 AM Manuel Klimek <kli...@google.com> wrote:
>
> Sorry if I'm missing something fundamental, but it seems like a new manylinux 
> standard would come with the same problem of basically being static and 
> growing outdated.
>
> I'd be interested in helping to provide a toolchain wheel, as mentioned in 
> the initial post, at least for libc++ (potentially libstdc++) - it seems like 
> that could be updated on an ongoing basis, use standard dependency management 
> and if necessary be bootstrapped with a statically linked compiler.
>
> What would the requirements for such a toolchain wheel be for it to have a 
> chance to be widely used? (note that I come from a C++ background and don't 
> have a lot of experience with Python outside of modules using C++ under the 
> hood :)

In principle I would think that the requirement would be that we
demonstrate that wheels built with the newer compiler toolchain and
libstdc++ dependency can coexist with manylinux1 / manylinux2010
packages. This is supposed to be the promise of devtoolset-produced
libraries anyhow. A potential problem might be projects that need to
pass std::* objects between shared libraries in their C++ API. For
example, the "turbodbc" package uses the "pyarrow" package's C++ API.
This would just mean that any wheel that needs to depend on a wheel in
the "TF/PyTorch-compatible toolchain" ecosystem would itself need to be
built with the alternative toolchain instead of manylinux*.
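
To make "coexist" a bit more concrete, here is a minimal sketch of the
kind of check I have in mind (the module paths are hypothetical, and it
assumes binutils' objdump is on the PATH): list the GLIBCXX symbol
versions that extension modules from each toolchain actually require,
and compare against what the oldest libstdc++.so.6 we claim to support
provides.

    # Sketch: list the GLIBCXX_* symbol versions a shared object requires,
    # by parsing `objdump -T` output. The .so paths below are hypothetical.
    import re
    import subprocess

    def required_glibcxx(path):
        out = subprocess.check_output(["objdump", "-T", path],
                                      universal_newlines=True)
        return sorted(set(re.findall(r"GLIBCXX_[0-9.]+", out)))

    for so in ["pyarrow/lib.cpython-37m-x86_64-linux-gnu.so",   # manylinux-built
               "torch/_C.cpython-37m-x86_64-linux-gnu.so"]:     # devtoolset-built
        print(so, required_glibcxx(so))

If every version required across both sets is available in the target
system's libstdc++, the wheels should at least load side by side.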

If I'm reading the room right, it seems that manylinux2010 is
effectively DOA for TensorFlow and PyTorch, is that right? If that's
the case then we shouldn't spend another year or more wringing our
hands in hopes that the PyPA solves the problem in the way that we
need. We've got to get busy shipping software and move on with our
lives.

- Wes

>
> Similarly, what would the downsides of such a toolchain wheel be?
>
> On Wednesday, December 19, 2018 at 5:07:49 AM UTC+1, Jay Furmanek wrote:
>>
>> Hi Martin,
>>
>> If the goal here is to propose a new manylinux standard, I'd love to be 
>> involved as well. Currently, the existing standards exclude alternative 
>> (non-Intel) CPU architectures and specify baseline platform versions that predate the 
>> existence of ppc64le and arm64 as Linux architectures. I could lend some 
>> insight to make the proposal a little more acceptable to those arches.
>>
>>
>> Jason M. Furmanek
>> Power Systems and Open Power Innovation and Solutions
>> IBM Systems & Technology Group
>> Mobile: 1-512-638-9692
>> E-mail: furm...@us.ibm.com
>>
>>
>>
>> ----- Original message -----
>> From: "'Martin Wicke' via TensorFlow Developers" <devel...@tensorflow.org>
>> To: soumith <sou...@gmail.com>
>> Cc: Jean-Marc Ludwig <jlu...@nvidia.com>, bu...@tensorflow.org, Wes McKinney 
>> <wesm...@gmail.com>, d...@arrow.apache.org, Philipp Moritz 
>> <pcmo...@gmail.com>, TensorFlow Developers <devel...@tensorflow.org>, 
>> ray...@googlegroups.com, yi...@yifeifeng.com, Edd Wilder-James 
>> <e...@google.com>
>> Subject: Re: TensorFlow, PyTorch, and manylinux1
>> Date: Mon, Dec 17, 2018 5:49 PM
>>
>> I have created a fork of tensorflow/community and added a file:
>> https://github.com/martinwicke/community/blob/master/sigs/build/manylinux-proposal.md
>>
>> It's presently empty.
>>
>> I've invited Soumith, Wes, and Philipp to collaborate on the repo; shall we 
>> work on this there? If anybody else wants to join, just let me know.
>>
>> On Mon, Dec 17, 2018 at 1:55 PM soumith <sou...@gmail.com> wrote:
>>
>> > The group on this thread is a good start, maybe we can get together and 
>> > make a proposal that meets the need of the scientific computing community? 
>> > I think that would probably involve updating the minimum requirements 
>> > (possibly to CentOS 7, I heard there was talk of a manylinux2014), carving 
>> > out NVIDIA libraries, and creating a smoother path for updating these 
>> > requirements (maybe a manylinux-rolling, which automatically updates 
>> > maximum versions based on age or support status without requiring new 
>> > PEPs).
>>
>> Martin, this sounds great. I'm really looking forward to the day where 
>> pytorch package binary sizes aren't heavily bloated because we have to ship 
>> with all of the CUDA / CuDNN / NCCL bits.
>>
>> Is there a github issue or a private google doc that we can collaborate on 
>> to distill our thoughts and requirements into a proposal? We can propose a 
>> manylinux2014 (or realize that manylinux2010 is somehow sufficient), as well 
>> as push NVIDIA to address the distribution situation of the CUDA stack.
>>
>> --
>> S
>>
>> On Mon, Dec 17, 2018 at 12:31 PM Martin Wicke <wi...@google.com> wrote:
>>
>> Thank you, Philipp, for getting this started. We've been trying to get in 
>> touch, including via Nick Coghlan and Nathaniel Smith, but we never got 
>> far.
>>
>> I'm a little late to the party, but basically, what Soumith said. We have 
>> the exact same constraints (C++11, CUDA/cuDNN). These would be extremely 
>> common for any computation-heavy packages, and properly solving this issue 
>> would be a huge boon for the Python community.
>>
>> Actual compliance with manylinux1 is out since it cannot fulfill those 
>> constraints. I'll also add that there is no way to build compliant wheels 
>> without using software beyond end-of-life (even beyond security updates).
>>
>> manylinux2010 is indeed promising, and I saw that Nick merged support for it 
>> recently, though I don't think there has been a pip release including the 
>> support yet (maybe that has now changed?).
>>
>> However, manylinux2010 still has (possibly fatal) problems:
>>
>> - CUDA 10's minimum versions are higher than manylinux2010's maximum 
>> versions: specifically, GCC 4.4.7 > 4.3.0.
>>
>> - NVIDIA's license terms for CUDA/cuDNN are not standard and redistribution 
>> can be problematic, and may depend on agreements you may have with NVIDIA. 
>> The libraries are also large, and including them would make distribution via 
>> pypi problematic. It would be much preferable if there were an approved way 
>> to distribute Python packages that depend on an external CUDA/cuDNN. I don't 
>> think this should be a problem; it is similar in spirit to the exception 
>> made for libGL.
>>
>> I've added JM Ludwig to this thread; as was mentioned by someone else, 
>> having NVIDIA in the conversation is critical.
>>
>> The group on this thread is a good start, maybe we can get together and make 
>> a proposal that meets the need of the scientific computing community? I 
>> think that would probably involve updating the minimum requirements 
>> (possibly to CentOS 7, I heard there was talk of a manylinux2014), carving 
>> out NVIDIA libraries, and creating a smoother path for updating these 
>> requirements (maybe a manylinux-rolling, which automatically updates maximum 
>> versions based on age or support status without requiring new PEPs).
>>
>> I'm very interested in solving this problem; I feel bad for abusing the 
>> manylinux1 tag.
>>
>> Martin
>>
>> On Sun, Dec 16, 2018 at 10:32 PM soumith <sou...@gmail.com> wrote:
>>
>> I'm reposting my original reply below the current reply (below a dotted 
>> line). It was filtered out because I wasn't subscribed to the relevant 
>> mailing lists.
>>
>>  tl;dr: manylinux2010 looks pretty promising, because CUDA supports CentOS6 
>> (for now).
>>
>> In the meantime, I dug into what pyarrow does, and it looks like it links 
>> with `-static-libstdc++` along with a linker version script [1].
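>> 
>> For anyone unfamiliar with the version-script approach: the property it buys 
>> you is that the extension exports only its own whitelisted symbols, and none 
>> of the statically linked libstdc++ ones. A minimal sketch of how to spot 
>> leaks, assuming binutils' nm is available (the .so path is hypothetical):
>> 
>>     # Sketch: flag std:: symbols a built extension still exports dynamically,
>>     # which would defeat -static-libstdc++ plus a linker version script.
>>     import subprocess
>> 
>>     def exported(path):
>>         out = subprocess.check_output(["nm", "-D", "--defined-only", path],
>>                                       universal_newlines=True)
>>         return [line.split()[-1] for line in out.splitlines() if line.strip()]
>> 
>>     so = "pyarrow/lib.cpython-37m-x86_64-linux-gnu.so"   # hypothetical path
>>     leaked = [s for s in exported(so)
>>               if s.startswith(("_ZSt", "_ZNSt"))]        # mangled std:: names
>>     print(len(leaked), "std:: symbols exported")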
>>
>> PyTorch did exactly that until Jan this year [2], except that our linker 
>> version script didn't cover the subtleties of statically linking libstdc++ as 
>> well as Arrow's did. Because we weren't covering all of those subtleties, we 
>> were facing huge issues that amplified wheel incompatibility (import X; 
>> import torch crashing under various X). Hence, we have since moved to linking 
>> against the system-shipped libstdc++, with no static libstdc++ linking.
>>
>> I'll revisit this in light of manylinux2010, and go down the path of static 
>> linkage of stdc++ again, though I'm wary of the subtleties around handling 
>> of weak symbols, std::string destruction across library boundaries [3] and 
>> std::string's ABI incompatibility issues.
>>
>> I've opened a tracking issue here: 
>> https://github.com/pytorch/pytorch/issues/15294
>>
>> I'm looking forward to hearing from the TensorFlow devs if manylinux2010 is 
>> sufficient for them, or what additional constraints they have.
>>
>> As a personal thought, I find multiple libraries in the same process 
>> statically linking to stdc++ gross, but without a package manager like 
>> Anaconda that actually is willing to deal with the C++-side dependencies, 
>> there aren't many options on the table.
>>
>> References:
>>
>> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/symbols.map
>> [2] https://github.com/pytorch/pytorch/blob/v0.3.1/tools/pytorch.version
>> [3] https://github.com/pytorch/pytorch/issues/5400#issuecomment-369428125
>> ............................................................................................................................................................
>> Hi Philipp,
>>
>> Thanks a lot for getting a discussion started. I've sunk ~100+ hours over 
>> the last 2 years making PyTorch wheels play well with OpenCV, TensorFlow and 
>> other wheels, so I'm glad to see this discussion started.
>>
>>
>> On the PyTorch wheels, we have been shipping with the minimum glibc and 
>> libstdc++ versions we can possibly work with, while keeping two hard 
>> constraints:
>>
>> 1. CUDA support
>> 2. C++11 support
>>
>>
>> 1. CUDA support
>>
>> manylinux1 is not an option, considering CUDA doesn't work on CentOS5. I 
>> explored this option [1] to no success.
>>
>> manylinux2010 is an option at the moment wrt CUDA, but it's unclear when 
>> NVIDIA will drop support for CentOS6 out from under us.
>> Additionally, CuDNN 7.0 (if I remember correctly) was compiled against 
>> Ubuntu 12.04 (meaning its required glibc is newer than CentOS6's), and binaries linked 
>> against CuDNN refused to run on CentOS6. I requested that this constraint be 
>> lifted, and the next dot release fixed it.
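>> 
>> For reference, the glibc side of this is easy to check from Python. A small 
>> sketch using ctypes against the running system's libc (CentOS6 ships glibc 
>> 2.12, while Ubuntu 12.04 shipped 2.15):
>> 
>>     # Sketch: print the glibc version of the system we're running on.
>>     import ctypes
>> 
>>     libc = ctypes.CDLL("libc.so.6")
>>     libc.gnu_get_libc_version.restype = ctypes.c_char_p
>>     print(libc.gnu_get_libc_version().decode())   # e.g. "2.12" on CentOS6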
>>
>> The reason PyTorch binaries are not manylinux2010 compatible at the moment 
>> is the next constraint: C++11.
>>
>> 2. C++11
>>
>> We picked C++11 as the minimum supported dialect for PyTorch, primarily to 
>> serve the default compilers of older distros, i.e. Ubuntu 14.04 and 
>> CentOS7. The newer options were C++14 / C++17, but we decided to polyfill 
>> what we needed to support older distros better.
>>
>> A fully fleshed out C++11 implementation landed in gcc in various stages, 
>> with gradual ABI changes [2]. Unfortunately, the libstdc++ that ships with 
>> CentOS6 (and hence manylinux2010) isn't sufficient to cover all of C++11. For 
>> example, the binaries we built with devtoolset3 (gcc 4.9.2) on CentOS6 
>> didn't run with the default libstdc++ on CentOS6, either due to ABI changes 
>> or because the minimum GLIBCXX version required by some symbols was unavailable.
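>> 
>> The missing-GLIBCXX situation is easy to see for yourself, since the version 
>> strings baked into a libstdc++.so.6 tell you which GLIBCXX levels it can 
>> satisfy. A small sketch; the path is typical for 64-bit CentOS but may 
>> differ elsewhere:
>> 
>>     # Sketch: list the GLIBCXX_* versions a given libstdc++.so.6 provides.
>>     import re
>> 
>>     with open("/usr/lib64/libstdc++.so.6", "rb") as f:   # typical CentOS path
>>         data = f.read()
>>     provided = sorted(set(m.decode()
>>                           for m in re.findall(rb"GLIBCXX_[0-9.]+", data)))
>>     print(provided)   # CentOS6's stock libstdc++ stops at GLIBCXX_3.4.13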
>>
>> We tried our best to support our binaries running on CentOS6 and above with 
>> various static linking hacks until 0.3.1 (January 2018), but at 
>> some point piling hacks on hacks was only getting more fragile. Hence we moved to 
>> a CentOS7-based image in April 2018 [3], and relied only on dynamic linking 
>> to the system-shipped libstdc++.
>>
>> As Wes mentions [4], one option that would put manylinux2010 on the table is 
>> to host a modern C++ standard library via PyPI. There are, however, subtle 
>> consequences with this -- if this package gets installed into a conda 
>> environment, it'll clobber anaconda-shipped libstdc++, possibly corrupting 
>> environments for thousands of anaconda users (this is actually similar to 
>> the issues with `mkl` shipped via PyPI and Conda clobbering each other).
>>
>>
>> References:
>>
>> [1] https://github.com/NVIDIA/nvidia-docker/issues/348
>> [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
>> [3] 
>> https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
>> [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
>> ..............................................................................................................................................................................................
>>
>> On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney <wesm...@gmail.com> wrote:
>>
>> Reposting since I wasn't subscribed to devel...@tensorflow.org. I
>> also didn't see Soumith's response since it didn't come through to
>> d...@arrow.apache.org
>>
>> In response to the non-conforming ABI in the TF and PyTorch wheels, we
>> have attempted to hack around the issue with some elaborate
>> workarounds [1] [2] that have ultimately proved to not work
>> universally. The bottom line is that this is burdening other projects
>> in the Python ecosystem and causing confusing application crashes.
>>
>> First, to state what should hopefully be obvious to many of you, Python
>> wheels are not a robust way to deploy complex C++ projects, even
>> setting aside the compiler toolchain issue. If a project has
>> non-trivial third party dependencies, you either have to statically
>> link them or bundle shared libraries with the wheel (we do a bit of
>> both in Apache Arrow). Neither solution is foolproof in all cases.
>> There are other downsides to wheels when it comes to numerical
>> computing -- it is difficult to utilize things like the Intel MKL
>> which may be used by multiple projects. If two projects have the same
>> third party C++ dependency (e.g. let's use gRPC or libprotobuf as a
>> straw man example), it's hard to guarantee that versions or ABI will
>> not conflict with each other.
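>> 
>> As a rough way to see how much duplication the bundling approach already 
>> causes, here is a sketch that scans site-packages for shared libraries 
>> vendored by more than one installed package. It assumes a standard CPython 
>> install where site.getsitepackages() is available:
>> 
>>     # Sketch: find shared libraries bundled by more than one installed package.
>>     import collections
>>     import pathlib
>>     import site
>> 
>>     copies = collections.defaultdict(set)
>>     for sp in site.getsitepackages():
>>         root = pathlib.Path(sp)
>>         for so in root.rglob("*.so*"):
>>             top = so.relative_to(root).parts[0]       # owning top-level package
>>             copies[so.name.split(".so")[0]].add(top)  # e.g. "libmkl_rt"
>> 
>>     for lib, owners in sorted(copies.items()):
>>         if len(owners) > 1:
>>             print(lib, "bundled by", sorted(owners))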
>>
>> In packaging with conda, we pin all dependencies when building
>> projects that depend on them, then package and deploy the dependencies
>> as separate shared libraries instead of bundling. To resolve the need
>> for newer compilers or newer C++ standard library, libstdc++.so and
>> other system shared libraries are packaged and installed as
>> dependencies. In manylinux1, the RedHat devtoolset compiler toolchain
>> is used as it performs selective static linking of symbols to enable
>> C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda
>> environment functions as a sort of portable miniature Linux
>> distribution.
>>
>> Given the current state of things, as using the TensorFlow and PyTorch
>> wheels in the same process as other conforming manylinux1 wheels is
>> unsafe, it's hard to see how one can continue to recommend pip as a
>> preferred installation path until the ABI problems are resolved. For
>> example, "pip" is what is recommended for installing TensorFlow on
>> Linux [3]. It's unclear that non-compliant wheels should be allowed in
>> the package manager at all (I'm aware that this was deemed to not be
>> the responsibility of PyPI to verify policy compliance [4]).
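>> 
>> Anyone can at least check a given wheel's status locally. A small sketch 
>> that shells out to PyPA's auditwheel tool; the wheel filename here is just 
>> an example:
>> 
>>     # Sketch: report which manylinux policy, if any, a wheel actually satisfies.
>>     import subprocess
>> 
>>     wheel = "tensorflow-1.12.0-cp36-cp36m-manylinux1_x86_64.whl"  # example file
>>     print(subprocess.check_output(["auditwheel", "show", wheel],
>>                                   universal_newlines=True))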
>>
>> A couple possible paths forward (there may be others):
>>
>> * Collaborate with the Python packaging authority to evolve the
>> manylinux ABI to be able to produce compliant wheels that support the
>> build and deployment requirements of these projects
>> * Create a new ABI tag for CUDA/C++11-enabled Python wheels so that
>> projects can ship packages that can be guaranteed to work properly
>> with TF/PyTorch. This might require vendoring libstdc++ in some kind
>> of "toolchain" wheel that projects using this new ABI can depend on
>>
>> Note that these toolchain and deployment issues are absent when
>> building and deploying with conda packages, since build- and run-time
>> dependencies can be pinned and shared across all the projects that
>> depend on them, ensuring ABI cross-compatibility. It's great to have
>> the convenience of "pip install $PROJECT", but I believe that these
>> projects have outgrown the intended use for pip and wheel
>> distributions.
>>
>> Until the ABI incompatibilities are resolved, I would encourage more
>> prominent user documentation about the non-portability and potential
>> for crashes with these Linux wheels.
>>
>> Thanks,
>> Wes
>>
>> [1]: 
>> https://github.com/apache/arrow/commit/537e7f7fd503dd920c0b9f0cef8a2de86bc69e3b
>> [2]: 
>> https://github.com/apache/arrow/commit/e7aaf7bf3d3e326b5fe58d20f8fc45b5cec01cac
>> [3]: https://www.tensorflow.org/install/
>> [4]: https://www.python.org/dev/peps/pep-0513/#id50
>> On Sat, Dec 15, 2018 at 11:25 PM Robert Nishihara
>> <robertn...@gmail.com> wrote:
>> >
>> > On Sat, Dec 15, 2018 at 8:43 PM Philipp Moritz <pcmo...@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As some of you know, there is a standard in Python called manylinux (
>> > > https://www.python.org/dev/peps/pep-0513/) to package binary executables
>> > > and libraries into a “wheel” in a way that allows the code to be run on a
>> > > wide variety of Linux distributions. This is very convenient for Python
>> > > users, since such libraries can be easily installed via pip.
>> > >
>> > > This standard is also important for a second reason: If many different
>> > > wheels are used together in a single Python process, adhering to 
>> > > manylinux
>> > > ensures that these libraries work together well and don’t trip on each
>> > > other’s toes (this could easily happen if different versions of libstdc++
>> > > are used for example). Therefore *even if support for only a single
>> > > distribution like Ubuntu is desired*, it is important to be manylinux
>> > > compatible to make sure everybody’s wheels work together well.
>> > >
>> > > TensorFlow and PyTorch unfortunately don’t produce manylinux compatible
>> > > wheels. The challenge is due, at least in part, to the need to use
>> > > nvidia-docker to build GPU binaries [10]. This causes various levels of
>> > > pain for the rest of the Python community, see for example [1] [2] [3] 
>> > > [4]
>> > > [5] [6] [7] [8].
>> > >
>> > > The purpose of the e-mail is to get a discussion started on how we can
>> > > make TensorFlow and PyTorch manylinux compliant. There is a new standard 
>> > > in
>> > > the works [9] so hopefully we can discuss what would be necessary to make
>> > > sure TensorFlow and PyTorch can adhere to this standard in the future.
>> > >
>> > > It would make everybody’s lives just a little bit better! Any ideas are
>> > > appreciated.
>> > >
>> > > @soumith: Could you cc the relevant list? I couldn't find a pytorch dev
>> > > mailing list.
>> > >
>> > > Best,
>> > > Philipp.
>> > >
>> > > [1] https://github.com/tensorflow/tensorflow/issues/5033
>> > > [2] https://github.com/tensorflow/tensorflow/issues/8802
>> > > [3] https://github.com/primitiv/primitiv-python/issues/28
>> > > [4] https://github.com/zarr-developers/numcodecs/issues/70
>> > > [5] https://github.com/apache/arrow/pull/3177
>> > > [6] https://github.com/tensorflow/tensorflow/issues/13615
>> > > [7] https://github.com/pytorch/pytorch/issues/8358
>> > > [8] https://github.com/ray-project/ray/issues/2159
>> > > [9] https://www.python.org/dev/peps/pep-0571/
>> > > [10]
>> > > https://github.com/tensorflow/tensorflow/issues/8802#issuecomment-291935940