Hey Travis,

PyTorch and Anaconda actually work smoothly together. There are no issues
with Anaconda, and we officially maintain conda packages (conda is also our
recommended and default package manager).

Conda-forge recipes are currently not possible because conda-forge hasn't
finalized their CUDA packaging mechanisms.

This thread is mostly focusing on unscrewing the PyPI situation.
--
S

On Mon, Dec 17, 2018 at 9:54 AM Travis Oliphant <tra...@quansight.com>
wrote:

> Can PyTorch provide and maintain a conda-forge recipe?
>
> This would allow the large and growing conda forge ecosystem to easily
> install PyTorch in a community-supported way.
>
> Are there problems with using conda or another general package manager?
>
> I agree that the machine learning packages are trying to make a
> language-specific package manager do more than it was intended to do,
> when other open source solutions already exist.
>
> Thanks,
>
> Travis
>
>
> On Mon, Dec 17, 2018, 12:32 AM soumith <soum...@gmail.com> wrote:
>
> > I'm reposting my original reply below the current reply (below a dotted
> > line). It was filtered out because I wasn't subscribed to the relevant
> > mailing lists.
> >
> > tl;dr: manylinux2010 looks pretty promising, because CUDA supports
> > CentOS6 (for now).
> >
> > In the meantime, I dug into what pyarrow does, and it looks like it
> > links with `static-libstdc++` along with a linker version script [1].
> >
> > PyTorch did exactly that until January of this year [2], except that our
> > linker version script didn't cover the subtleties of statically linking
> > stdc++ as well as Arrow's does. Because we weren't covering all of the
> > stdc++ static linking subtleties, we were facing huge wheel
> > incompatibility issues (`import X; import torch` crashing under various
> > X). Hence, we have since moved to linking against the system-shipped
> > libstdc++, doing no static stdc++ linking at all.
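> >
> > For the curious, here's a minimal sketch of that approach. The library
> > name and symbol below are hypothetical, but `-static-libstdc++` and
> > `-Wl,--version-script` are the actual gcc/ld mechanisms involved:
> >
> >     // mylib.cpp -- built roughly the way Arrow builds its library:
> >     //   g++ -shared -fPIC mylib.cpp -o libmylib.so \
> >     //       -static-libstdc++ -Wl,--version-script=symbols.map
> >     //
> >     // symbols.map keeps only the library's own namespace visible:
> >     //   { global: extern "C++" { mylib::*; };
> >     //     local:  *; };
> >     #include <string>
> >     namespace mylib {
> >     std::string hello() { return "hello"; }  // exported: matches mylib::*
> >     }
> >     // Every statically linked libstdc++ symbol falls under "local: *"
> >     // and stays hidden, so it can't clash with another wheel's copy of
> >     // libstdc++ at import time.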
> >
> > I'll revisit this in light of manylinux2010, and go down the path of
> > static linkage of stdc++ again, though I'm wary of the subtleties around
> > the handling of weak symbols, std::string destruction across library
> > boundaries [3], and std::string's ABI incompatibility issues.
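> >
> > To make that last point concrete, a sketch (file names made up; the
> > macro is gcc's real dual-ABI switch): a std::string built under one
> > setting of _GLIBCXX_USE_CXX11_ABI won't link against code built under
> > the other, because the two ABIs mangle differently:
> >
> >     // greeter.cpp -- built with the pre-C++11 string ABI:
> >     //   g++ -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -shared greeter.cpp -o libgreeter.so
> >     #include <string>
> >     std::string greet() { return "hello"; }
> >
> >     // main.cpp -- built with the (default) new ABI:
> >     //   g++ main.cpp -L. -lgreeter
> >     #include <iostream>
> >     #include <string>
> >     std::string greet();  // expects std::__cxx11::basic_string<...>,
> >                           // which libgreeter.so doesn't define, so the
> >                           // link (or dlopen) fails: undefined symbol
> >     int main() { std::cout << greet() << "\n"; }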
> >
> > I've opened a tracking issue here:
> > https://github.com/pytorch/pytorch/issues/15294
> >
> > I'm looking forward to hearing from the TensorFlow devs if manylinux2010
> > is sufficient for them, or what additional constraints they have.
> >
> > As a personal thought, I find multiple libraries in the same process
> > statically linking to stdc++ gross, but without a package manager like
> > Anaconda that is actually willing to deal with the C++-side
> > dependencies, there aren't many options on the table.
> >
> > References:
> >
> > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/symbols.map
> > [2] https://github.com/pytorch/pytorch/blob/v0.3.1/tools/pytorch.version
> > [3] https://github.com/pytorch/pytorch/issues/5400#issuecomment-369428125
> >
> >
> > ............................................................
> > Hi Philipp,
> >
> > Thanks a lot for getting a discussion started. I've sunk 100+ hours over
> > the last two years into making PyTorch wheels play well with OpenCV,
> > TensorFlow and other wheels, so I'm glad to see this discussion happen.
> >
> >
> > For the PyTorch wheels, we have been shipping with the minimum glibc and
> > libstdc++ versions we can possibly work with, while keeping two hard
> > constraints:
> >
> > 1. CUDA support
> > 2. C++11 support
> >
> >
> > 1. CUDA support
> >
> > manylinux1 is not an option, considering CUDA doesn't work on CentOS5.
> > I explored this option [1] without success.
> >
> > manylinux2010 is an option at the moment wrt CUDA, but it's unclear when
> > NVIDIA will drop CentOS6 support out from under us.
> > Additionally, CuDNN 7.0 (if I remember correctly) was compiled against
> > Ubuntu 12.04 (meaning its glibc requirement is newer than what CentOS6
> > provides), and binaries linked against CuDNN refused to run on CentOS6.
> > I requested that this constraint be lifted, and the next dot release
> > fixed it.
> >
> > The reason PyTorch binaries are not manylinux2010 compatible at the
> > moment is the next constraint: C++11.
> >
> > 2. C++11
> >
> > We picked C++11 as the minimum supported dialect for PyTorch, primarily
> > to serve the default compilers of older machines, i.e. Ubuntu 14.04 and
> > CentOS7. The newer options were C++14 / C++17, but we decided to polyfill
> > what we needed in order to support older distros better.
> >
> > A fully fleshed-out C++11 implementation landed in gcc in various stages,
> > with gradual ABI changes [2]. Unfortunately, the libstdc++ that ships
> > with CentOS6 (and hence manylinux2010) isn't sufficient to cover all of
> > C++11. For example, the binaries we built with devtoolset3 (gcc 4.9.2) on
> > CentOS6 didn't run with the default libstdc++ on CentOS6, either due to
> > ABI changes or because the minimum GLIBCXX version for some of the
> > symbols was unavailable.
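> >
> > A handy diagnostic here: `strings libstdc++.so.6 | grep GLIBCXX` lists
> > the symbol versions a given libstdc++ provides. And a tiny probe along
> > the lines of the sketch below, compiled with the toolchain in question,
> > shows what that toolchain targets (the macros are real gcc/libstdc++
> > ones; the file name is made up):
> >
> >     // abi_probe.cpp -- build with: g++ abi_probe.cpp -o abi_probe
> >     #include <cstdio>
> >     int main() {
> >     #ifdef __GLIBCXX__
> >         // libstdc++ release date stamp, e.g. 20150623
> >         std::printf("__GLIBCXX__: %d\n", __GLIBCXX__);
> >     #endif
> >     #ifdef _GLIBCXX_USE_CXX11_ABI
> >         // which of gcc's dual ABIs std::string uses (0 = old, 1 = new)
> >         std::printf("_GLIBCXX_USE_CXX11_ABI: %d\n", _GLIBCXX_USE_CXX11_ABI);
> >     #endif
> >         return 0;
> >     }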
> >
> > We tried our best to support our binaries running on CentOS6 and above
> > with various static linking hacks until 0.3.1 (January 2018), but at
> > some point the accumulated hacks were only getting more fragile. Hence
> > we moved to a CentOS7-based image in April 2018 [3], and now rely only
> > on dynamic linking against the system-shipped libstdc++.
> >
> > As Wes mentions [4], hosting a modern C++ standard library via PyPI is
> > an option that would put manylinux2010 on the table. There are, however,
> > subtle consequences to this -- if such a package gets installed into a
> > conda environment, it'll clobber the Anaconda-shipped libstdc++,
> > possibly corrupting environments for thousands of Anaconda users (this
> > is similar to the issues with `mkl` shipped via PyPI and conda
> > clobbering each other).
> >
> >
> > References:
> >
> > [1] https://github.com/NVIDIA/nvidia-docker/issues/348
> > [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
> > [3] https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
> > [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
> >
> >
> > ............................................................
> >
> > On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Reposting since I wasn't subscribed to develop...@tensorflow.org. I
> > > also didn't see Soumith's response since it didn't come through to
> > > dev@arrow.apache.org
> > >
> > > In response to the non-conforming ABI in the TF and PyTorch wheels, we
> > > have attempted to hack around the issue with some elaborate
> > > workarounds [1] [2] that have ultimately proved not to work
> > > universally. The bottom line is that this is burdening other projects
> > > in the Python ecosystem and causing confusing application crashes.
> > >
> > > First, to state what should hopefully be obvious to many of you: Python
> > > wheels are not a robust way to deploy complex C++ projects, even
> > > setting aside the compiler toolchain issue. If a project has
> > > non-trivial third party dependencies, you either have to statically
> > > link them or bundle shared libraries with the wheel (we do a bit of
> > > both in Apache Arrow). Neither solution is foolproof in all cases.
> > > There are other downsides to wheels when it comes to numerical
> > > computing -- it is difficult to utilize things like the Intel MKL
> > > which may be used by multiple projects. If two projects have the same
> > > third party C++ dependency (e.g. let's use gRPC or libprotobuf as a
> > > straw man example), it's hard to guarantee that versions or ABI will
> > > not conflict with each other.
> > >
> > > In packaging with conda, we pin all dependencies when building
> > > projects that depend on them, then package and deploy the dependencies
> > > as separate shared libraries instead of bundling. To resolve the need
> > > for newer compilers or newer C++ standard library, libstdc++.so and
> > > other system shared libraries are packaged and installed as
> > > dependencies. In manylinux1, the RedHat devtoolset compiler toolchain
> > > is used as it performs selective static linking of symbols to enable
> > > C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda
> > > environment functions as a sort of portable miniature Linux
> > > distribution.
> > >
> > > Given the current state of things, where using the TensorFlow and
> > > PyTorch wheels in the same process as other conforming manylinux1
> > > wheels is unsafe, it's hard to see how one can continue to recommend
> > > pip as a
> > > preferred installation path until the ABI problems are resolved. For
> > > example, "pip" is what is recommended for installing TensorFlow on
> > > Linux [3]. It's unclear that non-compliant wheels should be allowed in
> > > the package manager at all (I'm aware that verifying policy compliance
> > > was deemed not to be the responsibility of PyPI [4]).
> > >
> > > A couple of possible paths forward (there may be others):
> > >
> > > * Collaborate with the Python packaging authority to evolve the
> > > manylinux ABI to be able to produce compliant wheels that support the
> > > build and deployment requirements of these projects
> > > * Create a new ABI tag for CUDA/C++11-enabled Python wheels so that
> > > projects can ship packages that can be guaranteed to work properly
> > > with TF/PyTorch. This might require vendoring libstdc++ in some kind
> > > of "toolchain" wheel that projects using this new ABI can depend on
> > >
> > > Note that these toolchain and deployment issues are absent when
> > > building and deploying with conda packages, since build- and run-time
> > > dependencies can be pinned and shared across all the projects that
> > > depend on them, ensuring ABI cross-compatibility. It's great to have
> > > the convenience of "pip install $PROJECT", but I believe that these
> > > projects have outgrown the intended use for pip and wheel
> > > distributions.
> > >
> > > Until the ABI incompatibilities are resolved, I would encourage more
> > > prominent user documentation about the non-portability and potential
> > > for crashes with these Linux wheels.
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://github.com/apache/arrow/commit/537e7f7fd503dd920c0b9f0cef8a2de86bc69e3b
> > > [2]: https://github.com/apache/arrow/commit/e7aaf7bf3d3e326b5fe58d20f8fc45b5cec01cac
> > > [3]: https://www.tensorflow.org/install/
> > > [4]: https://www.python.org/dev/peps/pep-0513/#id50
> > > On Sat, Dec 15, 2018 at 11:25 PM Robert Nishihara
> > > <robertnishih...@gmail.com> wrote:
> > > >
> > > > On Sat, Dec 15, 2018 at 8:43 PM Philipp Moritz <pcmor...@gmail.com>
> > > > wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > As some of you know, there is a standard in Python called manylinux
> > > > > (https://www.python.org/dev/peps/pep-0513/) to package binary
> > > > > executables and libraries into a “wheel” in a way that allows the
> > > > > code to be run on a wide variety of Linux distributions. This is
> > > > > very convenient for Python users, since such libraries can be easily
> > > > > installed via pip.
> > > > >
> > > > > This standard is also important for a second reason: If many
> > > > > different wheels are used together in a single Python process,
> > > > > adhering to manylinux ensures that these libraries work together
> > > > > well and don’t trip on each other’s toes (this could easily happen
> > > > > if different versions of libstdc++ are used, for example). Therefore
> > > > > *even if support for only a single distribution like Ubuntu is
> > > > > desired*, it is important to be manylinux compatible to make sure
> > > > > everybody’s wheels work together well.
> > > > >
> > > > > TensorFlow and PyTorch unfortunately don’t produce manylinux
> > > > > compatible wheels. The challenge is due, at least in part, to the
> > > > > need to use nvidia-docker to build GPU binaries [10]. This causes
> > > > > various levels of pain for the rest of the Python community, see for
> > > > > example [1] [2] [3] [4] [5] [6] [7] [8].
> > > > >
> > > > > The purpose of this e-mail is to get a discussion started on how we
> > > > > can make TensorFlow and PyTorch manylinux compliant. There is a new
> > > > > standard in the works [9], so hopefully we can discuss what would be
> > > > > necessary to make sure TensorFlow and PyTorch can adhere to this
> > > > > standard in the future.
> > > > >
> > > > > It would make everybody’s lives just a little bit better! Any ideas
> > > > > are appreciated.
> > > > >
> > > > > @soumith: Could you cc the relevant list? I couldn't find a pytorch
> > > > > dev mailing list.
> > > > >
> > > > > Best,
> > > > > Philipp.
> > > > >
> > > > > [1] https://github.com/tensorflow/tensorflow/issues/5033
> > > > > [2] https://github.com/tensorflow/tensorflow/issues/8802
> > > > > [3] https://github.com/primitiv/primitiv-python/issues/28
> > > > > [4] https://github.com/zarr-developers/numcodecs/issues/70
> > > > > [5] https://github.com/apache/arrow/pull/3177
> > > > > [6] https://github.com/tensorflow/tensorflow/issues/13615
> > > > > [7] https://github.com/pytorch/pytorch/issues/8358
> > > > > [8] https://github.com/ray-project/ray/issues/2159
> > > > > [9] https://www.python.org/dev/peps/pep-0571/
> > > > > [10] https://github.com/tensorflow/tensorflow/issues/8802#issuecomment-291935940