Thank you for your replies, Robert and Valentyn. The concern about slow pipeline submission is very helpful; I had not taken it into account. It seems that using .tar.gz as the sdist format would be fine, but introducing PEP 517 to Beam is likely to have a negative impact on users, so it needs more consideration.
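For anyone who wants to reproduce the slowdown locally, here is a minimal sketch of the download step described in Valentyn's message quoted below, together with the requirements filtering he lists as one possible avenue. It only builds the pip command shown in his message and drops apache-beam entries from a list of requirement lines. The function names (build_sdist_download_command, filter_out_beam) are hypothetical and are not Beam's stager API, and the name parsing is a simplification that only handles plain requirement lines.

```python
import re
import sys


def build_sdist_download_command(requirement, dest="/tmp"):
    """Return the pip invocation quoted in the thread as an argument list.

    Hypothetical helper for illustration; Beam's real logic lives in stager.py.
    """
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", dest,
        requirement,
        "--no-deps",
        "--no-binary", ":all:",
    ]


def _project_name(requirement_line):
    """Extract a normalized project name from a simple requirement line."""
    # Keep everything before the first extra, version specifier, marker or space.
    name = re.split(r"[\s\[<>=!~;@]", requirement_line.strip(), maxsplit=1)[0]
    # Treat runs of "-", "_" and "." as equal and ignore case, so that
    # apache_beam and Apache-Beam both match apache-beam.
    return re.sub(r"[-_.]+", "-", name).lower()


def filter_out_beam(requirement_lines):
    """Drop apache-beam entries; keep comments, blank lines and everything else."""
    kept = []
    for line in requirement_lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            if _project_name(stripped) == "apache-beam":
                continue
        kept.append(line)
    return kept
```

Passing the returned list to subprocess.run for a package such as pyarrow==5.0.0 should exercise the metadata step that Valentyn reports as slow; I have not measured it myself yet.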
I will investigate further and post an update if I find a good solution.

Thanks,
yoshiki

On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev <[email protected]> wrote:
>
> Hi Yoshiki,
>
> it should be fine to use tar.gz. There may be some Beam release-related
> scripts we need to update that expect a .zip. Also I noticed that numpy uses
> .zip for its sdist[1], and they also use pyproject.toml[2], so I don't know
> if the requirement to use .tar.gz sdist in PEP-517 is a hard requirement, but
> we can try to confirm. Perhaps it matters when some particular hooks of
> PEP-517 are defined.
>
> I would like to discuss further your suggestion to use PEP-517 in Beam and
> potential implications. For the benefit of others on this thread, for a
> simple explanation of PEP-517, see [3].
>
> I noticed that when several other projects switched to using PEP-517 (numpy,
> pyarrow), it impacted the pipeline submission experience for Beam users. To
> make sure that the pipeline execution environment has necessary pipeline
> dependencies, Beam downloads distributions for packages (and their transitive
> dependencies) that users supply in requirements.txt [4], and stages them to
> the execution environment[5]. Specifically, Beam downloads and stages source
> distributions (aka sdists) of the packages. We download sdists, since the
> target execution platform on the runner is not known to the SDK, so staging
> binary distributions (aka bdists, wheels) was never supported (BEAM-4032
> [6]). It turns out that downloading a source distribution of a library from
> PyPI via
> `python -m pip download --dest /tmp pyarrow==5.0.0 --no-deps --no-binary :all:`
> makes pip also do a metadata check to verify that the sdist that was
> downloaded is indeed the package that was requested [7]. For libraries
> packaged with setuptools / prior to PEP-517, this involved a call
> `python setup.py egg_info` [8], which is fairly fast.
> However, for libraries packaged using pyproject.toml/PEP-517, getting the
> package metadata involves installing build dependencies, and potentially
> building a wheel and extracting the metadata from a wheel. If a library has C
> extensions, this step involves compilation, which can be quite slow and
> require dev header packages to be installed on the OS. I am not sure whether
> the slowdown happens on every project (with extensions) that adopts PEP-517,
> or only if the project is somehow misconfigured; I asked on [9]. So far I
> observed such a slowdown with numpy and pyarrow[10].
>
> Users have brought this up to the pip maintainers. For various reasons, this
> has proved not easy to address, and long term solutions are still in the
> works [11, 12, 13].
>
> What does it mean for Beam:
>
> 1) With more Python packages adopting PEP-517, users may be affected by slow
> pipeline submission time if they require certain packages in
> requirements.txt that take a long time to download+build.
> 2) There is a possibility that adopting PEP-517 in Beam will increase
> pipeline submission time due to the slowness of the pip download command,
> because we download Beam SDK and stage it to the runner during job
> submission[14]. Given we don't know the target platform for sure, we stage
> both an sdist and a bdist [14]. The platform selected for bdist matches the
> platform of Beam's default containers, but with custom containers we can't
> guarantee the target platform will always match a predefined default, so we
> also pass an sdist.
> 3) Users sometimes include apache-beam in their pipelines' requirements.txt.
> Although not necessary, this contributes to slowdowns because numpy and
> pyarrow are Beam's dependencies, and they end up being downloaded.
>
> I am not against using PEP-517, it seems to have been written for good
> reason, but we should prevent slowdown in pipeline submissions.
>
> I am curious what the community thinks about ways to adopt it.
> Possible avenues:
>
> - Provide some information to the SDK about the runner's platform at pipeline
> submission, and stage only binary packages to the runner whenever possible
> [6]. Pip is only slow to download sdists, downloading bdists is fast. Also
> installing bdists on the runner would be much faster.
> - Rethink dependency staging. Avoid staging the SDK, and/or dependency
> package sdists via container prebuilding workflow[15] by default for all
> runners that use containers. During prebuilding, install packages directly on
> the containers with `pip install`, and avoid the `pip download` step. Filter
> out apache_beam from the requirements.txt file when users add it there.
> - Wait until solutions for [11] become available or get involved to help move
> that forward.
> - Switch to downloading sdists over HTTP instead of `pip download` [16], and
> fall back to pip download if not successful.
> - There may be some way to configure PEP-517 properly that avoids the
> slowdown, asked on [9].
>
> Thanks,
> Valentyn
>
> [1]
> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
> [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
> [3] https://snarky.ca/clarifying-pep-518
> [4]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
> [5]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
> [6] https://issues.apache.org/jira/browse/BEAM-4032
> [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
> [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
> [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
> [10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
> [11] https://github.com/pypa/pip/issues/1884
> [12]
> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
> [13]
> https://github.com/pypa/pip/issues/10195
> [14]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
> [15]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
> [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>
>
> On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <[email protected]> wrote:
>>
>> That's fine by me. The only advantage I can think of for .zip is that
>> it's (generally) better supported on Windows, but as far as I know
>> .tar.gz works on Windows just fine for python package distribution.
>>
>> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <[email protected]> wrote:
>> >
>> > Hello everyone,
>> >
>> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
>> > python tests.
>> > Concerning this issue, I want opinions about using .tar.gz as sdist format.
>> >
>> > Introducing tox isolated_build leads to replacing
>> > build-requirements.txt with pyproject.toml[2], and we should use
>> > pyproject.toml when creating sdist because we install dependencies
>> > with build-requirements.txt before calling "python setup.py sdist".
>> > PEP 517 based build tools like pypa/build will help to do so, but it
>> > does not allow .zip as sdist format[3].
>> > Therefore I think it would be better to switch sdist format to .tar.gz
>> > when starting to use pyproject.toml.
>> >
>> > Are there any obstacles to using .tar.gz?
>> > Please let me know the background of adopting .zip as the Beam sdist
>> > format (I could not find discussions about this).
>> >
>> > Regards,
>> > yoshiki
>> >
>> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>> > [2]
>> > https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions
