On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <valen...@google.com>
wrote:

> - The purpose of install_requires in setup.py is to define the maximally
> permissive set of requirements for a package [1]. We don't pin a version
> in setup.py without a strong reason; instead, we typically pick a lower
> bound we have tested and set the upper bound to the next major version.
> - The purpose of requirements.txt is to force pip to properly resolve
> dependencies and create a reproducible execution environment, since pip
> doesn't have true dependency resolution [2].
>
> We currently upgrade the setup.py dependencies regularly, but do not
> update base_image_requirements.txt. Also, our requirements file is not
> exhaustive.
> We need to introduce a process to fix this.
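
To make the two roles concrete, here is an illustrative pair of entries
(the package name and versions are invented for the example):

```python
# Illustrative only -- the package name and versions are made up.

# setup.py style: a tested lower bound, with the next major
# version as the upper bound (maximally permissive).
SETUP_PY_SPEC = "protobuf>=3.8.0,<4"

# base_image_requirements.txt style: an exact pin, so pip resolves
# to the same version every time the image is built.
REQUIREMENTS_TXT_SPEC = "protobuf==3.9.0"
```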
>

base_image_requirements.txt currently does not drive any released Beam
artifact. It would make sense to address this as part of the SDKHarness
container image release process. A similar mechanism might make sense for
figuring out what goes into containers for other SDKs. (Perhaps add your
proposal to
https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
)


> I would recommend that in the new process:
> - All dependencies of Beam Python SDK, including transitive
> dependencies, are listed in base_image_requirements.txt (or another
> requirements file). "Explicit is better than implicit."
> - The requirements file is regenerated whenever setup.py changes.
>
> - When we build a container image, we check that the final image has
> exactly the same versions of dependencies that were spelled out in the
> requirements file (no versions added or changed).
>

The above could be done in a two-step process:
- A human-curated requirements file, like base_image_requirements.txt
today, with a set of hand-picked requirements.
- Changes to the first file would produce a generated file with all
transitive dependencies pinned. The generated file could serve as the
source of truth for all dependencies at a particular commit, and could be
used for the container builds.
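
The "no versions added or changed" check could then be sketched roughly
like this (a simplified sketch that only understands plain name==version
lines, not extras, environment markers, or VCS URLs):

```python
def image_matches_requirements(freeze_output: str, requirements_text: str) -> bool:
    """Return True iff the packages installed in the image (as reported
    by `pip freeze`) are exactly the pins in the generated requirements
    file: no versions added, removed, or changed."""
    def parse_pins(text: str) -> dict:
        pins = {}
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments
            if "==" in line:
                name, _, version = line.partition("==")
                pins[name.strip().lower()] = version.strip()
        return pins

    return parse_pins(freeze_output) == parse_pins(requirements_text)
```

The container build would run `pip freeze` in the finished image and fail
if this returns False.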


> - We also check that there are no dependency conflicts (they typically
> look like: Package X requires version A of dependency Y, but you will have
> B, which is incompatible).
>

+1. Container build could verify this.
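
For example, pip's own `pip check` command reports exactly these
conflicts (lines like "X 1.0 has requirement Y<2, but you have Y 2.1"),
so the build could simply fail on a non-zero exit code. A rough sketch:

```python
import subprocess
import sys

# Run `pip check` in the image being built; a non-zero exit code
# means pip found packages with incompatible requirements.
result = subprocess.run(
    [sys.executable, "-m", "pip", "check"],
    capture_output=True,
    text=True,
)
conflicts_found = result.returncode != 0
if conflicts_found:
    # In a container build this would fail the build.
    print("dependency conflicts detected:\n" + result.stdout)
```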


> - We update the versions of pinned dependencies periodically. We may want
> to put all dependencies of SDK harness containers on the radar of Beam
> dependency checker.
>

+1, but as a separate process.

This process does not address how to keep setup.py in sync with the
requirements file. We could use git hooks to ensure that both files are
changed at the same time.
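
A pre-commit hook could run a check along these lines (the file paths
are assumptions about the repository layout):

```python
def needs_requirements_regen(staged_files):
    """Given the list of staged file paths (e.g. the output of
    `git diff --cached --name-only`), return True when setup.py was
    modified without a matching base_image_requirements.txt change."""
    setup_changed = any(
        f.endswith("sdks/python/setup.py") for f in staged_files)
    reqs_changed = any(
        f.endswith("base_image_requirements.txt") for f in staged_files)
    return setup_changed and not reqs_changed
```

The hook would exit non-zero, with a reminder to regenerate the
requirements file, whenever this returns True.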


> Valentyn
>
> [1]
> https://packaging.python.org/discussions/install-requires-vs-requirements/
> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>
> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <bhule...@google.com> wrote:
>>
>>> Thanks for the reply, I added some responses inline.
>>>
>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <al...@google.com> wrote:
>>> >
>>> > There is value in explicitly pinning the dependencies to be used in
>>> > the containers:
>>> > - It reproducibly produces the same container. This will be important
>>> > once we start releasing Beam containers. By looking at a Beam release
>>> > branch, one could figure out exactly which dependencies are available
>>> > in a released container.
>>> > - The package manager (pip) is notoriously bad at resolving versions
>>> > for sub-dependencies, and sometimes picks incompatible ones.
>>> > - Similarly, this reproducibility helps with tests, which can run
>>> > against the same container version, unaffected by the numerous
>>> > dependency version changes happening in sub-dependencies.
>>>
>>> I thought this might be the reason; it definitely makes sense to pin
>>> dependencies for reproducible builds. If that is the case, though, I
>>> think we should at least change the comment in
>>> base_image_requirements.txt to say so (I'm happy to write a patch for
>>> that).
>>>
>>
>> Sounds good.
>>
>>
>>>
>>> > In addition, I will argue that we will want to keep some form of
>>> > base_image_requirements.txt in order to install additional
>>> > dependencies in the container; we would not be able to get rid of
>>> > this mechanism.
>>>
>>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>>> there but removes all the pinned beam dependencies, and instead just
>>> lists the beam tar.gz as a dependency directly, along with all of the
>>> additional dependencies.
>>>
>>
>> Got it, thanks for the clarification.
>>
>>
>>>
>>> >
>>> >
>>> > I will suggest, if possible, ensuring that both files are modified
>>> > synchronously. This might be possible with precommit hooks, although
>>> > I am not familiar with them.
>>>
>>> Yeah setting up something like that would help, but surely there's a
>>> better way to lock dependencies than to count on people manually
>>> updating this file (even with a precommit reminding them)? I'm not
>>> that familiar with the complexities of python package management, but
>>> it seems like we should be able to set up something with pip freeze or
>>> pipenv's Pipfile.lock. The thing that confounds me when I try to think
>>> through that setup though is how to reference a development build of
>>> beam. If I run pip freeze in the container the requirement for
>>> apache-beam is "2.16.0.dev0" which of course doesn't resolve when you
>>> turn around and try to install it. Maybe an automated solution would
>>> just replace the apache-beam line with
>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>>
>>
>> I agree, this probably could be automated in a better way.
>>
>> I believe just doing
>> "pip install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the
>> beginning will prevent a later error from
>> "pip install apache-beam==2.16.0.dev0". So we could put the former line
>> in the Dockerfile before installing the requirements.
>>
>>
>>>
>>> >
>>> >
>>> > Ahmet
>>> >
>>> >
>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <bhule...@google.com>
>>> wrote:
>>> >>
>>> >> I recently ran into a portable python precommit failure that led me
>>> >> to discover that python dependencies for the container are defined
>>> >> in two different places, in slightly different ways: in setup.py
>>> >> with version ranges [1], and in a base_image_requirements.txt file
>>> >> in the container directory with pinned versions [2]. The latter
>>> >> includes some dependencies that are commonly used in user code, in
>>> >> addition to the Beam dependencies, and states that "Specifying the
>>> >> versions manually helps to resolve dependency conflicts with other
>>> >> packages installed in the container."
>>> >>
>>> >> But isn't the purpose of a package manager to resolve those
>>> >> conflicts? It seems that we should be able to just list the beam
>>> >> tar.gz as a dependency and let pip resolve it. I wrote up a patch
>>> >> [3] to test this approach and it seems to work. The only issue I ran
>>> >> into is that cython _does_ need to be installed separately before
>>> >> beam so that it can be used when installing beam. Would this be a
>>> >> reasonable approach to remove the duplicate dependency
>>> >> specifications?
>>> >>
>>> >> Apologies if I'm rehashing an old discussion; if there's already an
>>> >> ML thread or PR discussing this issue I'd be happy to take a look at
>>> >> it.
>>> >>
>>> >> Thanks,
>>> >> Brian
>>> >>
>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>> >> [3] https://github.com/apache/beam/pull/9236
>>> >>
>>> >>
>>> >>
>>> >> PS - What caused my precommit failure:
>>> >> I tested out the portable runner locally to debug the issue. It
>>> >> worked fine with the loopback worker, but failed when using the
>>> >> docker worker. I eventually determined the actual cause: I
>>> >> unintentionally relied on a protobuf 3.8.0 feature in the code I
>>> >> added. The setup.py requirement, used locally in the loopback
>>> >> worker, specified a range which resolved to 3.9.0, but the
>>> >> dependency for the container was pinned to 3.6.1. I've proposed
>>> >> resolving the issue by bumping the lower bound of the range in
>>> >> setup.py to 3.8.0, to reflect the new requirement.
>>>
>>
