- The purpose of install_requires in setup.py is to define the maximally
permissive set of requirements for a package[1]. We don't pin a version in
setup.py without a strong reason, instead we typically pick up a lower
bound we have tested, and set an upper bound to be next major version.
- The purpose of requirements.txt is to force pip to properly resolve
dependencies, and create a reproducible execution environment, since pip
doesn’t have true dependency resolution [2]

We currently regularly upgrade setup.py dependencies, but do not update
base_image_requirements.txt. Also, our requirements file is not exhaustive.
We need to introduce a process to fix this. I would recommend that in new
process:
- All dependencies of Beam Python SDK, including transitive
dependencies, are listed in base_image_requirements.txt (or another
requirements file). "Explicit is better than implicit."
- Requirements file is regenerated whenever setup.py changes..
- When we build a container image, we check that the final image has
exactly the same versions of dependencies that were spelled out in
requirements file (no versions added, or changed)
- We also check that there are no dependency conflicts (they typically look
like: Package X requires version A of dependency Y, but you will have B,
which is incompatible).
- We update the versions of pinned dependencies periodically. We may want
to put all dependencies of SDK harness containers on the radar of Beam
dependency checker.

Valentyn

[1]
https://packaging.python.org/discussions/install-requires-vs-requirements/
[2] https://pip.pypa.io/en/stable/user_guide/#requirements-files

On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:

>
>
> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]> wrote:
>
>> Thanks for the reply, I added some responses inline.
>>
>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>> >
>> > There is a value in explicitly pinning the dependencies to be used in
>> the containers:
>> > - It reproducibly produces the same container. This will be important
>> once we start release Beam container. By looking at a Beam release branch,
>> one could exactly figure out the set of dependencies available in a
>> released container.
>> >
>> > - Package manager (pip) in this is notorious about resolving versions
>> for sub-dependencies, and sometimes picks incompatible dependencies.
>> > - SImilarly this repdocubility is helpful with tests, that can work on
>> the same container version unaffected by numerous dependency version
>> changes happening in sub-dependecies.
>>
>> I thought this may be the reason, it definitely makes sense to pin
>> dependencies for reproducible builds. If that is the case though I
>> think we should at least change the comment in
>> base_image_requirements.txt to say so (I'm happy to write a patch for
>> that).
>>
>
> Sounds good.
>
>
>>
>> > In addition, I will argue that we will want to keep a form of
>> base_image_requirements.txt, in order to be able to install additional
>> dependencies on the container and we would not be able to get rid of this
>> mechanism.
>>
>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>> there but removes all the pinned beam dependencies, and instead just
>> lists the beam tar.gz as a dependency directly, along with all of the
>> additional dependencies.
>>
>
> Got it, thanks for the clarification.
>
>
>>
>> >
>> >
>> > I will suggest, If possible ensure that both files are modified
>> synchronously. This might be possible with precommit hooks although I am
>> not familiar.
>>
>> Yeah setting up something like that would help, but surely there's a
>> better way to lock dependencies than to count on people manually
>> updating this file (even with a precommit reminding them)? I'm not
>> that familiar with the complexities of python package management, but
>> it seems like we should be able to set up something with pip freeze or
>> pipenv's Pipfile.lock. The thing that confounds me when I try to think
>> through that setup though is how to reference a development build of
>> beam. If I run pip freeze in the container the requirement for
>> apache-beam is "2.16.0.dev0" which of course doesn't resolve when you
>> turn around and try to install it. Maybe an automated solution would
>> just replace the apache-beam line with
>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>
>
> I agree, this probably could be automated in a better way.
>
> I believe just doing "pip
> install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
> will prevent a later error for "pip install apache-beam==2.16.0.dev0". So
> we could have the former line in the docker file before installing the
> requirements.
>
>
>>
>> >
>> >
>> > Ahmet
>> >
>> >
>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>> wrote:
>> >>
>> >> I recently ran into a portable python precommit failure that led me to
>> discover that python dependencies for the container are defined in two
>> different places, in slightly different ways: in setup.py with version
>> ranges [1], and in a base_image_requirements.txt file in the container
>> directory with pinned versions [2]. The latter includes some dependencies
>> that are commonly used in user code, in addition to the Beam dependencies,
>> and states that "Specifying the versions manually helps to resolve
>> dependency conflicts with other packages installed in the container."
>> >>
>> >> But isn't the purpose of a package manager to resolve those conflicts?
>> It seems that we should be able to just list the beam tar.gz as a
>> dependency and let pip resolve it. I wrote up a patch [3] to test this
>> approach and it seems to work. The only issue I ran into is that cython
>> _does_ needs to be installed separately before beam so that it can be used
>> when installing beam. Would this be a reasonable approach to remove the
>> duplicate dependency specifications?
>> >>
>> >> Apologies if I'm rehashing an old discussion, if there's already an ML
>> thread or PR discussing this issue I'd be happy to take a look at it.
>> >>
>> >> Thanks,
>> >> Brian
>> >>
>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>> >> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>> >> [3] https://github.com/apache/beam/pull/9236
>> >>
>> >>
>> >>
>> >> PS - What caused my precommit failure:
>> >> I tested out the portable runner locally to debug the issue. It worked
>> fine with the loopback worker, but failed when using the docker worker. I
>> eventually determined the actual cause: I unintentionally relied on a
>> protobuf 3.8.0 feature in the code I added. The setup.py requirement, used
>> locally in the loopback worker, specified a range requirement which
>> resolved to 3.9.0, but the dependency for the container was pinned to
>> 3.6.1. I've proposed to resolve the issue by just bumping the lower bound
>> of the range in setup.py to 3.8.0, to reflect the new requirement.
>>
>

Reply via email to