- The purpose of install_requires in setup.py is to define the maximally permissive set of requirements for a package[1]. We don't pin a version in setup.py without a strong reason, instead we typically pick up a lower bound we have tested, and set an upper bound to be next major version. - The purpose of requirements.txt is to force pip to properly resolve dependencies, and create a reproducible execution environment, since pip doesn’t have true dependency resolution [2]
We currently regularly upgrade setup.py dependencies, but do not update base_image_requirements.txt. Also, our requirements file is not exhaustive. We need to introduce a process to fix this. I would recommend that in new process: - All dependencies of Beam Python SDK, including transitive dependencies, are listed in base_image_requirements.txt (or another requirements file). "Explicit is better than implicit." - Requirements file is regenerated whenever setup.py changes.. - When we build a container image, we check that the final image has exactly the same versions of dependencies that were spelled out in requirements file (no versions added, or changed) - We also check that there are no dependency conflicts (they typically look like: Package X requires version A of dependency Y, but you will have B, which is incompatible). - We update the versions of pinned dependencies periodically. We may want to put all dependencies of SDK harness containers on the radar of Beam dependency checker. Valentyn [1] https://packaging.python.org/discussions/install-requires-vs-requirements/ [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote: > > > On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]> wrote: > >> Thanks for the reply, I added some responses inline. >> >> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote: >> > >> > There is a value in explicitly pinning the dependencies to be used in >> the containers: >> > - It reproducibly produces the same container. This will be important >> once we start release Beam container. By looking at a Beam release branch, >> one could exactly figure out the set of dependencies available in a >> released container. >> > >> > - Package manager (pip) in this is notorious about resolving versions >> for sub-dependencies, and sometimes picks incompatible dependencies. >> > - SImilarly this repdocubility is helpful with tests, that can work on >> the same container version unaffected by numerous dependency version >> changes happening in sub-dependecies. >> >> I thought this may be the reason, it definitely makes sense to pin >> dependencies for reproducible builds. If that is the case though I >> think we should at least change the comment in >> base_image_requirements.txt to say so (I'm happy to write a patch for >> that). >> > > Sounds good. > > >> >> > In addition, I will argue that we will want to keep a form of >> base_image_requirements.txt, in order to be able to install additional >> dependencies on the container and we would not be able to get rid of this >> mechanism. >> >> My PR doesn't remove base_image_requirements.txt. It keeps the file >> there but removes all the pinned beam dependencies, and instead just >> lists the beam tar.gz as a dependency directly, along with all of the >> additional dependencies. >> > > Got it, thanks for the clarification. > > >> >> > >> > >> > I will suggest, If possible ensure that both files are modified >> synchronously. This might be possible with precommit hooks although I am >> not familiar. >> >> Yeah setting up something like that would help, but surely there's a >> better way to lock dependencies than to count on people manually >> updating this file (even with a precommit reminding them)? I'm not >> that familiar with the complexities of python package management, but >> it seems like we should be able to set up something with pip freeze or >> pipenv's Pipfile.lock. The thing that confounds me when I try to think >> through that setup though is how to reference a development build of >> beam. If I run pip freeze in the container the requirement for >> apache-beam is "2.16.0.dev0" which of course doesn't resolve when you >> turn around and try to install it. Maybe an automated solution would >> just replace the apache-beam line with >> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]? >> > > I agree, this probably could be automated in a better way. > > I believe just doing "pip > install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning > will prevent a later error for "pip install apache-beam==2.16.0.dev0". So > we could have the former line in the docker file before installing the > requirements. > > >> >> > >> > >> > Ahmet >> > >> > >> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]> >> wrote: >> >> >> >> I recently ran into a portable python precommit failure that led me to >> discover that python dependencies for the container are defined in two >> different places, in slightly different ways: in setup.py with version >> ranges [1], and in a base_image_requirements.txt file in the container >> directory with pinned versions [2]. The latter includes some dependencies >> that are commonly used in user code, in addition to the Beam dependencies, >> and states that "Specifying the versions manually helps to resolve >> dependency conflicts with other packages installed in the container." >> >> >> >> But isn't the purpose of a package manager to resolve those conflicts? >> It seems that we should be able to just list the beam tar.gz as a >> dependency and let pip resolve it. I wrote up a patch [3] to test this >> approach and it seems to work. The only issue I ran into is that cython >> _does_ needs to be installed separately before beam so that it can be used >> when installing beam. Would this be a reasonable approach to remove the >> duplicate dependency specifications? >> >> >> >> Apologies if I'm rehashing an old discussion, if there's already an ML >> thread or PR discussing this issue I'd be happy to take a look at it. >> >> >> >> Thanks, >> >> Brian >> >> >> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py >> >> [2] >> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt >> >> [3] https://github.com/apache/beam/pull/9236 >> >> >> >> >> >> >> >> PS - What caused my precommit failure: >> >> I tested out the portable runner locally to debug the issue. It worked >> fine with the loopback worker, but failed when using the docker worker. I >> eventually determined the actual cause: I unintentionally relied on a >> protobuf 3.8.0 feature in the code I added. The setup.py requirement, used >> locally in the loopback worker, specified a range requirement which >> resolved to 3.9.0, but the dependency for the container was pinned to >> 3.6.1. I've proposed to resolve the issue by just bumping the lower bound >> of the range in setup.py to 3.8.0, to reflect the new requirement. >> >
