On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]> wrote:
> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>
>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> - The purpose of install_requires in setup.py is to define the
>>> maximally permissive set of requirements for a package [1]. We don't
>>> pin a version in setup.py without a strong reason; instead, we
>>> typically pick a lower bound we have tested and set the upper bound
>>> to the next major version.
>>> - The purpose of requirements.txt is to force pip to properly resolve
>>> dependencies and to create a reproducible execution environment,
>>> since pip doesn't have true dependency resolution [2].
>>>
>>> We currently upgrade setup.py dependencies regularly, but we do not
>>> update base_image_requirements.txt. Also, our requirements file is
>>> not exhaustive. We need to introduce a process to fix this.
>>
>> base_image_requirements.txt currently does not drive any released Beam
>> artifact. It would make sense to address this as part of the
>> SDKHarness container image release process. A similar mechanism might
>> make sense for figuring out what goes into containers for other SDKs.
>> (Perhaps add your proposal to
>> https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process )
>
> +1. I will add notes to the wiki, thanks.
>
>>> I would recommend that in the new process:
>>> - All dependencies of the Beam Python SDK, including transitive
>>> dependencies, are listed in base_image_requirements.txt (or another
>>> requirements file). "Explicit is better than implicit."
>>> - The requirements file is regenerated whenever setup.py changes.
>>> - When we build a container image, we check that the final image has
>>> exactly the same versions of dependencies that were spelled out in
>>> the requirements file (no versions added or changed).
>>
>> The above could be done in a two-step process:
>> - A human-generated requirements file, like base_image_requirements
>> today, with a set of curated requirements.
>> - Changes to the first file would result in a generated file with all
>> transitive dependencies. The second file could be used as the source
>> of truth for all dependencies at a particular commit. The generated
>> file could be used for the container builds.
>>
>>> - We also check that there are no dependency conflicts (they
>>> typically look like: package X requires version A of dependency Y,
>>> but you will have B, which is incompatible).
>>
>> +1. The container build could verify this.
>>
>>> - We update the versions of pinned dependencies periodically. We may
>>> want to put all dependencies of SDK harness containers on the radar
>>> of the Beam dependency checker.
>>
>> +1, but as a separate process.
>>
>> This process does not address how to keep setup.py in sync with the
>> requirements file. We could use git hooks to ensure that the files
>> are changed at the same time.
>
> In addition to git hooks, we could try to check that once we install
> Apache Beam (after installing the requirements file dependencies), we
> don't pull anything from PyPI. We could rely on pip output or cut
> internet access to enforce this.

Good idea. This also represents an interesting use case where pipelines
run on a private network with no PyPI access.
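Concretely, the container-build verification discussed above could look
something like this rough sketch (the helper names and requirements path
are hypothetical, not existing Beam tooling):

    # Sketch: verify that the container's installed packages exactly
    # match the pinned requirements file, then let "pip check" flag
    # dependency conflicts. Names and file path are illustrative only.
    import subprocess

    def parse_pins(lines):
        """Extract {name: version} from "name==version" lines."""
        pins = {}
        for line in lines:
            line = line.split("#", 1)[0].strip()
            if "==" in line:
                name, version = line.split("==", 1)
                pins[name.lower()] = version
        return pins

    def verify_container(requirements_path):
        with open(requirements_path) as f:
            expected = parse_pins(f)
        frozen = subprocess.run(["pip", "freeze"], capture_output=True,
                                text=True, check=True).stdout.splitlines()
        actual = parse_pins(frozen)
        for name, version in expected.items():
            if actual.get(name) != version:
                raise RuntimeError("%s: expected %s, found %s"
                                   % (name, version, actual.get(name)))
        # "pip check" exits non-zero on "package X requires version A of
        # dependency Y, but you will have B"-style conflicts.
        subprocess.run(["pip", "check"], check=True)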
>>> Valentyn
>>>
>>> [1] https://packaging.python.org/discussions/install-requires-vs-requirements/
>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>
>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for the reply, I added some responses inline.
>>>>>
>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>> There is value in explicitly pinning the dependencies to be used
>>>>>> in the containers:
>>>>>> - It reproducibly produces the same container. This will be
>>>>>> important once we start releasing Beam containers. By looking at
>>>>>> a Beam release branch, one could figure out exactly the set of
>>>>>> dependencies available in a released container.
>>>>>> - The package manager (pip) is notorious for how it resolves
>>>>>> versions of sub-dependencies, and sometimes picks incompatible
>>>>>> dependencies.
>>>>>> - Similarly, this reproducibility is helpful with tests, which
>>>>>> can work on the same container version unaffected by the numerous
>>>>>> version changes happening in sub-dependencies.
>>>>>
>>>>> I thought this may be the reason; it definitely makes sense to pin
>>>>> dependencies for reproducible builds. If that is the case, though,
>>>>> I think we should at least change the comment in
>>>>> base_image_requirements.txt to say so (I'm happy to write a patch
>>>>> for that).
>>>>
>>>> Sounds good.
>>>>
>>>>>> In addition, I will argue that we will want to keep a form of
>>>>>> base_image_requirements.txt in order to be able to install
>>>>>> additional dependencies on the container, so we would not be able
>>>>>> to get rid of this mechanism.
>>>>>
>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the
>>>>> file there but removes all the pinned Beam dependencies, and
>>>>> instead just lists the Beam tar.gz as a dependency directly, along
>>>>> with all of the additional dependencies.
>>>>
>>>> Got it, thanks for the clarification.
>>>>
>>>>>> I will suggest, if possible, ensuring that both files are
>>>>>> modified synchronously. This might be possible with precommit
>>>>>> hooks, although I am not familiar with them.
>>>>>
>>>>> Yeah, setting up something like that would help, but surely
>>>>> there's a better way to lock dependencies than to count on people
>>>>> manually updating this file (even with a precommit reminding
>>>>> them)? I'm not that familiar with the complexities of Python
>>>>> package management, but it seems like we should be able to set up
>>>>> something with pip freeze or pipenv's Pipfile.lock. The thing that
>>>>> confounds me when I try to think through that setup, though, is
>>>>> how to reference a development build of Beam. If I run pip freeze
>>>>> in the container, the requirement for apache-beam is "2.16.0.dev0",
>>>>> which of course doesn't resolve when you turn around and try to
>>>>> install it. Maybe an automated solution would just replace the
>>>>> apache-beam line with /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>>>
>>>> I agree, this probably could be automated in a better way.
>>>>
>>>> I believe just doing "pip install
>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
>>>> will prevent a later error for "pip install
>>>> apache-beam==2.16.0.dev0". So we could have the former line in the
>>>> Dockerfile before installing the requirements.
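A rough sketch of the automation Brian describes above (the tarball path
comes from the thread; the function name and flow are made up for
illustration):

    # Freeze the container's environment, but point the apache-beam pin
    # at the SDK tarball baked into the image, since a dev version like
    # "apache-beam==2.16.0.dev0" does not resolve against PyPI.
    import subprocess

    BEAM_TARBALL = "/opt/apache/beam/tars/apache-beam.tar.gz[gcp]"

    def freeze_with_local_beam(output_path):
        frozen = subprocess.run(["pip", "freeze"], capture_output=True,
                                text=True, check=True).stdout.splitlines()
        with open(output_path, "w") as f:
            for line in frozen:
                if line.startswith("apache-beam=="):
                    # Substitute the local tarball for the dev version.
                    f.write(BEAM_TARBALL + "\n")
                else:
                    f.write(line + "\n")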
>>>>>> Ahmet
>>>>>>
>>>>>> On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I recently ran into a portable Python precommit failure that led
>>>>>>> me to discover that Python dependencies for the container are
>>>>>>> defined in two different places, in slightly different ways: in
>>>>>>> setup.py with version ranges [1], and in a
>>>>>>> base_image_requirements.txt file in the container directory with
>>>>>>> pinned versions [2]. The latter includes some dependencies that
>>>>>>> are commonly used in user code, in addition to the Beam
>>>>>>> dependencies, and states that "Specifying the versions manually
>>>>>>> helps to resolve dependency conflicts with other packages
>>>>>>> installed in the container."
>>>>>>>
>>>>>>> But isn't the purpose of a package manager to resolve those
>>>>>>> conflicts? It seems that we should be able to just list the Beam
>>>>>>> tar.gz as a dependency and let pip resolve it. I wrote up a
>>>>>>> patch [3] to test this approach and it seems to work. The only
>>>>>>> issue I ran into is that cython _does_ need to be installed
>>>>>>> separately before Beam so that it can be used when installing
>>>>>>> Beam. Would this be a reasonable approach to remove the
>>>>>>> duplicate dependency specifications?
>>>>>>>
>>>>>>> Apologies if I'm rehashing an old discussion; if there's already
>>>>>>> an ML thread or PR discussing this issue, I'd be happy to take a
>>>>>>> look at it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>>>> [3] https://github.com/apache/beam/pull/9236
>>>>>>>
>>>>>>> PS - What caused my precommit failure: I tested out the portable
>>>>>>> runner locally to debug the issue. It worked fine with the
>>>>>>> loopback worker, but failed when using the Docker worker. I
>>>>>>> eventually determined the actual cause: I had unintentionally
>>>>>>> relied on a protobuf 3.8.0 feature in the code I added. The
>>>>>>> setup.py requirement, used locally in the loopback worker,
>>>>>>> specified a range which resolved to 3.9.0, but the dependency
>>>>>>> for the container was pinned to 3.6.1. I've proposed to resolve
>>>>>>> the issue by just bumping the lower bound of the range in
>>>>>>> setup.py to 3.8.0, to reflect the new requirement.
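For illustration, the range convention from earlier in the thread,
combined with the protobuf fix described in the PS, amounts to a setup.py
fragment along these lines (an example skeleton, not Beam's actual
setup.py):

    # Illustrative install_requires entry: a tested lower bound plus an
    # upper bound at the next major version. Bumping the lower bound to
    # 3.8.0 reflects the new protobuf feature dependency from the PS.
    from setuptools import setup

    setup(
        name="example-package",  # placeholder package metadata
        version="0.1.0",
        install_requires=[
            "protobuf>=3.8.0,<4",
        ],
    )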
