Added some ideas discussed here to the SDK container release process working doc <https://docs.google.com/document/d/1IKE_aEkrAzkzUE4pD_r_zVuL5amHGetJ1efnbTfmunM/edit#> that Hannah started.
On Tue, Aug 6, 2019 at 10:33 PM Sam Bourne <[email protected]> wrote:

> On Wed, Aug 7, 2019 at 1:53 AM Ahmet Altay <[email protected]> wrote:
>
>> On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]> wrote:
>>
>>> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>>
>>>>> - The purpose of install_requires in setup.py is to define the maximally permissive set of requirements for a package [1]. We don't pin a version in setup.py without a strong reason; instead, we typically pick a lower bound we have tested and set an upper bound at the next major version.
>>>>> - The purpose of requirements.txt is to force pip to properly resolve dependencies and create a reproducible execution environment, since pip doesn't have true dependency resolution [2].
>>>>>
>>>>> We currently upgrade setup.py dependencies regularly, but do not update base_image_requirements.txt. Also, our requirements file is not exhaustive. We need to introduce a process to fix this.
>>>>
>>>> base_image_requirements.txt currently does not drive any released Beam artifact. It would make sense to address this as part of the SDKHarness container image release process. A similar mechanism might make sense for figuring out what goes into containers for other SDKs. (Perhaps add your proposal to https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process)
>>>
>>> +1. I will add notes to the wiki, thanks.
>>>
>>>>> I would recommend that in the new process:
>>>>> - All dependencies of the Beam Python SDK, including transitive dependencies, are listed in base_image_requirements.txt (or another requirements file). "Explicit is better than implicit."
>>>>> - The requirements file is regenerated whenever setup.py changes.
>>>>> - When we build a container image, we check that the final image has exactly the same versions of dependencies that were spelled out in the requirements file (no versions added or changed).
>>>>
>>>> The above could be done in a two-step process:
>>>> - A human-curated requirements file, like base_image_requirements today.
>>>> - Changes to the first file would result in a generated file with all transitive dependencies. The second file could be used as the source of truth for all dependencies at a particular commit, and could drive the container builds.
>>>>
>>>>> - We also check that there are no dependency conflicts (they typically look like: package X requires version A of dependency Y, but you will have B, which is incompatible).
>>>>
>>>> +1. The container build could verify this.
>>>>
>>>>> - We update the versions of pinned dependencies periodically. We may want to put all dependencies of the SDK harness containers on the radar of the Beam dependency checker.
>>>>
>>>> +1, but as a separate process.
>>>>
>>>> This process does not address how to keep setup.py in sync with the requirements file. We could use git hooks to ensure that the files are changed at the same time.
>>>
>>> In addition to git hooks, we could try to check that once we install Apache Beam (after installing the requirements file dependencies), we don't pull anything from PyPI. We could rely on pip output, or cut internet access, to enforce this.
>>
>> Good idea. This also represents an interesting use case where pipelines run on a private network with no PyPI access.
>
> +1. For what it's worth, we intend to run Beam pipelines (via Flink) on a private network.
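As a rough illustration of the container check proposed above (that the built image contains exactly the pinned versions, with nothing added or changed), something like the following could compare `pip freeze` output against the generated requirements file. This is only a sketch, not an existing Beam script: `parse_pins` and `diff_pins` are made-up names, and a real implementation would also need to handle extras, environment markers, and non-`==` specifiers.

```python
# Hypothetical sketch: verify that the packages actually installed in a
# container (as reported by `pip freeze`) match a pinned requirements file
# exactly -- no versions added or changed.

def parse_pins(text):
    """Parse 'name==version' lines into a dict, ignoring comments/blanks."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.lower()] = version
    return pins

def diff_pins(required_text, frozen_text):
    """Return human-readable mismatches between requirements and pip freeze."""
    required = parse_pins(required_text)
    frozen = parse_pins(frozen_text)
    problems = []
    for name, version in sorted(frozen.items()):
        if name not in required:
            problems.append("unexpected package: %s==%s" % (name, version))
        elif required[name] != version:
            problems.append("version changed: %s is %s, expected %s"
                            % (name, version, required[name]))
    return problems

if __name__ == "__main__":
    required = "protobuf==3.6.1\ngrpcio==1.22.0\n"
    frozen = "grpcio==1.22.0\nprotobuf==3.9.0\nsix==1.12.0\n"
    for problem in diff_pins(required, frozen):
        print(problem)
```

In a container build, the `frozen` text would come from running `pip freeze` inside the final image; a non-empty result would fail the build.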
>>>>> Valentyn
>>>>>
>>>>> [1] https://packaging.python.org/discussions/install-requires-vs-requirements/
>>>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>>>
>>>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the reply, I added some responses inline.
>>>>>>>
>>>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>>>> >
>>>>>>> > There is value in explicitly pinning the dependencies to be used in the containers:
>>>>>>> > - It reproducibly produces the same container. This will be important once we start releasing Beam containers. By looking at a Beam release branch, one could figure out exactly which dependencies are available in a released container.
>>>>>>> > - The package manager (pip) is notorious for resolving versions of sub-dependencies poorly, and sometimes picks incompatible dependencies.
>>>>>>> > - Similarly, this reproducibility is helpful for tests, which can run on the same container version unaffected by the numerous version changes happening in sub-dependencies.
>>>>>>>
>>>>>>> I thought this may be the reason; it definitely makes sense to pin dependencies for reproducible builds. If that is the case, though, I think we should at least change the comment in base_image_requirements.txt to say so (I'm happy to write a patch for that).
>>>>>>
>>>>>> Sounds good.
>>>>>>
>>>>>>> > In addition, I will argue that we will want to keep a form of base_image_requirements.txt in order to be able to install additional dependencies on the container, so we would not be able to get rid of this mechanism.
>>>>>>>
>>>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the file there but removes all the pinned Beam dependencies, and instead just lists the Beam tar.gz as a dependency directly, along with all of the additional dependencies.
>>>>>>
>>>>>> Got it, thanks for the clarification.
>>>>>>
>>>>>>> > I will suggest, if possible, ensuring that both files are modified synchronously. This might be possible with precommit hooks, although I am not familiar with them.
>>>>>>>
>>>>>>> Yeah, setting up something like that would help, but surely there's a better way to lock dependencies than to count on people manually updating this file (even with a precommit reminding them)? I'm not that familiar with the complexities of Python package management, but it seems like we should be able to set up something with pip freeze or pipenv's Pipfile.lock. The thing that confounds me when I try to think through that setup, though, is how to reference a development build of Beam. If I run pip freeze in the container, the requirement for apache-beam is "2.16.0.dev0", which of course doesn't resolve when you turn around and try to install it. Maybe an automated solution would just replace the apache-beam line with /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>>>>>
>>>>>> I agree, this probably could be automated in a better way.
>>>>>>
>>>>>> I believe just doing "pip install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning will prevent a later error for "pip install apache-beam==2.16.0.dev0". So we could have the former line in the Dockerfile before installing the requirements.
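Brian's suggestion of automatically replacing the unresolvable dev pin could be sketched as a small post-processing step over `pip freeze` output. This is hypothetical: `rewrite_dev_beam_pin` is a made-up name, and the tarball path is the one quoted in the thread, not a guaranteed location in every image.

```python
# Hypothetical sketch: post-process `pip freeze` output so that the
# unreleased 'apache-beam==2.16.0.dev0' pin (which pip cannot resolve from
# PyPI) is replaced with the SDK tarball shipped inside the image.

TARBALL = "/opt/apache/beam/tars/apache-beam.tar.gz[gcp]"  # path from the thread

def rewrite_dev_beam_pin(frozen_text, tarball=TARBALL):
    lines = []
    for line in frozen_text.splitlines():
        if line.startswith("apache-beam=="):
            lines.append(tarball)  # install from the local sdist instead
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    frozen = "apache-beam==2.16.0.dev0\nprotobuf==3.9.0\n"
    print(rewrite_dev_beam_pin(frozen), end="")
```

Ahmet's alternative above (installing the tarball first, before the requirements file) avoids the rewrite entirely, since pip then sees the dev version as already satisfied.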
>>>>>>> > Ahmet
>>>>>>> >
>>>>>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >> I recently ran into a portable Python precommit failure that led me to discover that Python dependencies for the container are defined in two different places, in slightly different ways: in setup.py with version ranges [1], and in a base_image_requirements.txt file in the container directory with pinned versions [2]. The latter includes some dependencies that are commonly used in user code, in addition to the Beam dependencies, and states that "Specifying the versions manually helps to resolve dependency conflicts with other packages installed in the container."
>>>>>>> >>
>>>>>>> >> But isn't the purpose of a package manager to resolve those conflicts? It seems that we should be able to just list the Beam tar.gz as a dependency and let pip resolve it. I wrote up a patch [3] to test this approach and it seems to work. The only issue I ran into is that Cython _does_ need to be installed separately before Beam so that it can be used when installing Beam. Would this be a reasonable approach to remove the duplicate dependency specifications?
>>>>>>> >>
>>>>>>> >> Apologies if I'm rehashing an old discussion; if there's already an ML thread or PR discussing this issue, I'd be happy to take a look at it.
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> Brian
>>>>>>> >>
>>>>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>>>> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>>>> >> [3] https://github.com/apache/beam/pull/9236
>>>>>>> >>
>>>>>>> >> PS - What caused my precommit failure: I tested out the portable runner locally to debug the issue. It worked fine with the loopback worker, but failed when using the Docker worker. I eventually determined the actual cause: I unintentionally relied on a protobuf 3.8.0 feature in the code I added. The setup.py requirement, used locally in the loopback worker, specified a range that resolved to 3.9.0, but the dependency for the container was pinned to 3.6.1. I've proposed resolving the issue by just bumping the lower bound of the range in setup.py to 3.8.0, to reflect the new requirement.
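The protobuf mismatch in Brian's PS is exactly the kind of drift a build-time check could catch: the pinned container version can silently fall outside the range that setup.py (and the code) actually requires. A minimal illustration, with a naive version comparison (real tooling would use pip/setuptools version machinery rather than plain tuples):

```python
# Illustration of the mismatch described in the PS above: the code needed a
# protobuf 3.8.0 feature (i.e. a >=3.8.0,<4.0.0 range), the loopback worker
# resolved to 3.9.0, but the container pinned 3.6.1.
# Naive version parsing -- only handles plain dotted numeric versions.

def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def in_range(version, lower, upper):
    """True if lower <= version < upper (inclusive lower, exclusive upper)."""
    return version_tuple(lower) <= version_tuple(version) < version_tuple(upper)

print(in_range("3.9.0", "3.8.0", "4.0.0"))  # loopback worker's resolved version: True
print(in_range("3.6.1", "3.8.0", "4.0.0"))  # the container's pinned version: False
```

Checking every container pin against the corresponding setup.py range at build time would have surfaced this before the precommit failed.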
