On Wed, Aug 7, 2019 at 1:53 AM Ahmet Altay <[email protected]> wrote:
>
> On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>>
>>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]>
>>> wrote:
>>>
>>>> - The purpose of install_requires in setup.py is to define the
>>>> maximally permissive set of requirements for a package [1]. We don't
>>>> pin a version in setup.py without a strong reason; instead, we
>>>> typically pick a lower bound we have tested, and set an upper bound
>>>> at the next major version.
>>>> - The purpose of requirements.txt is to force pip to properly resolve
>>>> dependencies and to create a reproducible execution environment,
>>>> since pip doesn't have true dependency resolution [2].
>>>>
>>>> We currently upgrade setup.py dependencies regularly, but do not
>>>> update base_image_requirements.txt. Also, our requirements file is
>>>> not exhaustive. We need to introduce a process to fix this.
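
For illustration, a minimal sketch of the convention described above. The
package and version numbers are invented for the example; this is not
Beam's actual setup.py:

    # setup.py: maximally permissive, with a tested lower bound and an
    # upper bound at the next major version (versions are hypothetical).
    import setuptools

    setuptools.setup(
        name='example-package',
        version='0.1.0',
        install_requires=['protobuf>=3.8.0,<4'],
    )

The requirements file, by contrast, would pin a single exact version for
reproducibility, e.g. a line like "protobuf==3.9.0".
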
>>>
>>> base_image_requirements.txt currently does not drive any released Beam
>>> artifact. It would make sense to address this as part of the SDKHarness
>>> container image release process. A similar mechanism might make sense
>>> for figuring out what goes into containers for other SDKs. (Perhaps add
>>> your proposal to
>>> https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
>>> )
>>
>> +1. I will add notes to the wiki, thanks.
>>
>>>> I would recommend that in the new process:
>>>> - All dependencies of the Beam Python SDK, including transitive
>>>> dependencies, are listed in base_image_requirements.txt (or another
>>>> requirements file). "Explicit is better than implicit."
>>>> - The requirements file is regenerated whenever setup.py changes.
>>>> - When we build a container image, we check that the final image has
>>>> exactly the same versions of dependencies that were spelled out in the
>>>> requirements file (no versions added or changed).
>>>
>>> The above could be done in a two-step process:
>>> - A human-edited requirements file, like base_image_requirements today,
>>> which has a set of curated requirements.
>>> - Changes to the first file would result in a generated file with all
>>> transitive dependencies. The second file could be used as the source of
>>> truth for all dependencies at a particular commit, and could be used
>>> for the container builds.
>>>
>>>> - We also check that there are no dependency conflicts (they typically
>>>> look like: package X requires version A of dependency Y, but you will
>>>> have B, which is incompatible).
>>>
>>> +1. The container build could verify this.
>>>
>>>> - We update the versions of pinned dependencies periodically. We may
>>>> want to put all dependencies of SDK harness containers on the radar of
>>>> the Beam dependency checker.
>>>
>>> +1, but as a separate process.
>>>
>>> This process does not address how to keep setup.py in sync with the
>>> requirements file. We could use git hooks to ensure that the files are
>>> changed at the same time.
>>
>> In addition to git hooks, we could try to check that once we install
>> Apache Beam (after installing the requirements file dependencies), we
>> don't pull anything from PyPI. We could rely on pip output, or cut
>> internet access to enforce this.
>
> Good idea. This also represents an interesting use case where pipelines
> run on a private network with no PyPI access.

+1. For what it's worth, we intend to run Beam pipelines (via Flink) on a
private network.
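
A rough sketch of the enforcement idea above (the tarball path is the one
mentioned in this thread; the requirements file name is an assumption):
install the pinned requirements first, then install Beam with pip's
--no-index flag so pip cannot fall back to PyPI. If the pins are not
exhaustive, the second step fails loudly instead of silently downloading:

    # Install pinned dependencies, then forbid index access while
    # installing Beam itself; any dependency missing from the pins
    # makes this fail rather than being pulled from PyPI.
    import subprocess

    subprocess.check_call(
        ['pip', 'install', '-r', 'base_image_requirements.txt'])
    subprocess.check_call(
        ['pip', 'install', '--no-index',
         '/opt/apache/beam/tars/apache-beam.tar.gz[gcp]'])
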
>
>>>> Valentyn
>>>>
>>>> [1] https://packaging.python.org/discussions/install-requires-vs-requirements/
>>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>>
>>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the reply, I added some responses inline.
>>>>>>
>>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>>> >
>>>>>> > There is value in explicitly pinning the dependencies to be used
>>>>>> in the containers:
>>>>>> > - It reproducibly produces the same container. This will be
>>>>>> important once we start releasing Beam containers. By looking at a
>>>>>> Beam release branch, one could figure out exactly which dependencies
>>>>>> are available in a released container.
>>>>>> > - The package manager (pip) is notorious for resolving versions of
>>>>>> sub-dependencies poorly, and sometimes picks incompatible
>>>>>> dependencies.
>>>>>> > - Similarly, this reproducibility helps with tests, which can run
>>>>>> against the same container version, unaffected by the numerous
>>>>>> version changes happening in sub-dependencies.
>>>>>>
>>>>>> I thought this may be the reason; it definitely makes sense to pin
>>>>>> dependencies for reproducible builds. If that is the case, though, I
>>>>>> think we should at least change the comment in
>>>>>> base_image_requirements.txt to say so (I'm happy to write a patch
>>>>>> for that).
>>>>>
>>>>> Sounds good.
>>>>>
>>>>>> > In addition, I will argue that we will want to keep a form of
>>>>>> base_image_requirements.txt in order to be able to install
>>>>>> additional dependencies in the container, so we would not be able
>>>>>> to get rid of this mechanism entirely.
>>>>>>
>>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>>>>>> there but removes all the pinned Beam dependencies, and instead just
>>>>>> lists the Beam tar.gz as a dependency directly, along with all of
>>>>>> the additional dependencies.
>>>>>
>>>>> Got it, thanks for the clarification.
>>>>>
>>>>>> > I would suggest, if possible, ensuring that both files are
>>>>>> modified synchronously. This might be possible with precommit hooks,
>>>>>> although I am not familiar with them.
>>>>>>
>>>>>> Yeah, setting up something like that would help, but surely there's
>>>>>> a better way to lock dependencies than counting on people manually
>>>>>> updating this file (even with a precommit reminding them)? I'm not
>>>>>> that familiar with the complexities of Python package management,
>>>>>> but it seems like we should be able to set up something with pip
>>>>>> freeze or pipenv's Pipfile.lock. The thing that confounds me when I
>>>>>> try to think through that setup, though, is how to reference a
>>>>>> development build of Beam. If I run pip freeze in the container, the
>>>>>> requirement for apache-beam is "2.16.0.dev0", which of course
>>>>>> doesn't resolve when you turn around and try to install it. Maybe an
>>>>>> automated solution would just replace the apache-beam line with
>>>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>>>>
>>>>> I agree, this probably could be automated in a better way.
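
For illustration, a minimal sketch of the automation floated above (the
tarball path and requirements file name follow the thread; the rest is an
assumption): regenerate the pins from pip freeze, swapping the
unresolvable apache-beam==2.16.0.dev0 line for the local tarball:

    # Regenerate pinned requirements, replacing the dev build of
    # apache-beam with the local tarball so the file stays installable.
    import subprocess

    TARBALL = '/opt/apache/beam/tars/apache-beam.tar.gz[gcp]'

    frozen = subprocess.check_output(['pip', 'freeze']).decode().splitlines()
    with open('base_image_requirements.txt', 'w') as f:
        for line in frozen:
            if line.startswith('apache-beam=='):
                f.write(TARBALL + '\n')
            else:
                f.write(line + '\n')
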
>>>>>
>>>>> I believe just doing "pip install
>>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning will
>>>>> prevent a later error for "pip install apache-beam==2.16.0.dev0". So
>>>>> we could have the former line in the Dockerfile before installing
>>>>> the requirements.
>>>>>
>>>>>> >
>>>>>> > Ahmet
>>>>>> >
>>>>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> I recently ran into a portable Python precommit failure that led
>>>>>> me to discover that Python dependencies for the container are
>>>>>> defined in two different places, in slightly different ways: in
>>>>>> setup.py with version ranges [1], and in a
>>>>>> base_image_requirements.txt file in the container directory with
>>>>>> pinned versions [2]. The latter includes some dependencies that are
>>>>>> commonly used in user code, in addition to the Beam dependencies,
>>>>>> and states that "Specifying the versions manually helps to resolve
>>>>>> dependency conflicts with other packages installed in the
>>>>>> container."
>>>>>> >>
>>>>>> >> But isn't the purpose of a package manager to resolve those
>>>>>> conflicts? It seems that we should be able to just list the Beam
>>>>>> tar.gz as a dependency and let pip resolve it. I wrote up a patch
>>>>>> [3] to test this approach and it seems to work. The only issue I
>>>>>> ran into is that Cython _does_ need to be installed separately
>>>>>> before Beam, so that it can be used when installing Beam. Would
>>>>>> this be a reasonable approach to remove the duplicate dependency
>>>>>> specifications?
>>>>>> >>
>>>>>> >> Apologies if I'm rehashing an old discussion; if there's already
>>>>>> an ML thread or PR discussing this issue, I'd be happy to take a
>>>>>> look at it.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Brian
>>>>>> >>
>>>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>>> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>>> >> [3] https://github.com/apache/beam/pull/9236
>>>>>> >>
>>>>>> >> PS - What caused my precommit failure:
>>>>>> >> I tested out the portable runner locally to debug the issue. It
>>>>>> worked fine with the loopback worker, but failed when using the
>>>>>> Docker worker. I eventually determined the actual cause: I
>>>>>> unintentionally relied on a protobuf 3.8.0 feature in the code I
>>>>>> added. The setup.py requirement, used locally in the loopback
>>>>>> worker, specified a range that resolved to 3.9.0, but the
>>>>>> dependency for the container was pinned to 3.6.1. I've proposed to
>>>>>> resolve the issue by simply bumping the lower bound of the range in
>>>>>> setup.py to 3.8.0, to reflect the new requirement.
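
Pulling together the ordering points from this thread, a minimal sketch
(paths and file names are the ones mentioned above; expressing this as a
Python script rather than Dockerfile RUN lines is an assumption): Cython
first, so it is available when Beam's sdist builds; then the local Beam
tarball, so a later apache-beam==2.16.0.dev0 pin is already satisfied;
then the remaining pinned requirements:

    # Sketch of the container install ordering discussed in this thread.
    import subprocess

    def pip_install(*args):
        subprocess.check_call(['pip', 'install'] + list(args))

    pip_install('cython')  # must precede Beam so its sdist builds with Cython
    pip_install('/opt/apache/beam/tars/apache-beam.tar.gz[gcp]')
    pip_install('-r', 'base_image_requirements.txt')
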
