Added some ideas discussed here to the SDK container release process
working doc
<https://docs.google.com/document/d/1IKE_aEkrAzkzUE4pD_r_zVuL5amHGetJ1efnbTfmunM/edit#>
that
Hannah started.



On Tue, Aug 6, 2019 at 10:33 PM Sam Bourne <[email protected]> wrote:

>
> On Wed, Aug 7, 2019 at 1:53 AM Ahmet Altay <[email protected]> wrote:
>
>>
>>
>> On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]>
>>>> wrote:
>>>>
>>>>> - The purpose of install_requires in setup.py is to define the
>>>>> maximally permissive set of requirements for a package [1]. We don't pin
>>>>> a version in setup.py without a strong reason; instead, we typically pick
>>>>> a lower bound we have tested, and set the upper bound to the next major
>>>>> version.
>>>>> - The purpose of requirements.txt is to force pip to properly resolve
>>>>> dependencies and create a reproducible execution environment, since pip
>>>>> doesn’t have true dependency resolution [2].
>>>>>
>>>>> We currently regularly upgrade setup.py dependencies, but do not
>>>>> update base_image_requirements.txt. Also, our requirements file is not
>>>>> exhaustive. We need to introduce a process to fix this.
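[Editor's note: the range-vs-pin relationship described above can be checked mechanically. A minimal sketch, with hypothetical helper names and deliberately simplified version handling; real tooling should use pip's own specifier parsing:]

```python
# Sketch (hypothetical helpers, simplified versions): check that a version
# pinned in a requirements file satisfies a ">=lower,<upper" range of the
# kind setup.py declares.

def parse_version(s):
    # "3.8.0" -> (3, 8, 0); ignores pre-release/dev suffixes for simplicity
    return tuple(int(p) for p in s.split("."))

def pin_within_range(pinned, lower, upper):
    # True when lower <= pinned < upper, compared as version tuples
    return parse_version(lower) <= parse_version(pinned) < parse_version(upper)

# Mirrors the protobuf case described later in the thread:
print(pin_within_range("3.6.1", "3.8.0", "4.0.0"))  # False: pin below the tested lower bound
print(pin_within_range("3.9.0", "3.8.0", "4.0.0"))  # True: pin inside the declared range
```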
>>>>>
>>>>
>>>> base_image_requirements.txt currently does not drive any released Beam
>>>> artifact. It would make sense to address this as part of the SDKHarness
>>>> container image release process. A similar mechanism might make sense for
>>>> figuring out what goes into containers for other SDKs. (Perhaps add your
>>>> proposal to
>>>> https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
>>>> )
>>>>
>>>
>>> +1. I will add notes to the wiki, thanks.
>>>
>>>
>>>>
>>>>
>>>
>>>>
>>>>> I would recommend that, in the new process:
>>>>> - All dependencies of the Beam Python SDK, including transitive
>>>>> dependencies, are listed in base_image_requirements.txt (or another
>>>>> requirements file). "Explicit is better than implicit."
>>>>> - The requirements file is regenerated whenever setup.py changes.
>>>>>
>>>>> - When we build a container image, we check that the final image has
>>>>> exactly the same versions of dependencies that were spelled out in the
>>>>> requirements file (no versions added or changed).
>>>>>
>>>>
>>>> The above could be done in a two-step process:
>>>> - a human-generated requirements file, like base_image_requirements
>>>> today, which has a set of curated requirements.
>>>> - Changes to the first file would result in a generated file with all
>>>> transitive dependencies. The second file could be used as the source of
>>>> truth for all dependencies at a particular commit, and could be used
>>>> for the container builds.
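[Editor's note: one way to read the two-file idea, as a sketch with hypothetical file contents. The real generation step would use pip itself; the point here is only the consistency check between curated and generated pins:]

```python
# Sketch of the two-step idea: a curated file of direct dependencies, and a
# generated file that pins everything (including transitive deps). Before using
# the generated file as the source of truth, assert it honors every curated pin.

def parse_pins(text):
    # "pkg==1.2.3" lines -> {"pkg": "1.2.3"}; blanks and comments skipped
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins

curated = parse_pins("future==0.16.0\ngrpcio==1.22.0")
generated = parse_pins("future==0.16.0\ngrpcio==1.22.0\nsix==1.12.0")

conflicts = {p for p in curated if generated.get(p) != curated[p]}
print(sorted(conflicts))  # [] -> the generated file agrees with every curated pin
```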
>>>>
>>>>
>>>>> - We also check that there are no dependency conflicts (they typically
>>>>> look like: Package X requires version A of dependency Y, but you will have
>>>>> B, which is incompatible).
>>>>>
>>>>
>>>> +1. Container build could verify this.
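[Editor's note: a toy illustration of the conflict check mentioned above; in practice `pip check` performs this against real installed-package metadata, so this only shows the shape of the check:]

```python
# Toy conflict check: flag the classic "X requires Y<2.0, but you will have 2.3"
# situation by testing each installed version against its dependents' constraints.

def satisfies(version, op, bound):
    # Compare dotted versions as integer tuples under a single operator
    v = tuple(int(p) for p in version.split("."))
    b = tuple(int(p) for p in bound.split("."))
    return {"<": v < b, "<=": v <= b, ">": v > b, ">=": v >= b, "==": v == b}[op]

installed = {"Y": "2.3.0"}
# Hypothetical declared constraint: package X requires Y < 2.0.0
conflicts = []
for dependent, dep, op, bound in [("X", "Y", "<", "2.0.0")]:
    if not satisfies(installed[dep], op, bound):
        conflicts.append(f"{dependent} requires {dep}{op}{bound}, "
                         f"but you will have {dep} {installed[dep]}")
print(conflicts[0])  # X requires Y<2.0.0, but you will have Y 2.3.0
```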
>>>>
>>>>
>>>>> - We update the versions of pinned dependencies periodically. We may
>>>>> want to put all dependencies of SDK harness containers on the radar of
>>>>> the Beam dependency checker.
>>>>>
>>>>
>>>> +1 but a separate process.
>>>>
>>>> This process does not address how to keep setup.py in sync with the
>>>> requirements file. We could use git hooks to ensure that the files are
>>>> changed at the same time.
>>>>
>>>
>>> In addition to git hooks, we could try to check that once we install
>>> Apache Beam (after installing the requirements file dependencies), we don't
>>> pull anything from PyPI. We could rely on pip output or cut internet
>>> access to enforce this.
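[Editor's note: a sketch of the "rely on pip output" variant. The log lines below are an assumption; pip's output format varies across versions, so a real check would need to match the deployed pip:]

```python
# Sketch: after the requirements file has been installed, a final install of the
# Beam tarball should be satisfied entirely from local packages. Any
# "Downloading" line in pip's log reveals a dependency that was missing from
# the requirements file and silently fetched from PyPI.

pip_log = """\
Processing /opt/apache/beam/tars/apache-beam.tar.gz
Requirement already satisfied: grpcio<2,>=1.8
Downloading https://files.pythonhosted.org/packages/xx/six-1.12.0.tar.gz
"""

unexpected = [ln for ln in pip_log.splitlines() if ln.startswith("Downloading")]
print(len(unexpected))  # 1 -> six was not pinned in the requirements file
```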
>>>
>>
>> Good idea. This also represents an interesting use case where pipelines
>> run on a private network with no PyPI access.
>>
>
> +1. For what it's worth, we intend to run Beam pipelines (via Flink) on a
> private network.
>
>
>>
>>>
>>>
>>>>
>>>>> Valentyn
>>>>>
>>>>> [1]
>>>>> https://packaging.python.org/discussions/install-requires-vs-requirements/
>>>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>>>
>>>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the reply, I added some responses inline.
>>>>>>>
>>>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>>>> >
>>>>>>> > There is a value in explicitly pinning the dependencies to be used
>>>>>>> in the containers:
>>>>>>> > - It reproducibly produces the same container. This will be
>>>>>>> important once we start releasing Beam containers. By looking at a Beam
>>>>>>> release branch, one could figure out exactly the set of dependencies
>>>>>>> available in a released container.
>>>>>>> >
>>>>>>> > - The package manager (pip) is notorious for how it resolves
>>>>>>> versions for sub-dependencies, and sometimes picks incompatible
>>>>>>> dependencies.
>>>>>>> > - Similarly, this reproducibility is helpful with tests, which can
>>>>>>> work on the same container version, unaffected by the numerous dependency
>>>>>>> version changes happening in sub-dependencies.
>>>>>>>
>>>>>>> I thought this might be the reason; it definitely makes sense to pin
>>>>>>> dependencies for reproducible builds. If that is the case, though, I
>>>>>>> think we should at least change the comment in
>>>>>>> base_image_requirements.txt to say so (I'm happy to write a patch for
>>>>>>> that).
>>>>>>>
>>>>>>
>>>>>> Sounds good.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> > In addition, I would argue that we will want to keep some form of
>>>>>>> base_image_requirements.txt in order to be able to install additional
>>>>>>> dependencies on the container, so we would not be able to get rid of
>>>>>>> this mechanism.
>>>>>>>
>>>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>>>>>>> there but removes all the pinned beam dependencies, and instead just
>>>>>>> lists the beam tar.gz as a dependency directly, along with all of the
>>>>>>> additional dependencies.
>>>>>>>
>>>>>>
>>>>>> Got it, thanks for the clarification.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> >
>>>>>>> >
>>>>>>> > I would suggest, if possible, ensuring that both files are modified
>>>>>>> synchronously. This might be possible with precommit hooks, although I am
>>>>>>> not familiar with them.
>>>>>>>
>>>>>>> Yeah, setting up something like that would help, but surely there's a
>>>>>>> better way to lock dependencies than to count on people manually
>>>>>>> updating this file (even with a precommit reminding them)? I'm not
>>>>>>> that familiar with the complexities of python package management, but
>>>>>>> it seems like we should be able to set up something with pip freeze or
>>>>>>> pipenv's Pipfile.lock. The thing that confounds me when I try to think
>>>>>>> through that setup, though, is how to reference a development build of
>>>>>>> beam. If I run pip freeze in the container, the requirement for
>>>>>>> apache-beam is "2.16.0.dev0", which of course doesn't resolve when you
>>>>>>> turn around and try to install it. Maybe an automated solution would
>>>>>>> just replace the apache-beam line with
>>>>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>>>>>>>
>>>>>>
>>>>>> I agree, this probably could be automated in a better way.
>>>>>>
>>>>>> I believe just doing "pip
>>>>>> install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
>>>>>> will prevent a later error from "pip install apache-beam==2.16.0.dev0", so
>>>>>> we could have the former line in the Dockerfile before installing the
>>>>>> requirements.
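[Editor's note: the automation Brian describes can be sketched directly. The tarball path and dev version come from the thread; the other pins are illustrative:]

```python
# Sketch: take `pip freeze` output from the container and swap the unresolvable
# dev-build line for the local tarball, producing an installable lock list.

freeze_output = """\
apache-beam==2.16.0.dev0
grpcio==1.22.0
protobuf==3.9.0
"""

TARBALL = "/opt/apache/beam/tars/apache-beam.tar.gz[gcp]"
locked = [
    TARBALL if line.startswith("apache-beam==") else line
    for line in freeze_output.splitlines()
]
print(locked[0])  # /opt/apache/beam/tars/apache-beam.tar.gz[gcp]
```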
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> >
>>>>>>> >
>>>>>>> > Ahmet
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I recently ran into a portable python precommit failure that led
>>>>>>> me to discover that python dependencies for the container are defined in
>>>>>>> two different places, in slightly different ways: in setup.py with version
>>>>>>> ranges [1], and in a base_image_requirements.txt file in the container
>>>>>>> directory with pinned versions [2]. The latter includes some dependencies
>>>>>>> that are commonly used in user code, in addition to the Beam dependencies,
>>>>>>> and states that "Specifying the versions manually helps to resolve
>>>>>>> dependency conflicts with other packages installed in the container."
>>>>>>> >>
>>>>>>> >> But isn't the purpose of a package manager to resolve those
>>>>>>> conflicts? It seems that we should be able to just list the beam tar.gz as
>>>>>>> a dependency and let pip resolve it. I wrote up a patch [3] to test this
>>>>>>> approach and it seems to work. The only issue I ran into is that cython
>>>>>>> _does_ need to be installed separately before beam so that it can be used
>>>>>>> when installing beam. Would this be a reasonable approach to remove the
>>>>>>> duplicate dependency specifications?
>>>>>>> >>
>>>>>>> >> Apologies if I'm rehashing an old discussion; if there's already
>>>>>>> an ML thread or PR discussing this issue, I'd be happy to take a look at
>>>>>>> it.
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> Brian
>>>>>>> >>
>>>>>>> >> [1]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>>>> >> [2]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>>>> >> [3] https://github.com/apache/beam/pull/9236
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> PS - What caused my precommit failure:
>>>>>>> >> I tested out the portable runner locally to debug the issue. It
>>>>>>> worked fine with the loopback worker, but failed when using the docker
>>>>>>> worker. I eventually determined the actual cause: I had unintentionally
>>>>>>> relied on a protobuf 3.8.0 feature in the code I added. The setup.py
>>>>>>> requirement, used locally in the loopback worker, specified a range that
>>>>>>> resolved to 3.9.0, but the dependency for the container was pinned to
>>>>>>> 3.6.1. I've proposed resolving the issue by bumping the lower bound of
>>>>>>> the range in setup.py to 3.8.0, to reflect the new requirement.
>>>>>>>
>>>>>>
