On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]>
wrote:

> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>
>>
>>
>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> - The purpose of install_requires in setup.py is to define the maximally
>>> permissive set of requirements for a package [1]. We don't pin a version
>>> in setup.py without a strong reason; instead, we typically pick a lower
>>> bound we have tested, and set the upper bound to the next major version.
>>> - The purpose of requirements.txt is to force pip to properly resolve
>>> dependencies and create a reproducible execution environment, since pip
>>> doesn't have true dependency resolution [2].
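>>>
>>> For example (illustrative version numbers, not our actual pins):
>>>
>>>   # setup.py: tested lower bound, next major version as the upper bound
>>>   install_requires=['protobuf>=3.6.0,<4'],
>>>
>>>   # requirements.txt: exact pin for a reproducible environment
>>>   protobuf==3.6.1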
>>>
>>> We regularly upgrade setup.py dependencies but do not update
>>> base_image_requirements.txt, and our requirements file is not exhaustive.
>>> We need to introduce a process to fix this.
>>>
>>
>> base_image_requirements.txt currently does not drive any released Beam
>> artifact. It would make sense to address this as part of the SDKHarness
>> container image release process. A similar mechanism might make sense for
>> figuring out what goes into containers for other SDKs. (Perhaps add your
>> proposal to
>> https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
>> )
>>
>
> +1. I will add notes to the wiki, thanks.
>
>>> I would recommend that in the new process:
>>> - All dependencies of the Beam Python SDK, including transitive
>>> dependencies, are listed in base_image_requirements.txt (or another
>>> requirements file). "Explicit is better than implicit."
>>> - The requirements file is regenerated whenever setup.py changes.
>>>
>>> - When we build a container image, we check that the final image has
>>> exactly the same versions of dependencies that were spelled out in the
>>> requirements file (no versions added or changed).
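>>>
>>> For example, the check could diff the installed set against the file
>>> (a sketch, assuming the requirements file is exhaustive):
>>>
>>>   pip freeze | sort > /tmp/installed.txt
>>>   diff <(sort base_image_requirements.txt) /tmp/installed.txt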
>>>
>>
>> The above could be done in a two-step process:
>> - a human-edited requirements file, like base_image_requirements today,
>> which has a set of curated requirements.
>> - Changes to the first file would result in a generated file with all
>> transitive dependencies. The second file could be used as the source of
>> truth for all dependencies at a particular commit, and could be used for
>> the container builds.
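>>
>> For example (a rough sketch, assuming we adopt something like pip-tools,
>> which we do not use today):
>>
>>   # base_image_requirements.in: the curated, human-edited file
>>   # the generated output pins every transitive dependency exactly
>>   pip-compile base_image_requirements.in \
>>       --output-file base_image_requirements.txt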
>>
>>
>>> - We also check that there are no dependency conflicts (they typically
>>> look like: Package X requires version A of dependency Y, but you will have
>>> B, which is incompatible).
>>>
>>
>> +1. Container build could verify this.
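>>
>> For example, the container build could simply run:
>>
>>   # exits non-zero if any installed package has a missing or
>>   # incompatible requirement
>>   pip check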
>>
>>
>>> - We update the versions of pinned dependencies periodically. We may
>>> want to put all dependencies of the SDK harness containers on the radar of
>>> the Beam dependency checker.
>>>
>>
>> +1 but a separate process.
>>
>> This process does not address how to keep setup.py in sync with the
>> requirements file. We could use git hooks to ensure that the files are
>> changed at the same time.
>>
>
> In addition to git hooks, we could try to check that once we install
> Apache Beam (after installing the requirements file dependencies), we don't
> pull anything from PyPI. We could rely on pip output or cut internet
> access to enforce this.
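>
> A sketch of the idea (the tarball path is the one our containers already
> use):
>
>   pip install -r base_image_requirements.txt
>   # with the index disabled, this fails loudly if installing Beam would
>   # still need to fetch anything from PyPI
>   pip install --no-index /opt/apache/beam/tars/apache-beam.tar.gz[gcp]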
>

Good idea. This also represents an interesting use case where pipelines run
on a private network with no PyPI access.


>>> Valentyn
>>>
>>> [1]
>>> https://packaging.python.org/discussions/install-requires-vs-requirements/
>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>
>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for the reply, I added some responses inline.
>>>>>
>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>> >
>>>>> > There is a value in explicitly pinning the dependencies to be used
>>>>> in the containers:
>>>>> > - It reproducibly produces the same container. This will be
>>>>> important once we start releasing Beam containers. By looking at a Beam
>>>>> release branch, one could figure out exactly the set of dependencies
>>>>> available in a released container.
>>>>> >
>>>>> > - The package manager (pip) is notorious for how it resolves
>>>>> versions of sub-dependencies, and sometimes picks incompatible
>>>>> dependencies.
>>>>> > - Similarly, this reproducibility is helpful with tests, which can run
>>>>> on the same container version unaffected by the numerous dependency
>>>>> version changes happening in sub-dependencies.
>>>>>
>>>>> I thought this might be the reason; it definitely makes sense to pin
>>>>> dependencies for reproducible builds. If that is the case, though, I
>>>>> think we should at least change the comment in
>>>>> base_image_requirements.txt to say so (I'm happy to write a patch for
>>>>> that).
>>>>>
>>>>
>>>> Sounds good.
>>>>
>>>>
>>>>>
>>>>> > In addition, I will argue that we will want to keep a form of
>>>>> base_image_requirements.txt in order to be able to install additional
>>>>> dependencies in the container, so we would not be able to get rid of this
>>>>> mechanism.
>>>>>
>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>>>>> there but removes all the pinned beam dependencies, and instead just
>>>>> lists the beam tar.gz as a dependency directly, along with all of the
>>>>> additional dependencies.
>>>>>
>>>>
>>>> Got it, thanks for the clarification.
>>>>
>>>>
>>>>>
>>>>> >
>>>>> >
>>>>> > I will suggest, if possible, ensuring that both files are modified
>>>>> synchronously. This might be possible with precommit hooks, although I
>>>>> am not familiar with them.
>>>>>
>>>>> Yeah setting up something like that would help, but surely there's a
>>>>> better way to lock dependencies than to count on people manually
>>>>> updating this file (even with a precommit reminding them)? I'm not
>>>>> that familiar with the complexities of python package management, but
>>>>> it seems like we should be able to set up something with pip freeze or
>>>>> pipenv's Pipfile.lock. The thing that confounds me when I try to think
>>>>> through that setup though is how to reference a development build of
>>>>> beam. If I run pip freeze in the container the requirement for
>>>>> apache-beam is "2.16.0.dev0" which of course doesn't resolve when you
>>>>> turn around and try to install it. Maybe an automated solution would
>>>>> just replace the apache-beam line with
>>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
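>>>>>
>>>>> Something like this (untested sketch) is what I have in mind:
>>>>>
>>>>>   # regenerate the pins, swapping the dev build for the local tarball
>>>>>   pip freeze \
>>>>>     | sed 's|^apache-beam.*|/opt/apache/beam/tars/apache-beam.tar.gz[gcp]|' \
>>>>>     > base_image_requirements.txt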
>>>>>
>>>>
>>>> I agree, this probably could be automated in a better way.
>>>>
>>>> I believe just doing "pip
>>>> install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
>>>> will prevent a later error for "pip install apache-beam==2.16.0.dev0". So
>>>> we could have the former line in the Dockerfile before installing the
>>>> requirements.
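>>>>
>>>> Roughly (a sketch, not the actual Dockerfile):
>>>>
>>>>   # installing the local SDK first means the apache-beam==2.16.0.dev0
>>>>   # pin in the requirements file is already satisfied and pip will not
>>>>   # try to fetch it from PyPI
>>>>   RUN pip install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]
>>>>   RUN pip install -r base_image_requirements.txt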
>>>>
>>>>
>>>>>
>>>>> >
>>>>> >
>>>>> > Ahmet
>>>>> >
>>>>> >
>>>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> I recently ran into a portable python precommit failure that led me
>>>>> to discover that python dependencies for the container are defined in two
>>>>> different places, in slightly different ways: in setup.py with version
>>>>> ranges [1], and in a base_image_requirements.txt file in the container
>>>>> directory with pinned versions [2]. The latter includes some dependencies
>>>>> that are commonly used in user code, in addition to the Beam dependencies,
>>>>> and states that "Specifying the versions manually helps to resolve
>>>>> dependency conflicts with other packages installed in the container."
>>>>> >>
>>>>> >> But isn't the purpose of a package manager to resolve those
>>>>> conflicts? It seems that we should be able to just list the beam tar.gz as
>>>>> a dependency and let pip resolve it. I wrote up a patch [3] to test this
>>>>> approach and it seems to work. The only issue I ran into is that cython
>>>>> _does_ need to be installed separately before beam so that it can be used
>>>>> when installing beam. Would this be a reasonable approach to remove the
>>>>> duplicate dependency specifications?
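>>>>> >>
>>>>> >> In Dockerfile terms, the approach amounts to roughly (my
>>>>> >> paraphrase, not the literal patch):
>>>>> >>
>>>>> >>   # cython must already be present when beam's setup.py runs, so
>>>>> >>   # it gets its own pip invocation first
>>>>> >>   RUN pip install cython
>>>>> >>   # the requirements file now lists the beam tar.gz directly
>>>>> >>   RUN pip install -r base_image_requirements.txt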
>>>>> >>
>>>>> >> Apologies if I'm rehashing an old discussion, if there's already an
>>>>> ML thread or PR discussing this issue I'd be happy to take a look at it.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Brian
>>>>> >>
>>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>> >> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>> >> [3] https://github.com/apache/beam/pull/9236
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> PS - What caused my precommit failure:
>>>>> >> I tested out the portable runner locally to debug the issue. It
>>>>> worked fine with the loopback worker, but failed when using the docker
>>>>> worker. I eventually determined the actual cause: I unintentionally relied
>>>>> on a protobuf 3.8.0 feature in the code I added. The setup.py requirement,
>>>>> used locally in the loopback worker, specified a range requirement which
>>>>> resolved to 3.9.0, but the dependency for the container was pinned to
>>>>> 3.6.1. I've proposed to resolve the issue by just bumping the lower bound
>>>>> of the range in setup.py to 3.8.0, to reflect the new requirement.
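>>>>>
>>>>> In setup.py that would look something like changing the protobuf entry
>>>>> to 'protobuf>=3.8.0,<4' (illustrative; the exact upper bound stays
>>>>> whatever the current entry uses).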
>>>>>
>>>>
