On Wed, Aug 7, 2019 at 1:53 AM Ahmet Altay <[email protected]> wrote:

>
>
> On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <[email protected]>
>>> wrote:
>>>
>>>> - The purpose of install_requires in setup.py is to define the
>>>> maximally permissive set of requirements for a package [1]. We don't pin
>>>> a version in setup.py without a strong reason; instead, we typically pick
>>>> a lower bound we have tested and set an upper bound at the next major
>>>> version.
>>>> - The purpose of requirements.txt is to force pip to properly resolve
>>>> dependencies and create a reproducible execution environment, since pip
>>>> doesn't have true dependency resolution [2].
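The setup.py convention described above can be sketched as follows (the package names and version numbers here are purely illustrative, not the actual Beam requirements):

```python
# Hypothetical excerpt of an install_requires list following the
# convention above: a tested lower bound, and an upper bound at the
# next major version, rather than an exact pin.
REQUIRED_PACKAGES = [
    'protobuf>=3.8.0,<4',   # lower bound we have tested
    'pytz>=2018.3,<2019',   # upper bound: next major version
]
```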
>>>>
>>>> We currently regularly upgrade setup.py dependencies, but do not update
>>>> base_image_requirements.txt. Also, our requirements file is not exhaustive.
>>>> We need to introduce a process to fix this.
>>>>
>>>
>>> base_image_requirements.txt currently does not drive any released Beam
>>> artifact. It would make sense to address this as part of the SDKHarness
>>> container image release process. A similar mechanism might make sense for
>>> figuring out what goes into containers for other SDKs. (Perhaps add your
>>> proposal to
>>> https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
>>> )
>>>
>>
>> +1. I will add notes to the wiki, thanks.
>>
>>>> I would recommend that in the new process:
>>>> - All dependencies of the Beam Python SDK, including transitive
>>>> dependencies, are listed in base_image_requirements.txt (or another
>>>> requirements file). "Explicit is better than implicit."
>>>> - The requirements file is regenerated whenever setup.py changes.
>>>>
>>>> - When we build a container image, we check that the final image has
>>>> exactly the same versions of dependencies that were spelled out in the
>>>> requirements file (no versions added or changed).
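A minimal sketch of the image-verification step described above (function names and input format are assumptions): parse the pinned requirements file and the "pip freeze" output from the built image, then report any packages that were added or whose versions changed.

```python
# Sketch: compare pinned requirements against what actually ended up
# in the container image. Names here are hypothetical, not Beam code.

def parse_pins(lines):
    """Parse 'pkg==version' lines into a {package: version} dict,
    skipping blanks and comments."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, _, version = line.partition('==')
        pins[name.lower()] = version
    return pins

def diff_environments(required, installed):
    """Return (added, changed) package names relative to the
    requirements file."""
    added = sorted(set(installed) - set(required))
    changed = sorted(p for p in required
                     if p in installed and installed[p] != required[p])
    return added, changed
```

In practice, running "pip check" inside the image would additionally flag declared-version conflicts that a pure equality comparison misses.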
>>>>
>>>
>>> The above could be done in a two-step process:
>>> - a human-curated requirements file, like base_image_requirements today,
>>> which has a set of curated requirements.
>>> - changes to the first file would result in a generated file with all
>>> transitive dependencies. The second file could serve as the source of
>>> truth for all dependencies at a particular commit, and could be used for
>>> the container builds.
>>>
>>>
>>>> - We also check that there are no dependency conflicts (they typically
>>>> look like: Package X requires version A of dependency Y, but you will have
>>>> B, which is incompatible).
>>>>
>>>
>>> +1. Container build could verify this.
>>>
>>>
>>>> - We update the versions of pinned dependencies periodically. We may
>>>> want to put all dependencies of the SDK harness containers on the radar
>>>> of the Beam dependency checker.
>>>>
>>>
>>> +1 but a separate process.
>>>
>>> This process does not address how to keep setup.py in sync with the
>>> requirements file. We could use git hooks to ensure that both files are
>>> changed at the same time.
>>>
>>
>> In addition to git hooks, we could try to check that once we install
>> Apache Beam (after installing the requirements file dependencies), we
>> don't pull anything from PyPI. We could rely on pip's output, or cut off
>> internet access, to enforce this.
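One way to sketch the enforcement idea above (the helper name is hypothetical): build the pip invocation with --no-index, so installing the SDK cannot silently fetch anything from PyPI, and a missing transitive dependency fails the build instead.

```python
# Sketch (helper name hypothetical): construct a pip command that
# forbids contacting PyPI, so the SDK install can only be satisfied by
# the local tarball and the packages installed earlier in the image.

def offline_install_cmd(sdk_tarball):
    # --no-index: do not consult the package index; anything not
    # already installed (or available locally) makes the install fail.
    return ['pip', 'install', '--no-index', sdk_tarball]

# Usage in a build script might look like:
#   subprocess.run(offline_install_cmd('apache-beam.tar.gz'), check=True)
```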
>>
>
> Good idea. This also represents an interesting use case where pipelines
> run on a private network with no PyPI access.
>

+1. For what it's worth, we intend to run Beam pipelines (via Flink) on a
private network.


>>>> Valentyn
>>>>
>>>> [1]
>>>> https://packaging.python.org/discussions/install-requires-vs-requirements/
>>>> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>>>>
>>>> On Sat, Aug 3, 2019 at 2:47 AM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the reply, I added some responses inline.
>>>>>>
>>>>>> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <[email protected]> wrote:
>>>>>> >
>>>>>> > There is value in explicitly pinning the dependencies to be used
>>>>>> in the containers:
>>>>>> > - It reproducibly produces the same container. This will be
>>>>>> important once we start releasing Beam containers. By looking at a Beam
>>>>>> release branch, one could figure out exactly the set of dependencies
>>>>>> available in a released container.
>>>>>> >
>>>>>> > - The package manager (pip) is notoriously bad at resolving
>>>>>> versions for sub-dependencies, and sometimes picks incompatible
>>>>>> dependencies.
>>>>>> > - Similarly, this reproducibility helps with tests, which can run
>>>>>> against the same container version, unaffected by the numerous version
>>>>>> changes happening in sub-dependencies.
>>>>>>
>>>>>> I thought this might be the reason; it definitely makes sense to pin
>>>>>> dependencies for reproducible builds. If that is the case, though, I
>>>>>> think we should at least change the comment in
>>>>>> base_image_requirements.txt to say so (I'm happy to write a patch for
>>>>>> that).
>>>>>>
>>>>>
>>>>> Sounds good.
>>>>>
>>>>>
>>>>>>
>>>>>> > In addition, I would argue that we will want to keep some form of
>>>>>> base_image_requirements.txt in order to be able to install additional
>>>>>> dependencies in the container, so we would not be able to get rid of
>>>>>> this mechanism entirely.
>>>>>>
>>>>>> My PR doesn't remove base_image_requirements.txt. It keeps the file
>>>>>> there but removes all the pinned Beam dependencies, and instead just
>>>>>> lists the Beam tar.gz as a dependency directly, along with all of the
>>>>>> additional dependencies.
>>>>>>
>>>>>
>>>>> Got it, thanks for the clarification.
>>>>>
>>>>>
>>>>>>
>>>>>> >
>>>>>> >
>>>>>> > I would suggest, if possible, ensuring that both files are modified
>>>>>> synchronously. This might be possible with precommit hooks, although I
>>>>>> am not familiar with them.
>>>>>>
>>>>>> Yeah, setting up something like that would help, but surely there's a
>>>>>> better way to lock dependencies than to count on people manually
>>>>>> updating this file (even with a precommit reminding them)? I'm not
>>>>>> that familiar with the complexities of Python package management, but
>>>>>> it seems like we should be able to set up something with pip freeze or
>>>>>> pipenv's Pipfile.lock. The thing that confounds me when I try to think
>>>>>> through that setup, though, is how to reference a development build of
>>>>>> Beam. If I run pip freeze in the container, the requirement for
>>>>>> apache-beam is "2.16.0.dev0", which of course doesn't resolve when you
>>>>>> turn around and try to install it. Maybe an automated solution would
>>>>>> just replace the apache-beam line with
>>>>>> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
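The automation floated above could be sketched as a small post-processing step over "pip freeze" output (the helper name is hypothetical; the tarball path is the one quoted in this thread):

```python
# Sketch: rewrite the unresolvable dev-build line from `pip freeze`
# output to point at the local SDK tarball instead. The function name
# is hypothetical; the path comes from the discussion above.
BEAM_TARBALL = '/opt/apache/beam/tars/apache-beam.tar.gz[gcp]'

def localize_beam(frozen_lines, tarball=BEAM_TARBALL):
    """Replace an 'apache-beam==X.Y.Z.dev0' pin with the local tarball
    path, leaving every other frozen requirement untouched."""
    return [tarball if line.startswith('apache-beam==') else line
            for line in frozen_lines]
```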
>>>>>>
>>>>>
>>>>> I agree, this probably could be automated in a better way.
>>>>>
>>>>> I believe just doing "pip
>>>>> install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
>>>>> will prevent a later error for "pip install apache-beam==2.16.0.dev0".
>>>>> So we could have the former line in the Dockerfile before installing
>>>>> the requirements.
>>>>>
>>>>>
>>>>>>
>>>>>> >
>>>>>> >
>>>>>> > Ahmet
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Aug 2, 2019 at 2:20 PM Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> I recently ran into a portable Python precommit failure that led
>>>>>> me to discover that Python dependencies for the container are defined
>>>>>> in two different places, in slightly different ways: in setup.py with
>>>>>> version ranges [1], and in a base_image_requirements.txt file in the
>>>>>> container directory with pinned versions [2]. The latter includes some
>>>>>> dependencies that are commonly used in user code, in addition to the
>>>>>> Beam dependencies, and states that "Specifying the versions manually
>>>>>> helps to resolve dependency conflicts with other packages installed in
>>>>>> the container."
>>>>>> >>
>>>>>> >> But isn't the purpose of a package manager to resolve those
>>>>>> conflicts? It seems that we should be able to just list the Beam tar.gz
>>>>>> as a dependency and let pip resolve it. I wrote up a patch [3] to test
>>>>>> this approach and it seems to work. The only issue I ran into is that
>>>>>> Cython _does_ need to be installed separately before Beam so that it
>>>>>> can be used when installing Beam. Would this be a reasonable approach
>>>>>> to remove the duplicate dependency specifications?
>>>>>> >>
>>>>>> >> Apologies if I'm rehashing an old discussion; if there's already
>>>>>> an ML thread or PR discussing this issue, I'd be happy to take a look
>>>>>> at it.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Brian
>>>>>> >>
>>>>>> >> [1]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>>> >> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>>> >> [3] https://github.com/apache/beam/pull/9236
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> PS - What caused my precommit failure:
>>>>>> >> I tested out the portable runner locally to debug the issue. It
>>>>>> worked fine with the loopback worker, but failed when using the docker
>>>>>> worker. I eventually determined the actual cause: I unintentionally
>>>>>> relied on a protobuf 3.8.0 feature in the code I added. The setup.py
>>>>>> requirement, used locally in the loopback worker, specified a range
>>>>>> which resolved to 3.9.0, but the dependency for the container was
>>>>>> pinned to 3.6.1. I've proposed to resolve the issue by just bumping
>>>>>> the lower bound of the range in setup.py to 3.8.0, to reflect the new
>>>>>> requirement.
>>>>>>
>>>>>
