Re: Python staging file weirdness
Filed https://issues.apache.org/jira/browse/BEAM-8900 to address the inefficiency discussed here. Thanks everyone.

On Thu, Dec 5, 2019 at 2:53 PM Valentyn Tymofieiev wrote:
> Note that so far we have not been staging wheels, since the SDK does not
> know the target platform, but there is
> https://issues.apache.org/jira/browse/BEAM-4032 to add this support.
Re: Python staging file weirdness
Note that so far we have not been staging wheels, since the SDK does not know the target platform, but there is https://issues.apache.org/jira/browse/BEAM-4032 to add this support.

On Thu, Dec 5, 2019 at 2:35 PM Chad Dombrova wrote:
> Yes, I just did a quick test to confirm:
>
> # download or build wheels of anything that's missing from the cache
> # note: we're including gcp extras:
> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
> # copy some of those wheels somewhere else
> # note: we're excluding gcp extras
> pip download apache_beam==2.16 --no-binary --find-links=/tmp/wheel-cache --dest /tmp/wheel-dest/
> # rerun to confirm that cached wheels are being re-used instead of
> # downloaded from PyPI
> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
>
> /tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache,
> excluding the gcp extras.
>
> Note that for some reason the equal sign after --find-links is required, at
> least for me on pip 19.1.1. Using a space resulted in an error.
>
> -chad
Re: Python staging file weirdness
On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev wrote:
> Ah nice, so then the workflow would be: download [missing] deps from PyPI
> into a long-lived cache directory, then download (copy) the same deps into
> a short-lived temporary directory, using the long-lived cache directory as
> the source of truth, then stage files from the short-lived temporary
> directory and clean it up. Is that what you are suggesting, Chad?

Yes, I just did a quick test to confirm:

# download or build wheels of anything that's missing from the cache
# note: we're including gcp extras:
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
# copy some of those wheels somewhere else
# note: we're excluding gcp extras
pip download apache_beam==2.16 --no-binary --find-links=/tmp/wheel-cache --dest /tmp/wheel-dest/
# rerun to confirm that cached wheels are being re-used instead of
# downloaded from PyPI
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache

/tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache, excluding the gcp extras.

Note that for some reason the equal sign after --find-links is required, at least for me on pip 19.1.1. Using a space resulted in an error.

-chad
Re: Python staging file weirdness
Ah nice, so then the workflow would be: download [missing] deps from PyPI into a long-lived cache directory, then download (copy) the same deps into a short-lived temporary directory, using the long-lived cache directory as the source of truth, then stage files from the short-lived temporary directory and clean it up. Is that what you are suggesting, Chad?
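
The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the stager's actual code: `stage_requirements`, `build_pip_commands`, and the `stage_file` callback are hypothetical names, the pip flags are the ones mentioned elsewhere in this thread plus `--no-index`, and `run` is injectable so the flow can be tried without touching PyPI.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def build_pip_commands(requirements_file, cache_dir, staging_dir):
    """Return the two pip invocations: fill the cache, then copy from it."""
    common = [sys.executable, "-m", "pip", "download",
              "-r", requirements_file, "--no-binary", ":all:"]
    # Step 1: download anything missing from PyPI into the long-lived cache.
    populate = common + ["--dest", cache_dir, "--exists-action", "i"]
    # Step 2: resolve the same requirements again, but only from the cache,
    # writing the result into the short-lived staging directory.
    copy = common + ["--dest", staging_dir,
                     "--no-index", "--find-links=" + cache_dir]
    return populate, copy


def stage_requirements(requirements_file, cache_dir, stage_file,
                       run=subprocess.check_call):
    """Stage only the files that the current requirements resolve to.

    stage_file is a hypothetical callback that uploads one local file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    staging_dir = tempfile.mkdtemp(prefix="requirements-staging-")
    try:
        for command in build_pip_commands(
                requirements_file, cache_dir, staging_dir):
            run(command)
        staged = sorted(os.listdir(staging_dir))
        for name in staged:
            stage_file(os.path.join(staging_dir, name))
        return staged
    finally:
        # The temporary directory is per-run, so stale versions cannot pile up.
        shutil.rmtree(staging_dir, ignore_errors=True)
```

The point of the second directory is that its contents are determined only by the current requirements file, while the cache is free to keep every version it has ever seen.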
Re: Python staging file weirdness
Another way to copy only the deps you care about is to use `pip download` to do the copy. I believe you can provide the cache dir to `pip download --find-links` and it will read from that before reading from PyPI (you may also need to set --wheel-dir to the cache dir as well), and thus it acts as a simple copy.

-chad

On Thu, Dec 5, 2019 at 12:07 PM Valentyn Tymofieiev wrote:
> Looked for a bit at the pip download command. The alternative seems to be
> to parse the output of
>
> python -m pip download --dest . -r requirements.txt --exists-action i --no-binary :all:
>
> and see which files were downloaded and/or skipped since they were already
> present, and then stage only the files that appear in the log output. Seems
> doable but may break if pip output changes between pip implementations, so
> we'd have to add a test as well.
Re: Python staging file weirdness
Looked for a bit at the pip download command. The alternative seems to be to parse the output of

python -m pip download --dest . -r requirements.txt --exists-action i --no-binary :all:

and see which files were downloaded and/or skipped since they were already present, and then stage only the files that appear in the log output. Seems doable but may break if pip output changes between pip implementations, so we'd have to add a test as well.

On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik wrote:
> I think reusing the same cache directory makes sense during downloading
> but why do we upload everything that is there?
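
A rough sketch of the log-parsing alternative mentioned above, to show how fragile it would be. The two message prefixes are assumptions about what pip prints for a fresh download and for a file that was already present; they are not taken from this thread and may differ between pip versions, which is exactly the risk raised in the message. `files_from_pip_log` is a hypothetical helper name.

```python
import os

# Assumed message prefixes; real pip output may differ between versions.
_MARKERS = ("Saved ", "File was already downloaded ")


def files_from_pip_log(log_text):
    """Return base names of files pip reported as saved or already present."""
    names = []
    for line in log_text.splitlines():
        stripped = line.strip()
        for marker in _MARKERS:
            if stripped.startswith(marker):
                name = os.path.basename(stripped[len(marker):].strip())
                if name and name not in names:
                    names.append(name)
                break
    return names
```

Only the files returned by such a parser would be staged, so a change in pip's wording would silently result in nothing being staged unless a test guards it.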
Re: Python staging file weirdness
I think reusing the same cache directory makes sense during downloading but why do we upload everything that is there?

On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri wrote:
> Looking at the source, it seems that it should be using
> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
> to create a different tmp directory on each run.
>
> Also, sampling worker no. 2:
>
> jenkins@apache-beam-jenkins-2:~$ ls -l /tmp/dataflow-requirements-cache/
> total 7172
> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 funcsigs-1.0.2.tar.gz
> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 mock-3.0.5.tar.gz
> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 PyHamcrest-1.9.0.tar.gz
> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 setuptools-41.2.0.zip
> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 setuptools-41.4.0.zip
> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 setuptools-41.5.0.zip
> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 setuptools-41.5.1.zip
> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 setuptools-41.6.0.zip
> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 setuptools-42.0.0.zip
> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 setuptools-42.0.1.zip
> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 setuptools-42.0.2.zip
> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 six-1.12.0.tar.gz
> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 six-1.13.0.tar.gz
>
> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik wrote:
>> Can we filter the cache directory only for the artifacts that we want and
>> not everything that is there?
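
A small sketch of why the quoted expression yields a shared directory rather than a fresh one per run. In the standard library, `tempfile.gettempdir()` returns the system temporary directory itself, so joining a fixed name onto it gives the same path every time, which would be consistent with the `/tmp/dataflow-requirements-cache/` listing above. `tempfile.mkdtemp()` is the call that creates a unique directory. The function names and the `dataflow-requirements-` prefix here are illustrative only.

```python
import os
import tempfile


def shared_cache_dir():
    """Path built the way the quoted expression builds it: stable across runs."""
    return os.path.join(tempfile.gettempdir(), "dataflow-requirements-cache")


def per_run_dir():
    """A directory that is unique to each call (the caller must remove it)."""
    return tempfile.mkdtemp(prefix="dataflow-requirements-")
```

A stable path is what makes the directory useful as a cache; the trouble discussed in this thread comes from staging its entire contents.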
Re: Python staging file weirdness
Can we filter the cache directory only for the artifacts that we want and not everything that is there?

On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev wrote:
> Luke, I am not sure I understand the question. The caching that happens
> here is implemented in the SDK for requirements packages:
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
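
One way to picture the filtering asked about above, as a sketch only: given the exact versions the current run needs, pick just the matching source archives out of the cache. The `wanted` mapping is an assumption (the thread does not say how the needed versions would be obtained), the file-name parsing only handles simple `name-version` sdist names, and all function names are hypothetical.

```python
import re

_SDIST_SUFFIXES = (".tar.gz", ".tar.bz2", ".zip", ".tgz")


def normalize(name):
    """Lower-case a project name and collapse runs of -, _ and . into -."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_sdist_filename(filename):
    """Return (normalized name, version) or None if it is not an sdist name."""
    for suffix in _SDIST_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[:-len(suffix)]
            name, sep, version = stem.rpartition("-")
            if sep and name and version:
                return normalize(name), version
    return None


def select_artifacts(cache_filenames, wanted):
    """Keep only cache files whose name and version appear in wanted.

    wanted maps a normalized project name to the exact version needed.
    """
    selected = []
    for filename in sorted(cache_filenames):
        parsed = split_sdist_filename(filename)
        if parsed and wanted.get(parsed[0]) == parsed[1]:
            selected.append(filename)
    return selected
```

With a cache holding several setuptools archives and a `wanted` entry for only one version, only that single archive is selected.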
Re: Python staging file weirdness
Luke, I am not sure I understand the question. The caching that happens here is implemented in the SDK for requirements packages:
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161

On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik wrote:
> Is there a way to use a cache on disk that is separate from the set of
> packages we use as requirements?
Re: Python staging file weirdness
Is there a way to use a cache on disk that is separate from the set of packages we use as requirements?

On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri wrote:
> Thanks!
> Another reason to periodically refresh workers.
Re: Python staging file weirdness
Thanks!
Another reason to periodically refresh workers.

On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev wrote:
> The test jobs specify [1] a requirements.txt file that contains two
> entries: pyhamcrest, mock.
>
> We download [2] sources of packages specified in the requirements file,
> and packages they depend on. While doing so, it appears that we use a cache
> directory on Jenkins to store the sources of the packages [3], perhaps to
> save a trip to PyPI and reduce PyPI flakiness? Then, we stage the entire
> cache directory [4], which includes all packages ever cached. Over time, the
> versions that our requirements packages need change, but I guess we don't
> clean the cache on Jenkins workers.
>
> [1] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
> [2] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
> [3] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
> [4] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>
> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri wrote:
>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>> missing), and saw this in the staging directory:
>>
>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>
>> Does anyone know why so many versions of setuptools need to be staged?
>> Shouldn't 1 be enough?
Re: Python staging file weirdness
The test jobs specify [1] a requirements.txt file that contains two entries: pyhamcrest and mock.

We download [2] the sources of the packages listed in the requirements file, along with the packages they depend on. While doing so, it appears that we use a cache directory on Jenkins to store the package sources [3], perhaps to save a trip to PyPI and reduce PyPI flakiness? Then we stage the entire cache directory [4], which includes every package ever cached. Over time, the versions that our required packages need change, but I guess we don't clean the cache on the Jenkins workers.

[1] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
[2] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
[3] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
[4] https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172

On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri wrote:

> I was investigating a Dataflow postcommit test failure (endpoints_pb2
> missing), and saw this in the staging directory:
>
> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>
> Does anyone know why so many versions of setuptools need to be staged?
> Shouldn't 1 be enough?
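
To make the behavior described above concrete, here is a minimal sketch of one possible alternative: resolve the requirements into a fresh, short-lived directory and stage only that directory, while still pointing pip at the long-lived cache. This is not the code in stager.py; the function names build_pip_download_cmd and populate_staging_dir are hypothetical, and the sketch assumes pip is available to the running interpreter. Only standard pip options are used (download, --dest, -r, --no-binary, --find-links).

```python
import subprocess
import sys
import tempfile


def build_pip_download_cmd(requirements_file, dest_dir, cache_dir):
    """Build a pip command that downloads source packages into dest_dir.

    cache_dir is passed via --find-links so that pip can consider files
    already present there as candidates, in addition to the index.
    """
    return [
        sys.executable, '-m', 'pip', 'download',
        '--dest', dest_dir,
        '-r', requirements_file,
        '--no-binary', ':all:',        # sources only, no wheels
        '--find-links=' + cache_dir,   # long-lived cache as an extra source
    ]


def populate_staging_dir(requirements_file, cache_dir):
    """Hypothetical helper: return a fresh directory that holds only the
    packages the current requirements resolve to.

    The caller stages the returned directory and then deletes it.
    """
    dest_dir = tempfile.mkdtemp(prefix='staging-')
    subprocess.check_call(
        build_pip_download_cmd(requirements_file, dest_dir, cache_dir))
    return dest_dir
```

With this shape, stale versions left in the cache by earlier runs would not be staged, because only the contents of the temporary directory are uploaded. The sketch does not add newly downloaded files back to the long-lived cache; that would need a separate step.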
Python staging file weirdness
I was investigating a Dataflow postcommit test failure (endpoints_pb2 missing), and saw this in the staging directory:

$ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz

Does anyone know why so many versions of setuptools need to be staged? Shouldn't 1 be enough?
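
For anyone who wants to check a staging directory for the same symptom, the following is a small sketch that groups staged source archives by package and reports those present in more than one version. The helper names are hypothetical and the filename pattern is an assumption that only covers simple name-version.tar.gz and name-version.zip archives like the ones listed above; it is not part of Beam.

```python
import re
from collections import defaultdict

# Matches source-archive names such as "setuptools-41.2.0.zip".
# Assumption: the version starts with a digit and contains no hyphen.
_SDIST_RE = re.compile(
    r'^(?P<name>.+)-(?P<version>\d[^-]*)\.(?:tar\.gz|zip)$')


def versions_by_package(paths):
    """Map lower-cased package name -> sorted list of version strings."""
    found = defaultdict(set)
    for path in paths:
        filename = path.rsplit('/', 1)[-1]  # drop any directory prefix
        match = _SDIST_RE.match(filename)
        if match:
            found[match.group('name').lower()].add(match.group('version'))
    # Plain string sort; good enough for display, not a real version order.
    return {name: sorted(versions) for name, versions in found.items()}


def duplicated_packages(paths):
    """Return only the packages that appear in more than one version."""
    return {name: versions
            for name, versions in versions_by_package(paths).items()
            if len(versions) > 1}


# File names taken from the listing above (directory prefix omitted).
STAGED = [
    'PyHamcrest-1.9.0.tar.gz', 'dataflow-worker.jar',
    'dataflow_python_sdk.tar', 'funcsigs-1.0.2.tar.gz',
    'mock-3.0.5.tar.gz', 'pipeline.pb', 'requirements.txt',
    'setuptools-41.2.0.zip', 'setuptools-41.4.0.zip',
    'setuptools-41.5.0.zip', 'setuptools-41.5.1.zip',
    'setuptools-41.6.0.zip', 'setuptools-42.0.0.zip',
    'setuptools-42.0.1.zip', 'six-1.12.0.tar.gz', 'six-1.13.0.tar.gz',
]

if __name__ == '__main__':
    for name, versions in sorted(duplicated_packages(STAGED).items()):
        print(name, versions)
```

Run against the listing above, this flags setuptools (seven versions) and six (two versions); files such as dataflow-worker.jar and pipeline.pb are ignored because they do not match the archive pattern.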