I looked at the pip download command for a bit. The alternative seems to be
to parse the output of

python -m pip download --dest . -r requirements.txt --exists-action i --no-binary :all:

and see which files were downloaded or skipped because they were already
present, then stage only the files that appear in the log output. This seems
doable, but it may break if the output format changes between pip versions,
so we'd have to add a test as well.
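
For illustration, here is a minimal sketch of that parsing step in Python. It
assumes pip prints a "Saved <path>" line for each newly downloaded file and a
"File was already downloaded <path>" line for each skipped one. Both message
formats are assumptions about pip's log output rather than a stable interface,
and files_from_pip_log is a hypothetical helper name, not something in the
SDK today:

```python
import os
import re

# Assumed shapes of pip's log lines. These are not a documented interface
# and may differ between pip versions, which is why a test would be needed.
_SAVED = re.compile(r"^\s*Saved (?P<path>\S.*)$")
_ALREADY = re.compile(r"^\s*File was already downloaded (?P<path>\S.*)$")


def files_from_pip_log(log_text):
    """Return basenames of files pip reported as saved or already present.

    Order of first appearance is preserved and duplicates are dropped.
    """
    names = []
    for line in log_text.splitlines():
        match = _SAVED.match(line) or _ALREADY.match(line)
        if match:
            name = os.path.basename(match.group("path").strip())
            if name not in names:
                names.append(name)
    return names
```

The log text would come from capturing the stdout of the command above, and
only the returned file names would then be staged from the cache directory.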

On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik <lc...@google.com> wrote:

> I think reusing the same cache directory makes sense during downloading
> but why do we upload everything that is there?
>
> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri <eh...@google.com> wrote:
>
>> Looking at the source, it seems that it should be using a
>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>> to create a different tmp directory on each run.
>>
>> Also, sampling worker no. 2:
>>
>> *jenkins@apache-beam-jenkins-2*:*~*$ ls -l /tmp/dataflow-requirements-cache/
>> total 7172
>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 *funcsigs-1.0.2.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 *mock-3.0.5.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 *PyHamcrest-1.9.0.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 *setuptools-41.2.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 *setuptools-41.4.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 *setuptools-41.5.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 *setuptools-41.5.1.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 *setuptools-41.6.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 *setuptools-42.0.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 *setuptools-42.0.1.zip*
>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 *setuptools-42.0.2.zip*
>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 *six-1.12.0.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 *six-1.13.0.tar.gz*
>>
>>
>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> Can we filter the cache directory only for the artifacts that we want
>>> and not everything that is there?
>>>
>>> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev <valen...@google.com>
>>> wrote:
>>>
>>>> Luke, I am not sure I understand the question. The caching that happens
>>>> here is implemented in the SDK for requirements packages:
>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>
>>>>
>>>> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> Is there a way to use a cache on disk that is separate from the set of
>>>>> packages we use as requirements?
>>>>>
>>>>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>>>>>
>>>>>> Thanks!
>>>>>> Another reason to periodically refresh workers.
>>>>>>
>>>>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>>>>> valen...@google.com> wrote:
>>>>>>
>>>>>>> The test jobs specify [1] a requirements.txt file that contains two
>>>>>>> entries: pyhamcrest and mock.
>>>>>>>
>>>>>>> We download [2] the sources of the packages specified in the
>>>>>>> requirements file, and of the packages they depend on. While doing so,
>>>>>>> it appears that we use a cache directory on Jenkins to store the
>>>>>>> package sources [3], perhaps to save a trip to PyPI and reduce PyPI
>>>>>>> flakiness? Then we stage the entire cache directory [4], which includes
>>>>>>> all packages ever cached. Over time, the versions that our requirements
>>>>>>> packages need change, but I guess we don't clean the cache on Jenkins
>>>>>>> workers.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>>>>>> [2]
>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>>>>>> [3]
>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>>>
>>>>>>> [4]
>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>>>>>
>>>>>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>>>>>>>
>>>>>>>> I was investigating a Dataflow postcommit test failure
>>>>>>>> (endpoints_pb2 missing), and saw this in the staging directory:
>>>>>>>>
>>>>>>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>>>>>>
>>>>>>>>
>>>>>>>> Does anyone know why so many versions of setuptools need to be
>>>>>>>> staged? Shouldn't 1 be enough?
>>>>>>>>
>>>>>>>
