[ 
https://issues.apache.org/jira/browse/BEAM-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524090#comment-15524090
 ] 

Scott Wegner commented on BEAM-680:
-----------------------------------

/cc [~robertwb]

This came up as an issue with 
[dependency_test.test_with_requirements_file()|https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/utils/dependency_test.py#L108]
 in [PR 1005|https://github.com/apache/incubator-beam/pull/1005].

We use pip to download all required dependencies, but generate the full listing 
of files to stage by scanning the cache directory, which also picks up anything 
left over from earlier runs. Perhaps there is a way to ask pip for the 
transitive dependency list as well.
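
One possible way to get that list, sketched here and not taken from the SDK: 
have pip download into a fresh scratch directory, with the existing cache 
supplied via --find-links, and treat whatever lands in the scratch directory as 
the resolved set. The helper names (build_pip_download_command, 
resolve_requirement_files) are hypothetical, and the sketch assumes a pip that 
provides the "pip download" command (8.0 or later) and an existing cache 
directory.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def build_pip_download_command(requirements_file, dest_dir, cache_dir):
    """Build a `pip download` invocation (hypothetical helper).

    --find-links lets pip consider archives already in cache_dir as
    candidates, while --dest collects the resolved files in dest_dir.
    """
    return [
        sys.executable, '-m', 'pip', 'download',
        '--dest', dest_dir,
        '-r', requirements_file,
        '--find-links', cache_dir,
    ]


def resolve_requirement_files(requirements_file, cache_dir,
                              run=subprocess.check_call):
    """Return cache paths for only the files pip resolved for this run.

    `run` is injectable so the pip call can be replaced in tests.
    """
    scratch = tempfile.mkdtemp()
    try:
        run(build_pip_download_command(requirements_file, scratch, cache_dir))
        resolved = sorted(os.listdir(scratch))
        for name in resolved:
            target = os.path.join(cache_dir, name)
            # Keep the cache warm for later runs without re-copying
            # files that are already there.
            if not os.path.exists(target):
                shutil.copy(os.path.join(scratch, name), target)
        return [os.path.join(cache_dir, name) for name in resolved]
    finally:
        shutil.rmtree(scratch)
```

With something like this, the stager would upload only the returned paths 
instead of every file found in the cache directory, so a stray file in the 
cache would be ignored. Whether pip reuses the archives from --find-links or 
fetches them again depends on the pip version and on the version pins in the 
requirements file, so this would need checking against real runs.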

> Python Dataflow stages stale requirements-cache dependencies
> ------------------------------------------------------------
>
>                 Key: BEAM-680
>                 URL: https://issues.apache.org/jira/browse/BEAM-680
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Scott Wegner
>            Priority: Minor
>
> When executing a Python pipeline using a requirements.txt file, the Dataflow 
> runner will stage everything in its requirements cache directory: the 
> dependencies specified in the requirements.txt plus any previously cached 
> dependencies. This results in a bloated staging directory if previous 
> pipeline runs from the same machine included different dependencies.
> Repro:
> # Initialize a virtualenv and pip install apache_beam
> # Create an empty requirements.txt file
> # Create a simple pipeline using DataflowPipelineRunner and a 
> requirements.txt file, for example: 
> [my_pipeline.py|https://gist.github.com/swegner/6df00df1423b48206c4ab5a7e917218a]
> # {{touch /tmp/dataflow-requirements-cache/extra-file.txt}}
> # Run the pipeline with a specified staging directory
> # Check the staged files for the job
> 'extra-file.txt' will be uploaded with the job, along with any other cached 
> dependencies under /tmp/dataflow-requirements-cache.
> We should only be staging the dependencies necessary for a pipeline, not all 
> previously-cached dependencies found on the machine.


