[ https://issues.apache.org/jira/browse/BEAM-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524090#comment-15524090 ]
Scott Wegner commented on BEAM-680:
-----------------------------------

/cc [~robertwb]

This came up as an issue with [dependency_test.test_with_requirements_file()|https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/utils/dependency_test.py#L108] in [PR 1005|https://github.com/apache/incubator-beam/pull/1005]. We use pip to download all required dependencies, but we generate the full listing of files to stage by scanning the cache directory, so anything already sitting in the cache is picked up as well. Perhaps there is a way to ask pip for the transitive dependency list instead.

> Python Dataflow stages stale requirements-cache dependencies
> ------------------------------------------------------------
>
>                 Key: BEAM-680
>                 URL: https://issues.apache.org/jira/browse/BEAM-680
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Scott Wegner
>            Priority: Minor
>
> When executing a Python pipeline using a requirements.txt file, the Dataflow
> runner will stage all dependencies found in its requirements cache
> directory, including those specified in the requirements.txt and any
> previously cached dependencies. This results in a bloated staging directory if
> previous pipeline runs from the same machine included different dependencies.
> Repro:
> # Initialize a virtualenv and pip install apache_beam
> # Create an empty requirements.txt file
> # Create a simple pipeline using DataflowPipelineRunner and a
> requirements.txt file, for example:
> [my_pipeline.py|https://gist.github.com/swegner/6df00df1423b48206c4ab5a7e917218a]
> # {{touch /tmp/dataflow-requirements-cache/extra-file.txt}}
> # Run the pipeline with a specified staging directory
> # Check the staged files for the job
> 'extra-file.txt' will be uploaded with the job, along with any other cached
> dependencies under /tmp/dataflow-requirements-cache.
> We should only be staging the dependencies necessary for a pipeline, not all
> previously-cached dependencies found on the machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
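One possible direction, sketched here rather than taken from the SDK: keep scanning a directory, but give each distinct requirements.txt its own cache subdirectory, keyed by a hash of the file's contents. Stale files left by other pipelines then never appear in the listing. The helper names (`cache_dir_for_requirements`, `files_to_stage`) and the 16-character digest prefix are assumptions made for illustration, and the pip download step itself is left out.

```python
import hashlib
import os


def cache_dir_for_requirements(requirements_file, cache_root):
    """Return a cache subdirectory keyed by the requirements file's contents.

    Hypothetical helper, not part of the Beam SDK.
    """
    with open(requirements_file, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    # A shortened digest keeps the path readable; the length is arbitrary.
    path = os.path.join(cache_root, digest[:16])
    os.makedirs(path, exist_ok=True)
    return path


def files_to_stage(requirements_file, cache_root):
    """List only the files cached for this particular requirements file."""
    cache_dir = cache_dir_for_requirements(requirements_file, cache_root)
    # In a real flow, pip would download into cache_dir at this point.
    return sorted(
        os.path.join(cache_dir, name) for name in os.listdir(cache_dir)
    )
```

With this layout, the `extra-file.txt` from the repro would sit in the cache root and would not be listed for any requirements file. The trade-off is that packages shared by two different requirements files are downloaded once per file.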