tvalentyn commented on code in PR #36249:
URL: https://github.com/apache/beam/pull/36249#discussion_r2414792765
##########
sdks/python/apache_beam/runners/portability/stager.py:
##########
@@ -780,7 +785,12 @@ def _populate_requirements_cache(
platform_tag
])
_LOGGER.info('Executing command: %s', cmd_args)
- processes.check_output(cmd_args, stderr=processes.STDOUT)
+ output = processes.check_output(cmd_args, stderr=subprocess.STDOUT)
+ downloaded_packages = []
+ for line in output.decode('utf-8').split('\n'):
Review Comment:
When a user supplies the `--requirements_file` option, Beam stages packages so
that a runner can execute the pipeline even if the runner environment has no
access to PyPI to download the packages on the fly.
To stage packages, we download them into the local requirements cache folder
and then stage the entire folder (see the sketch after the list below). The
disadvantage is that, over time, the requirements cache folder can accumulate
packages that are no longer in `requirements.txt`, which causes unnecessary
uploads. Possible solutions:
* Clean the requirements cache folder periodically: `rm -rf
/tmp/dataflow-requirements-cache`.
* Use a custom container image (`--sdk_container_image`) instead of
`--requirements_file`, and install the packages in your image.
* Skip staging the requirements cache with `--requirements_cache=skip` (the
pipeline will then depend on PyPI at runtime; see the example below).
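
For context, here is a rough sketch of how the cache gets populated. The real
command is assembled in `_populate_requirements_cache` in `stager.py`; the
exact flag set below is an approximation, not the SDK's literal invocation:

```python
# Illustrative approximation of populating the requirements cache
# (the flag set is an assumption; the real command is built in stager.py):
import subprocess
import sys

requirements_file = 'requirements.txt'          # from --requirements_file
cache_dir = '/tmp/dataflow-requirements-cache'  # default cache location

cmd_args = [
    sys.executable, '-m', 'pip', 'download',
    '--dest', cache_dir,
    '-r', requirements_file,
    # Ignore files that are already in the cache instead of re-fetching
    # them; this is also why stale packages can linger between runs.
    '--exists-action', 'i',
]
subprocess.check_output(cmd_args, stderr=subprocess.STDOUT)
```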
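
And a minimal sketch of the last option, assuming a trivial pipeline (only the
two options are taken from this comment; everything else is a placeholder):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--requirements_file=requirements.txt',
    # Skip staging the requirements cache; workers will instead fetch the
    # packages from PyPI at runtime.
    '--requirements_cache=skip',
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['a', 'b']) | beam.Map(print)
```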