gemini-code-assist[bot] commented on code in PR #38348:
URL: https://github.com/apache/beam/pull/38348#discussion_r3191057142
##########
sdks/python/apache_beam/runners/portability/stager.py:
##########
@@ -701,14 +701,29 @@ def _remove_dependency_from_requirements(
return tmp_requirements_filename
@staticmethod
- def _extract_local_packages(requirements_file):
+ def _extract_local_packages(requirements_file, temp_dir):
local_deps = []
pypi_deps = []
with open(requirements_file, 'r') as fin:
+ staging_temp_dir = tempfile.mkdtemp(dir=temp_dir)
for line in fin:
dep = line.strip()
if os.path.exists(dep):
local_deps.append(dep)
+ elif urlparse(dep).scheme:
+ last_component = dep.rsplit('/', 1)[-1]
Review Comment:

The extraction of the filename from the URL using `rsplit('/', 1)[-1]` is
fragile. It can result in invalid filenames if the URL contains query
parameters or fragments (e.g., `?v=1`). Additionally, URLs with the same
filename but different paths (e.g., `.../v1/pkg.tar.gz` and
`.../v2/pkg.tar.gz`) will overwrite each other in the temporary directory.
Consider using `urlparse` to extract the path and `os.path.basename` for the
filename to improve robustness.
```python
parsed_url = urlparse(dep)
if os.path.exists(dep):
local_deps.append(dep)
elif parsed_url.scheme:
last_component = os.path.basename(parsed_url.path)
```
##########
sdks/python/apache_beam/runners/portability/stager.py:
##########
@@ -701,14 +701,29 @@ def _remove_dependency_from_requirements(
return tmp_requirements_filename
@staticmethod
- def _extract_local_packages(requirements_file):
+ def _extract_local_packages(requirements_file, temp_dir):
local_deps = []
pypi_deps = []
with open(requirements_file, 'r') as fin:
+ staging_temp_dir = tempfile.mkdtemp(dir=temp_dir)
for line in fin:
dep = line.strip()
if os.path.exists(dep):
local_deps.append(dep)
+ elif urlparse(dep).scheme:
+ last_component = dep.rsplit('/', 1)[-1]
+ if not last_component:
+ _LOGGER.warning(
+ 'Extra package %s has an unexpected format hence will not be
downloaded locally.'
+ % dep)
+ continue
+ # Download remote package.
+ _LOGGER.info(
+ 'Downloading remote extra package: %s locally before staging',
+ dep)
+ local_file_path = FileSystems.join(staging_temp_dir, last_component)
+ Stager._download_file(dep, local_file_path)
+ local_deps.append(local_file_path)
Review Comment:

There is a bug in the requirements thinning logic. The
`Stager._remove_dependency_from_requirements` method (called after this loop)
uses the items in `local_deps` to identify which lines to remove from the
original requirements file.
By appending the local temporary path (`local_file_path`) to `local_deps`
instead of the original URL string (`dep`), the thinning logic will fail to
match the URL line in the file. Consequently, the URL will remain in the
thinned requirements file, causing the worker's pip to attempt to download it
again, which defeats the purpose of staging the package locally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]