asf-tooling commented on issue #1177:
URL:
https://github.com/apache/tooling-trusted-releases/issues/1177#issuecomment-4409758928
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@2da7807a`
**Type:** `bug_fix` • **Classification:** `actionable` • **Confidence:**
`medium`
**Application domain(s):** `artifact_upload`
### Summary
The issue requests expanding archive type detection in the
quarantine/extraction code beyond the current `.tar.gz`, `.tgz`, `.zip`. @sbp
cited exarch's `detect.rs` showing it supports `.tar`, `.bz2`/`.tbz`/`.tbz2`
(TarBz2), `.xz`/`.txz` (TarXz), `.zst`/`.tzst` (TarZst), `.zip`, and `.7z`.
@dave2wave additionally noted `.whl`/`.war`/`.jar`/`.nar`, but @sbp confirmed
exarch won't understand those. Two code locations need updating: the
`QUARANTINE_ARCHIVE_SUFFIXES` constant in `atr/detection.py` (which I can't
read but is referenced from `atr/tasks/quarantine.py`), and the hardcoded tuple
in `atr/archives.py:list_archive`.
### Where this lives in the code today
#### `atr/archives.py` — `list_archive` (lines 79-92)
_needs modification_
Hardcodes only .tar.gz, .tgz, .zip — needs to include all exarch-supported
formats.
```python
async def list_archive(file_path: safe.StatePath) -> list[str] | None:
"""Attempt to list contents of supported archive files."""
if not await aiofiles.os.path.isfile(file_path):
return None
if not file_path.name.endswith((".tar.gz", ".tgz", ".zip")):
return None
def _read_archive() -> list[str] | None:
with contextlib.suppress(Exception):
manifest = exarch.list_archive(str(file_path),
extraction_config())
return sorted(entry.path for entry in manifest.entries)
return None
return await asyncio.to_thread(_read_archive)
```
#### `atr/tasks/quarantine.py` — `_is_archive_suffix` (lines 265-267)
_currently does this_
Uses detection.QUARANTINE_ARCHIVE_SUFFIXES to decide if a file should be
extracted during backfill — the constant needs expanding.
```python
def _is_archive_suffix(filename: str) -> bool:
lower_name = filename.lower()
return any(lower_name.endswith(suffix) for suffix in
detection.QUARANTINE_ARCHIVE_SUFFIXES)
```
#### `atr/tasks/quarantine.py` — `validate` (lines 96-107)
_currently does this_
The quarantine validation entry point — archive entries are passed in via
task_args.archives, detection of what IS an archive happens earlier.
```python
@checks.with_model(args.QuarantineValidate)
async def validate(task_args: args.QuarantineValidate) -> results.Results |
None:
async with db.session() as data:
quarantined = await data.quarantined(id=task_args.quarantined_id,
_release=True).get()
if quarantined is None:
log.error(f"Quarantined row {task_args.quarantined_id} not found")
return None
if quarantined.status != sql.QuarantineStatus.PENDING:
log.error(f"Quarantined row {task_args.quarantined_id} is not
PENDING")
return None
```
### Where new code would go
- `atr/detection.py` — QUARANTINE_ARCHIVE_SUFFIXES constant
This file contains the QUARANTINE_ARCHIVE_SUFFIXES constant referenced by
quarantine.py — needs expansion to cover all exarch-supported formats.
### Proposed approach
There are two changes needed:
1. **`atr/detection.py`**: Expand `QUARANTINE_ARCHIVE_SUFFIXES` to include
all archive formats supported by exarch: `.tar`, `.tar.gz`, `.tgz`, `.tar.bz2`,
`.tbz`, `.tbz2`, `.tar.xz`, `.txz`, `.tar.zst`, `.tzst`, `.zip`, `.7z`. Note
that `.whl`/`.war`/`.jar`/`.nar` (zip-based Java/Python packages) are NOT
supported by exarch per @sbp's analysis and should not be added until upstream
support is confirmed.
2. **`atr/archives.py`**: Update `list_archive()` to use the same expanded
suffix set rather than its own hardcoded tuple. Ideally import
`QUARANTINE_ARCHIVE_SUFFIXES` from `atr.detection` (or define a shared
constant) to keep the two in sync.
The `.bz2` and `.xz` suffixes alone (without `.tar.` prefix) are treated by
exarch as TarBz2/TarXz respectively, which may be incorrect for standalone
compressed files, but exarch handles the detection internally once we pass the
file through. Non-archive compressed files will simply fail extraction
gracefully (the `contextlib.suppress(Exception)` in `list_archive` and the
error handling in quarantine will catch this).
### Suggested patches
#### `atr/archives.py`
Expand list_archive to support all exarch-supported formats instead of
hardcoded .tar.gz/.tgz/.zip
````diff
--- a/atr/archives.py
+++ b/atr/archives.py
@@ -17,6 +17,7 @@
import asyncio
import contextlib
import os
+from collections.abc import Sequence
from typing import Final
import aiofiles.os
@@ -27,6 +28,24 @@
import atr.models.safe as safe
MAX_ARCHIVE_MEMBERS: Final[int] = 100_000
+
+# All archive suffixes supported by exarch (cf. exarch-core detect.rs).
+# Ordered longest-first so multi-part extensions match before single-part
ones.
+SUPPORTED_ARCHIVE_SUFFIXES: Final[tuple[str, ...]] = (
+ ".tar.gz",
+ ".tar.bz2",
+ ".tar.xz",
+ ".tar.zst",
+ ".tgz",
+ ".tbz",
+ ".tbz2",
+ ".txz",
+ ".tzst",
+ ".tar",
+ ".zip",
+ ".7z",
+)
+
class ExtractionError(Exception):
@@ -83,7 +102,7 @@
async def list_archive(file_path: safe.StatePath) -> list[str] | None:
"""Attempt to list contents of supported archive files."""
if not await aiofiles.os.path.isfile(file_path):
return None
- if not file_path.name.endswith((".tar.gz", ".tgz", ".zip")):
+ if not file_path.name.lower().endswith(SUPPORTED_ARCHIVE_SUFFIXES):
return None
def _read_archive() -> list[str] | None:
````
#### `atr/detection.py`
Expand QUARANTINE_ARCHIVE_SUFFIXES to cover all exarch-supported archive
formats (cannot see full file, proposing change to the constant)
````diff
--- a/atr/detection.py
+++ b/atr/detection.py
@@ -1,3 +1,5 @@
+# TODO: locate exact line of QUARANTINE_ARCHIVE_SUFFIXES in this file and
replace.
+# The constant should be updated to:
+QUARANTINE_ARCHIVE_SUFFIXES: tuple[str, ...] = (
+ ".tar.gz",
+ ".tar.bz2",
+ ".tar.xz",
+ ".tar.zst",
+ ".tgz",
+ ".tbz",
+ ".tbz2",
+ ".txz",
+ ".tzst",
+ ".tar",
+ ".zip",
+ ".7z",
+)
````
### Open questions
- I cannot read `atr/detection.py` to see the current value of
`QUARANTINE_ARCHIVE_SUFFIXES` or its exact location — the diff for that file is
illustrative only.
- Should `QUARANTINE_ARCHIVE_SUFFIXES` in detection.py and
`SUPPORTED_ARCHIVE_SUFFIXES` in archives.py be consolidated into a single
source of truth? Currently they'd be duplicated.
- Should `.whl`/`.war`/`.jar`/`.nar` (zip-format archives) be supported via
rename-to-.zip or by contributing upstream to exarch? @sbp confirmed exarch
rejects these currently.
- Issue #553 was referenced by @dave2wave as having an inventory of file
suffixes — that may inform whether any additional suffixes are needed beyond
what exarch supports.
- Should standalone `.bz2`/`.xz`/`.zst` files (not tarballs) be included?
exarch treats them as TarBz2/TarXz/TarZst which may fail for non-tar content —
but extraction failures are handled gracefully.
### Files examined
- `atr/archives.py`
- `atr/tasks/quarantine.py`
- `atr/post/upload.py`
- `atr/get/file.py`
- `atr/get/download.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]