asf-tooling commented on issue #1177:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1177#issuecomment-4409758928

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `bug_fix`  •  **Classification:** `actionable`  •  **Confidence:** 
`medium`
   **Application domain(s):** `artifact_upload`
   
   ### Summary
   The issue requests expanding archive type detection in the 
quarantine/extraction code beyond the current `.tar.gz`, `.tgz`, `.zip`. @sbp 
cited exarch's `detect.rs` showing it supports `.tar`, `.bz2`/`.tbz`/`.tbz2` 
(TarBz2), `.xz`/`.txz` (TarXz), `.zst`/`.tzst` (TarZst), `.zip`, and `.7z`. 
@dave2wave additionally noted `.whl`/`.war`/`.jar`/`.nar`, but @sbp confirmed 
exarch won't understand those. Two code locations need updating: the 
`QUARANTINE_ARCHIVE_SUFFIXES` constant in `atr/detection.py` (which I can't 
read but is referenced from `atr/tasks/quarantine.py`), and the hardcoded tuple 
in `atr/archives.py:list_archive`.
   
   ### Where this lives in the code today
   
   #### `atr/archives.py` — `list_archive` (lines 79-92)
   _needs modification_
   Hardcodes only .tar.gz, .tgz, .zip — needs to include all exarch-supported 
formats.
   
   ```python
   async def list_archive(file_path: safe.StatePath) -> list[str] | None:
       """Attempt to list contents of supported archive files."""
       if not await aiofiles.os.path.isfile(file_path):
           return None
       if not file_path.name.endswith((".tar.gz", ".tgz", ".zip")):
           return None
   
       def _read_archive() -> list[str] | None:
           with contextlib.suppress(Exception):
               manifest = exarch.list_archive(str(file_path), 
extraction_config())
               return sorted(entry.path for entry in manifest.entries)
           return None
   
       return await asyncio.to_thread(_read_archive)
   ```
   
   #### `atr/tasks/quarantine.py` — `_is_archive_suffix` (lines 265-267)
   _currently does this_
   Uses detection.QUARANTINE_ARCHIVE_SUFFIXES to decide if a file should be 
extracted during backfill — the constant needs expanding.
   
   ```python
   def _is_archive_suffix(filename: str) -> bool:
       lower_name = filename.lower()
       return any(lower_name.endswith(suffix) for suffix in 
detection.QUARANTINE_ARCHIVE_SUFFIXES)
   ```
   
   #### `atr/tasks/quarantine.py` — `validate` (lines 96-107)
   _currently does this_
   The quarantine validation entry point — archive entries are passed in via 
task_args.archives, detection of what IS an archive happens earlier.
   
   ```python
   @checks.with_model(args.QuarantineValidate)
   async def validate(task_args: args.QuarantineValidate) -> results.Results | 
None:
       async with db.session() as data:
           quarantined = await data.quarantined(id=task_args.quarantined_id, 
_release=True).get()
   
       if quarantined is None:
           log.error(f"Quarantined row {task_args.quarantined_id} not found")
           return None
   
       if quarantined.status != sql.QuarantineStatus.PENDING:
           log.error(f"Quarantined row {task_args.quarantined_id} is not 
PENDING")
           return None
   ```
   
   ### Where new code would go
   - `atr/detection.py` — QUARANTINE_ARCHIVE_SUFFIXES constant
     This file contains the QUARANTINE_ARCHIVE_SUFFIXES constant referenced by 
quarantine.py — needs expansion to cover all exarch-supported formats.
   
   ### Proposed approach
   There are two changes needed:
   
   1. **`atr/detection.py`**: Expand `QUARANTINE_ARCHIVE_SUFFIXES` to include 
all archive formats supported by exarch: `.tar`, `.tar.gz`, `.tgz`, `.tar.bz2`, 
`.tbz`, `.tbz2`, `.tar.xz`, `.txz`, `.tar.zst`, `.tzst`, `.zip`, `.7z`. Note 
that `.whl`/`.war`/`.jar`/`.nar` (zip-based Java/Python packages) are NOT 
supported by exarch per @sbp's analysis and should not be added until upstream 
support is confirmed.
   
   2. **`atr/archives.py`**: Update `list_archive()` to use the same expanded 
suffix set rather than its own hardcoded tuple. Ideally import 
`QUARANTINE_ARCHIVE_SUFFIXES` from `atr.detection` (or define a shared 
constant) to keep the two in sync.
   
   The `.bz2` and `.xz` suffixes alone (without `.tar.` prefix) are treated by 
exarch as TarBz2/TarXz respectively, which may be incorrect for standalone 
compressed files, but exarch handles the detection internally once we pass the 
file through. Non-archive compressed files will simply fail extraction 
gracefully (the `contextlib.suppress(Exception)` in `list_archive` and the 
error handling in quarantine will catch this).
   
   ### Suggested patches
   
   #### `atr/archives.py`
   Expand list_archive to support all exarch-supported formats instead of 
hardcoded .tar.gz/.tgz/.zip
   
   ````diff
   --- a/atr/archives.py
   +++ b/atr/archives.py
   @@ -17,6 +17,7 @@
    import asyncio
    import contextlib
    import os
   +from collections.abc import Sequence
    from typing import Final
    
    import aiofiles.os
   @@ -27,6 +28,24 @@
    import atr.models.safe as safe
    
    MAX_ARCHIVE_MEMBERS: Final[int] = 100_000
   +
   +# All archive suffixes supported by exarch (cf. exarch-core detect.rs).
   +# Ordered longest-first so multi-part extensions match before single-part 
ones.
   +SUPPORTED_ARCHIVE_SUFFIXES: Final[tuple[str, ...]] = (
   +    ".tar.gz",
   +    ".tar.bz2",
   +    ".tar.xz",
   +    ".tar.zst",
   +    ".tgz",
   +    ".tbz",
   +    ".tbz2",
   +    ".txz",
   +    ".tzst",
   +    ".tar",
   +    ".zip",
   +    ".7z",
   +)
   +
    
    
    class ExtractionError(Exception):
   @@ -83,7 +102,7 @@
    async def list_archive(file_path: safe.StatePath) -> list[str] | None:
        """Attempt to list contents of supported archive files."""
        if not await aiofiles.os.path.isfile(file_path):
            return None
   -    if not file_path.name.endswith((".tar.gz", ".tgz", ".zip")):
   +    if not file_path.name.lower().endswith(SUPPORTED_ARCHIVE_SUFFIXES):
            return None
    
        def _read_archive() -> list[str] | None:
   ````
   
   #### `atr/detection.py`
   Expand QUARANTINE_ARCHIVE_SUFFIXES to cover all exarch-supported archive 
formats (cannot see full file, proposing change to the constant)
   
   ````diff
   --- a/atr/detection.py
   +++ b/atr/detection.py
   @@ -1,3 +1,5 @@
   +# TODO: locate exact line of QUARANTINE_ARCHIVE_SUFFIXES in this file and 
replace.
   +# The constant should be updated to:
   +QUARANTINE_ARCHIVE_SUFFIXES: tuple[str, ...] = (
   +    ".tar.gz",
   +    ".tar.bz2",
   +    ".tar.xz",
   +    ".tar.zst",
   +    ".tgz",
   +    ".tbz",
   +    ".tbz2",
   +    ".txz",
   +    ".tzst",
   +    ".tar",
   +    ".zip",
   +    ".7z",
   +)
   ````
   
   ### Open questions
   - I cannot read `atr/detection.py` to see the current value of 
`QUARANTINE_ARCHIVE_SUFFIXES` or its exact location — the diff for that file is 
illustrative only.
   - Should `QUARANTINE_ARCHIVE_SUFFIXES` in detection.py and 
`SUPPORTED_ARCHIVE_SUFFIXES` in archives.py be consolidated into a single 
source of truth? Currently they'd be duplicated.
   - Should `.whl`/`.war`/`.jar`/`.nar` (zip-format archives) be supported via 
rename-to-.zip or by contributing upstream to exarch? @sbp confirmed exarch 
rejects these currently.
   - Issue #553 was referenced by @dave2wave as having an inventory of file 
suffixes — that may inform whether any additional suffixes are needed beyond 
what exarch supports.
   - Should standalone `.bz2`/`.xz`/`.zst` files (not tarballs) be included? 
exarch treats them as TarBz2/TarXz/TarZst which may fail for non-tar content — 
but extraction failures are handled gracefully.
   
   ### Files examined
   - `atr/archives.py`
   - `atr/tasks/quarantine.py`
   - `atr/post/upload.py`
   - `atr/get/file.py`
   - `atr/get/download.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to