asf-tooling commented on issue #1154:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4404147270

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@837830e8`
   
   **Type:** `new_feature`  •  **Classification:** `no_action`  •  **Confidence:** `high`
   **Application domain(s):** `compliance_verification`
   
   ### Summary
   This issue requests adding SWHIDs (Software Heritage Identifiers) as a new
   verification mechanism for release archives. The team has already decided on
   the implementation approach: wrapping the Rust reference implementation
   (`swhid-rs`) in a Python package (`asfswhid`). @andrewmusselman built the
   Python wrapper at github.com/andrewmusselman/swhid-py and contributed
   gitoxide (MIT-licensed) support upstream to avoid a dependency on GPL2
   libgit2. Work is currently blocked until the upstream `swhid-rs` maintainer
   merges the gitoxide PR. No ATR-internal changes can proceed until the
   external package is available.
   
   ### Where this lives in the code today
   
   #### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-94)
   _currently does this_
   Existing source tree comparison check that validates archive content matches 
GitHub checkout; SWHID would provide a complementary cross-format comparison 
without needing rsync-based diffing.
   
   ```python
   async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
       recorder = await args.recorder(CHECK_VERSION)
       is_source = await recorder.primary_path_is_source()
       if not is_source:
           log.info(
               "Skipping compare.source_trees because the input is not a source artifact",
               project=args.project_key,
               version=args.version_key,
               revision=args.revision_number,
               path=args.primary_rel_path,
           )
           return None
   ```
   
   #### `atr/archives.py` — `extract` (lines 36-66)
   _extension point_
   Archive extraction infrastructure that SWHID computation would leverage — 
after extraction, the directory tree can be walked to compute the SWHID dir 
identifier.
   
   ```python
   def extract(
       archive_path: safe.StatePath,
       extract_dir: str,
       max_size: int,
       chunk_size: int = 4096,
       track_files: bool | set[str] = False,
   ) -> tuple[int, list[str]]:
       # chunk_size retained for signature stability; exarch manages buffering internally.
       del chunk_size
       log.info(f"Extracting {archive_path} to {extract_dir}")
   
       cfg = _build_extraction_config(max_size)
       extracted_paths: list[str] = []
   
       if isinstance(track_files, set) and track_files:
           try:
               manifest = exarch.list_archive(str(archive_path), cfg)
           except Exception as exc:
               raise ExtractionError(f"Failed to list archive: {exc}") from exc
           for entry in manifest.entries:
               if os.path.basename(entry.path) in track_files:
                   extracted_paths.append(entry.path)
   
       try:
           report = exarch.extract_archive(str(archive_path), extract_dir, cfg)
       except exarch.QuotaExceededError as exc:
           raise ExtractionError(f"Extraction exceeded size limit of {max_size} bytes: {exc}") from exc
       except Exception as exc:
           raise ExtractionError(f"Failed to extract archive: {exc}") from exc
   
       return report.bytes_written, extracted_paths
   ```
   
   #### `atr/tasks/checks/__init__.py` — `FunctionArguments` (lines 47-55)
   _extension point_
   The standard check function arguments dataclass — a new SWHID check would 
receive these same arguments and use the recorder pattern.
   
   ```python
   @dataclasses.dataclass
   class FunctionArguments:
       recorder: Callable[[str | None], Awaitable[Recorder]]
       asf_uid: str
       project_key: safe.ProjectKey
       version_key: safe.VersionKey
       revision_number: safe.RevisionNumber
       primary_rel_path: safe.RelPath | None
       extra_args: dict[str, Any]
   ```
   
   #### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-361)
   _extension point_
   Utility to resolve extracted archive directories — a SWHID check would use 
this to get the extracted tree for computing the directory identifier.
   
   ```python
   async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
       """Resolve the extracted archive directory for the primary archive."""
       if args.primary_rel_path is None:
           return None
       release_key = sql.release_key(str(args.project_key), str(args.version_key))
       revision_seq = int(str(args.revision_number))
       async with db.session() as data:
           content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
       if content_hash is None:
           abs_path = file_paths.revision_path_for_file(
               args.project_key, args.version_key, args.revision_number, str(args.primary_rel_path)
           )
           if await aiofiles.os.path.isfile(abs_path):
               content_hash = await hashes.compute_file_hash(abs_path)
       if content_hash is None:
           return None
       archive_key = hashes.filesystem_archives_key(content_hash)
       archive_dir = file_paths.get_archives_dir() / str(args.project_key) / str(args.version_key) / archive_key
       if await aiofiles.os.path.isdir(archive_dir):
           return archive_dir
       return None
   ```
   
   #### `atr/hashes.py` — `compute_file_hash` (lines 36-42)
   _currently does this_
   Current file hashing approach — SWHID computation would be a new hash 
function (Merkle tree over directory structure) that might live alongside these 
or in a dedicated module.
   
   ```python
   async def compute_file_hash(path: str | os.PathLike) -> str:
       path = pathlib.Path(path)
       hasher = blake3.blake3()
       async with aiofiles.open(path, "rb") as f:
           while chunk := await f.read(_HASH_CHUNK_SIZE):
               hasher.update(chunk)
       return f"blake3:{hasher.hexdigest()}"
   ```
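
To make the "Merkle tree over directory structure" idea concrete, here is a minimal pure-Python sketch of a SWHID `dir` computation, which uses the same object hashing as git trees and blobs. It is only an illustration: the function names (`content_id`, `directory_id`, `swhid_dir`) are hypothetical, it is not the API of `asfswhid` or `swhid-rs`, and details such as how empty directories and special files are treated are assumptions that would need to be checked against the reference implementation.

```python
import hashlib
import os
import stat


def _git_object_id(kind: str, payload: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 of a git-style header plus payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()


def content_id(data: bytes) -> bytes:
    """Identifier of a file's contents (same value as a git blob id)."""
    return _git_object_id("blob", data)


def directory_id(root: str) -> bytes:
    """Identifier of a directory tree (same value as a git tree id)."""
    entries = []
    with os.scandir(root) as it:
        for entry in it:
            name = os.fsencode(entry.name)
            if entry.is_symlink():
                # A symlink is hashed as a blob holding its target path.
                target = os.fsencode(os.readlink(entry.path))
                mode, oid = b"120000", content_id(target)
            elif entry.is_dir(follow_symlinks=False):
                mode, oid = b"40000", directory_id(entry.path)
            elif entry.is_file(follow_symlinks=False):
                st_mode = entry.stat(follow_symlinks=False).st_mode
                mode = b"100755" if st_mode & stat.S_IXUSR else b"100644"
                # Reads the whole file into memory; fine for a sketch only.
                with open(entry.path, "rb") as f:
                    oid = content_id(f.read())
            else:
                continue  # assumption: skip sockets, devices, and similar
            # Git orders entries as if directory names ended with "/".
            sort_key = name + b"/" if mode == b"40000" else name
            entries.append((sort_key, mode, name, oid))
    entries.sort(key=lambda e: e[0])
    payload = b"".join(
        mode + b" " + name + b"\0" + oid for _, mode, name, oid in entries
    )
    return _git_object_id("tree", payload)


def swhid_dir(root: str) -> str:
    """Format the directory identifier as a SWHID string."""
    return f"swh:1:dir:{directory_id(root).hex()}"
```

Because the identifier depends only on entry names, modes, and contents, two archives in different formats should yield the same value whenever their extracted trees are identical, which is the property the proposed check would rely on.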
   
   ### Where new code would go
   - `atr/tasks/checks/swhid.py` — new file
     A new check module following the pattern of 
targz.py/zipformat.py/compare.py that computes SWHID dir identifiers for 
extracted archives and compares them across formats within the same release.
   - `atr/swhid.py` — new file
     A thin wrapper module that imports from the external `asfswhid` package 
and provides ATR-specific helpers for computing directory SWHIDs from extracted 
archive paths.
   
   ### Proposed approach
   The integration cannot proceed until the external dependency chain is 
resolved: (1) the upstream `swhid-rs` crate merges the gitoxide PR 
(https://github.com/swhid/swhid-rs/pull/44), (2) the Python wrapper 
(`swhid-py`) is moved to the Apache org as `tooling-swhid-py` and published as 
`asfswhid`, and (3) the LEGAL-728 review confirms licensing is clear.
   
   Once the package is available, integration into ATR would involve: adding 
`asfswhid` as a dependency, creating a new check module (e.g., 
`atr/tasks/checks/swhid.py`) that computes SWHID `dir` identifiers on extracted 
archive trees using `resolve_archive_dir`, and then comparing identifiers 
across archives within the same release revision to verify cross-format 
equivalence. The check would follow the established `Recorder` pattern used by 
all other checks. The SWHID value itself could also be stored in the attestable 
data for downstream consumption.
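
The cross-archive comparison step could be sketched as below. This assumes the check has already collected one `dir` SWHID string per archive in a revision. The helper names `group_by_swhid` and `archives_agree` are hypothetical, and reporting through the `Recorder` is deliberately left out because its method signatures are not shown above.

```python
from collections import defaultdict


def group_by_swhid(swhids: dict[str, str]) -> dict[str, list[str]]:
    """Group archive names by their directory SWHID.

    `swhids` maps an archive's relative path to its `swh:1:dir:...` string.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for archive_name, swhid in swhids.items():
        groups[swhid].append(archive_name)
    return {swhid: sorted(names) for swhid, names in groups.items()}


def archives_agree(swhids: dict[str, str]) -> bool:
    """True when every archive in the revision has the same directory SWHID."""
    return len(set(swhids.values())) <= 1
```

When `archives_agree` returns false, the groups from `group_by_swhid` show which archives share a tree and which diverge. That may be more useful in a check result than a bare pass or fail.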
   
   ### Open questions
   - When will the upstream swhid-rs maintainer merge the gitoxide PR 
(https://github.com/swhid/swhid-rs/pull/44)?
   - What is the status of LEGAL-728 regarding licensing clearance for the Rust 
crate dependency chain?
   - Has the `tooling-swhid-py` repo been created under the Apache GitHub org 
yet?
   - Should SWHID identifiers be stored in the attestable data model for 
downstream consumers, or only used for internal cross-archive comparison checks?
   - How should the check handle archives where contents legitimately differ 
(e.g., line ending differences due to .gitattributes)?
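
The line-ending question matters because SWHIDs are computed over exact bytes. A small sketch, using only the standard library and a hypothetical helper name `content_swhid`, shows that the same text with LF and CRLF endings gets different `cnt` identifiers. Any directory containing such a file would then get a different `dir` identifier as well.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """SWHID for file contents, computed like a git blob id."""
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


lf = content_swhid(b"hello\n")
crlf = content_swhid(b"hello\r\n")
print(lf == crlf)  # prints False: one byte of difference changes the identifier
```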
   
   _The agent reviewed this issue and is not proposing patches in this run. 
Review the existing-code citations and open questions above before deciding 
next steps._
   
   ### Files examined
   - `atr/hashes.py`
   - `atr/tasks/checks/compare.py`
   - `atr/archives.py`
   - `atr/tasks/checks/__init__.py`
   - `atr/tasks/checks/targz.py`
   - `atr/tasks/checks/zipformat.py`
   - `atr/tasks/checks/hashing.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
