asf-tooling commented on issue #1154:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4403672932

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@837830e8`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `high`
   **Application domain(s):** `compliance_verification`
   
   ### Summary
   This issue proposes adding SWHID (SoftWare Hash IDentifier) computation 
to ATR for cross-format archive comparison and git-to-archive verification. The 
team has already decided to wrap the Rust reference implementation (`swhid-rs`) 
in Python, and @andrewmusselman is building the wrapper at 
github.com/andrewmusselman/swhid-py. @dave2wave confirmed that it will be hosted 
at `tooling-swhid-py` and released as `asfswhid`. The integration is currently 
BLOCKED on an upstream merge (swhid/swhid-rs#44, which adds MIT-licensed 
gitoxide support). @andrewmusselman re-prompted the upstream maintainer on 
April 28. No ATR-internal code changes have been made yet.
   
   ### Where this lives in the code today
   
   #### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-95)
   _extension point_
   Existing check that compares the source archive against a GitHub checkout 
using rsync; SWHID `dir` identifiers could supplement or replace this comparison 
with a format-agnostic Merkle hash.
   
   ```python
   async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
       recorder = await args.recorder(CHECK_VERSION)
       is_source = await recorder.primary_path_is_source()
       if not is_source:
           log.info(
               "Skipping compare.source_trees because the input is not a source artifact",
               ...
           )
           return None
   
       payload = await attestable.github_tp_payload_read(args.project_key, args.version_key, args.revision_number)
       ...
               comparison = await _compare_trees(github_dir, archive_content_dir)
   ```
   
   #### `atr/tasks/checks/targz.py` — `structure` (lines 76-90)
   _extension point_
   After the tar.gz structure has been extracted and validated, SWHID 
computation over the root directory tree could be added as a follow-up step or 
companion check.
   
   ```python
   async def structure(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
       """Check the structure of a .tar.gz file using the extracted tree."""
       recorder = await args.recorder(CHECK_VERSION_STRUCTURE)
       if not (artifact_abs_path := await recorder.abs_path()):
           return None
       if not await recorder.primary_path_is_source():
           return None
   
       archive_dir = await checks.resolve_archive_dir(args)
       if archive_dir is None:
           await recorder.failure(
               "Extracted archive tree is not available",
               {"rel_path": args.primary_rel_path},
           )
           return None
   ```
   
   #### `atr/tasks/checks/zipformat.py` — `structure` (lines 35-44)
   _extension point_
   Zip structure check that extracts and validates archive root; SWHID 
computation over the extracted directory could enable cross-format comparison 
with tar.gz of the same release.
   
   ```python
   async def structure(args: checks.FunctionArguments) -> results.Results | None:
       """Check that the zip archive has a single root directory matching the artifact name."""
       ...
       archive_dir = await checks.resolve_archive_dir(args)
       if archive_dir is None:
           await recorder.failure(
               "Extracted archive tree is not available",
               {"rel_path": args.primary_rel_path},
           )
           return None
   ```
   
   #### `atr/hashes.py` — `compute_file_hash` (lines 36-42)
   _currently does this_
   Pattern for hash computation in ATR. A new SWHID directory hash function 
would follow a similar pattern but operate on directory trees rather than 
individual files.
   
   ```python
   async def compute_file_hash(path: str | os.PathLike) -> str:
       path = pathlib.Path(path)
       hasher = blake3.blake3()
       async with aiofiles.open(path, "rb") as f:
           while chunk := await f.read(_HASH_CHUNK_SIZE):
               hasher.update(chunk)
       return f"blake3:{hasher.hexdigest()}"
   ```
   
   #### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-352)
   _extension point_
   Utility to get the extracted archive directory; a SWHID check would use this 
to locate the extracted tree for Merkle hash computation.
   
   ```python
   async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
       """Resolve the extracted archive directory for the primary archive."""
       if args.primary_rel_path is None:
           return None
       release_key = sql.release_key(str(args.project_key), str(args.version_key))
       revision_seq = int(str(args.revision_number))
       async with db.session() as data:
           content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
       ...
       if await aiofiles.os.path.isdir(archive_dir):
           return archive_dir
       return None
   ```
   
   ### Where new code would go
   - `atr/tasks/checks/swhid.py` — new file
     New check module that computes SWHID dir identifiers over extracted 
archive trees, following the pattern of targz.py and zipformat.py. Would depend 
on the external `asfswhid` package once it's published.
   
   ### Proposed approach
   Once the external `asfswhid` package (wrapping swhid-rs via PyO3) is 
published, integration into ATR would involve:
   1. Adding `asfswhid` as a dependency.
   2. Creating a new check module `atr/tasks/checks/swhid.py` that computes the 
SWHID `dir` identifier over extracted archive trees (using 
`checks.resolve_archive_dir()` to get the extracted path).
   3. Storing the computed SWHID in the check result data.
   4. For cross-format comparison, querying other archives in the same revision 
and comparing their SWHID `dir` identifiers.
   5. For git-to-archive comparison, reusing the existing `compare.source_trees` 
infrastructure (which already clones and checks out the repo) to compute the 
SWHID of the checkout tree and compare it with the archive's SWHID.
   
   However, NO internal ATR changes should be made yet. The blocking dependency 
chain is: upstream swhid-rs#44 merge → swhid-py wrapper completion → 
publication as `asfswhid` on PyPI → ATR integration. @andrewmusselman is 
driving the external work and last prompted the upstream maintainer on April 28.
   
   ### Open questions
   - When will the upstream swhid-rs PR #44 (gitoxide support) be merged?
   - Has the LEGAL-728 JIRA been resolved regarding the licensing of the 
wrapped crate?
   - Should the SWHID check be a standalone check module or integrated into the 
existing compare.source_trees check?
   - Should SWHID identifiers be stored in attestable data alongside existing 
BLAKE3/SHA-512 hashes, and if so, what schema changes are needed?
   - Should cross-format SWHID comparison (tar.gz vs zip) be automatic when 
multiple source archives share a version, or opt-in via release policy?
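
Whichever way the last question is decided, the comparison itself is small. 
The sketch below assumes the SWHIDs have already been computed and collected 
into a mapping from archive path to identifier; the function names and the 
sample identifiers are hypothetical placeholders, not ATR data.

```python
from collections import defaultdict


def group_by_swhid(archive_ids: dict[str, str]) -> dict[str, list[str]]:
    """Group archive paths by their SWHID `dir` identifier."""
    groups: dict[str, list[str]] = defaultdict(list)
    for rel_path, swhid in archive_ids.items():
        groups[swhid].append(rel_path)
    return {swhid: sorted(paths) for swhid, paths in groups.items()}


def archives_agree(archive_ids: dict[str, str]) -> bool:
    """True when every archive in the revision has the same identifier."""
    return len(set(archive_ids.values())) <= 1


# Placeholder identifiers for illustration only.
example = {
    "example-1.0-src.tar.gz": "swh:1:dir:" + "a" * 40,
    "example-1.0-src.zip": "swh:1:dir:" + "a" * 40,
    "example-1.0-src.tar.xz": "swh:1:dir:" + "b" * 40,
}
mismatch_groups = group_by_swhid(example)
```

Here `archives_agree(example)` is false, and `mismatch_groups` shows which 
archives share a tree and which one differs, which is the detail a check result 
would need to report.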
   
   ### Files examined
   - `atr/hashes.py`
   - `atr/tasks/checks/compare.py`
   - `atr/archives.py`
   - `atr/tasks/checks/__init__.py`
   - `atr/tasks/checks/targz.py`
   - `atr/tasks/checks/zipformat.py`
   - `atr/tasks/checks/hashing.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

