asf-tooling commented on issue #1154:
URL:
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4403672932
**Automated triage** — analyzed at `main@837830e8`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `high`
**Application domain(s):** `compliance_verification`
### Summary
This issue proposes adding SWHID (SoftWare Hash IDentifier) computation to
ATR for cross-format archive comparison and git-to-archive verification. The
team has already decided to wrap the Rust reference implementation (`swhid-rs`)
in Python, with @andrewmusselman building the wrapper at
github.com/andrewmusselman/swhid-py. @dave2wave confirmed it will be hosted at
`tooling-swhid-py` and released as `asfswhid`. The integration is currently
BLOCKED on an upstream merge (swhid/swhid-rs#44, which adds MIT-licensed
gitoxide support). As of April 28, @andrewmusselman noted that the upstream
maintainer had been re-prompted. No ATR-internal code changes have been made yet.
### Where this lives in the code today
#### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-95)
_extension point_
Existing check that compares a source archive against a GitHub checkout using
rsync; SWHID `dir` identifiers could supplement or replace this comparison with
a format-agnostic Merkle hash.
```python
async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
    recorder = await args.recorder(CHECK_VERSION)
    is_source = await recorder.primary_path_is_source()
    if not is_source:
        log.info(
            "Skipping compare.source_trees because the input is not a source artifact",
            ...
        )
        return None
    payload = await attestable.github_tp_payload_read(args.project_key, args.version_key, args.revision_number)
    ...
    comparison = await _compare_trees(github_dir, archive_content_dir)
```
#### `atr/tasks/checks/targz.py` — `structure` (lines 76-90)
_extension point_
After extracting and validating tar.gz structure, SWHID computation over the
root directory tree could be added as a follow-up step or companion check.
```python
async def structure(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
    """Check the structure of a .tar.gz file using the extracted tree."""
    recorder = await args.recorder(CHECK_VERSION_STRUCTURE)
    if not (artifact_abs_path := await recorder.abs_path()):
        return None
    if not await recorder.primary_path_is_source():
        return None
    archive_dir = await checks.resolve_archive_dir(args)
    if archive_dir is None:
        await recorder.failure(
            "Extracted archive tree is not available",
            {"rel_path": args.primary_rel_path},
        )
        return None
```
#### `atr/tasks/checks/zipformat.py` — `structure` (lines 35-44)
_extension point_
Zip structure check that extracts and validates the archive root; SWHID
computation over the extracted directory could enable cross-format comparison
with the tar.gz of the same release.
```python
async def structure(args: checks.FunctionArguments) -> results.Results | None:
    """Check that the zip archive has a single root directory matching the artifact name."""
    ...
    archive_dir = await checks.resolve_archive_dir(args)
    if archive_dir is None:
        await recorder.failure(
            "Extracted archive tree is not available",
            {"rel_path": args.primary_rel_path},
        )
        return None
```
#### `atr/hashes.py` — `compute_file_hash` (lines 36-42)
_currently does this_
Pattern for hash computation in ATR. A new SWHID directory hash function
would follow a similar pattern but operate on directory trees rather than
individual files.
```python
async def compute_file_hash(path: str | os.PathLike) -> str:
    path = pathlib.Path(path)
    hasher = blake3.blake3()
    async with aiofiles.open(path, "rb") as f:
        while chunk := await f.read(_HASH_CHUNK_SIZE):
            hasher.update(chunk)
    return f"blake3:{hasher.hexdigest()}"
```
#### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-352)
_extension point_
Utility to get the extracted archive directory; a SWHID check would use this
to locate the extracted tree for Merkle hash computation.
```python
async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
    """Resolve the extracted archive directory for the primary archive."""
    if args.primary_rel_path is None:
        return None
    release_key = sql.release_key(str(args.project_key), str(args.version_key))
    revision_seq = int(str(args.revision_number))
    async with db.session() as data:
        content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
    ...
    if await aiofiles.os.path.isdir(archive_dir):
        return archive_dir
    return None
```
### Where new code would go
- `atr/tasks/checks/swhid.py` — new file
New check module that computes SWHID dir identifiers over extracted
archive trees, following the pattern of targz.py and zipformat.py. Would depend
on the external `asfswhid` package once it's published.
### Proposed approach
Once the external `asfswhid` package (wrapping swhid-rs via PyO3) is
published, integration into ATR would involve:
1. Adding `asfswhid` as a dependency.
2. Creating a new check module `atr/tasks/checks/swhid.py` that computes the
   SWHID `dir` identifier over extracted archive trees (using
   `checks.resolve_archive_dir()` to get the extracted path).
3. Storing the computed SWHID in the check result data.
4. For cross-format comparison, having the check query other archives in the
   same revision and compare their SWHID `dir` identifiers.
5. For git-to-archive comparison, using the existing `compare.source_trees`
   infrastructure (which already clones and checks out the repo) to compute the
   SWHID of the checkout tree and compare it with the archive's SWHID.

However, NO internal ATR changes should be made yet. The blocking dependency
chain is: upstream swhid-rs#44 merge → swhid-py wrapper completion →
publication as `asfswhid` on PyPI → ATR integration. @andrewmusselman is
driving the external work and last prompted the upstream maintainer on April 28.
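The comparison itself could be as small as the sketch below, which groups
archives (or a git checkout) by their `dir` identifier and reports whether
they all agree. The function names and the input mapping are hypothetical;
the regular expression assumes the SWHID v1 core form `swh:1:dir:` followed by
40 lowercase hex digits, without qualifiers.

```python
import re
from collections import defaultdict

# Core SWHID v1 directory identifier, without qualifiers.
_SWHID_DIR = re.compile(r"swh:1:dir:[0-9a-f]{40}")


def group_by_swhid(trees: dict[str, str]) -> dict[str, list[str]]:
    """Map each SWHID to the sorted names of the trees that produced it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, swhid in trees.items():
        if _SWHID_DIR.fullmatch(swhid) is None:
            raise ValueError(f"not a SWHID dir identifier for {name}: {swhid!r}")
        groups[swhid].append(name)
    return {swhid: sorted(names) for swhid, names in groups.items()}


def trees_match(trees: dict[str, str]) -> bool:
    """True when every tree hashed to the same directory identifier."""
    return len(group_by_swhid(trees)) <= 1
```

A mismatch would show up as more than one group, which also tells a reviewer
which artifacts disagree with which.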
### Open questions
- When will the upstream swhid-rs PR #44 (gitoxide support) be merged?
- Has the LEGAL-728 JIRA been resolved regarding the licensing of the
wrapped crate?
- Should the SWHID check be a standalone check module or integrated into the
existing compare.source_trees check?
- Should SWHID identifiers be stored in attestable data alongside existing
BLAKE3/SHA-512 hashes, and if so, what schema changes are needed?
- Should cross-format SWHID comparison (tar.gz vs zip) be automatic when
multiple source archives share a version, or opt-in via release policy?
### Files examined
- `atr/hashes.py`
- `atr/tasks/checks/compare.py`
- `atr/archives.py`
- `atr/tasks/checks/__init__.py`
- `atr/tasks/checks/targz.py`
- `atr/tasks/checks/zipformat.py`
- `atr/tasks/checks/hashing.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]