asf-tooling commented on issue #1154:
URL:
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4404147270
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@837830e8`
**Type:** `new_feature` • **Classification:** `no_action` •
**Confidence:** `high`
**Application domain(s):** `compliance_verification`
### Summary
This issue requests adding SWHID (Software Heritage Identifiers) as a new
verification mechanism for release archives. The team has already decided on
the implementation approach: wrapping the Rust reference implementation
(`swhid-rs`) in a Python package (`asfswhid`). @andrewmusselman built the
Python wrapper at github.com/andrewmusselman/swhid-py and contributed gitoxide
(MIT-licensed) support upstream to avoid a dependency on GPL2-licensed libgit2.
Work is currently blocked until the upstream `swhid-rs` maintainer merges the
gitoxide PR. No ATR-internal changes can proceed until the external package is
available.
### Where this lives in the code today
#### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-94)
_currently does this_
Existing source tree comparison check that validates that archive content
matches a GitHub checkout; SWHID would provide a complementary cross-format
comparison without needing rsync-based diffing.
```python
async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
    recorder = await args.recorder(CHECK_VERSION)
    is_source = await recorder.primary_path_is_source()
    if not is_source:
        log.info(
            "Skipping compare.source_trees because the input is not a source artifact",
            project=args.project_key,
            version=args.version_key,
            revision=args.revision_number,
            path=args.primary_rel_path,
        )
        return None
```
#### `atr/archives.py` — `extract` (lines 36-66)
_extension point_
Archive extraction infrastructure that SWHID computation would leverage —
after extraction, the directory tree can be walked to compute the SWHID dir
identifier.
```python
def extract(
    archive_path: safe.StatePath,
    extract_dir: str,
    max_size: int,
    chunk_size: int = 4096,
    track_files: bool | set[str] = False,
) -> tuple[int, list[str]]:
    # chunk_size retained for signature stability; exarch manages buffering internally.
    del chunk_size
    log.info(f"Extracting {archive_path} to {extract_dir}")
    cfg = _build_extraction_config(max_size)
    extracted_paths: list[str] = []
    if isinstance(track_files, set) and track_files:
        try:
            manifest = exarch.list_archive(str(archive_path), cfg)
        except Exception as exc:
            raise ExtractionError(f"Failed to list archive: {exc}") from exc
        for entry in manifest.entries:
            if os.path.basename(entry.path) in track_files:
                extracted_paths.append(entry.path)
    try:
        report = exarch.extract_archive(str(archive_path), extract_dir, cfg)
    except exarch.QuotaExceededError as exc:
        raise ExtractionError(f"Extraction exceeded size limit of {max_size} bytes: {exc}") from exc
    except Exception as exc:
        raise ExtractionError(f"Failed to extract archive: {exc}") from exc
    return report.bytes_written, extracted_paths
```
#### `atr/tasks/checks/__init__.py` — `FunctionArguments` (lines 47-55)
_extension point_
The standard check function arguments dataclass — a new SWHID check would
receive these same arguments and use the recorder pattern.
```python
@dataclasses.dataclass
class FunctionArguments:
    recorder: Callable[[str | None], Awaitable[Recorder]]
    asf_uid: str
    project_key: safe.ProjectKey
    version_key: safe.VersionKey
    revision_number: safe.RevisionNumber
    primary_rel_path: safe.RelPath | None
    extra_args: dict[str, Any]
```
#### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-361)
_extension point_
Utility to resolve extracted archive directories — a SWHID check would use
this to get the extracted tree for computing the directory identifier.
```python
async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
    """Resolve the extracted archive directory for the primary archive."""
    if args.primary_rel_path is None:
        return None
    release_key = sql.release_key(str(args.project_key), str(args.version_key))
    revision_seq = int(str(args.revision_number))
    async with db.session() as data:
        content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
    if content_hash is None:
        abs_path = file_paths.revision_path_for_file(
            args.project_key, args.version_key, args.revision_number, str(args.primary_rel_path)
        )
        if await aiofiles.os.path.isfile(abs_path):
            content_hash = await hashes.compute_file_hash(abs_path)
    if content_hash is None:
        return None
    archive_key = hashes.filesystem_archives_key(content_hash)
    archive_dir = file_paths.get_archives_dir() / str(args.project_key) / str(args.version_key) / archive_key
    if await aiofiles.os.path.isdir(archive_dir):
        return archive_dir
    return None
```
#### `atr/hashes.py` — `compute_file_hash` (lines 36-42)
_currently does this_
Current file hashing approach — SWHID computation would be a new hash
function (Merkle tree over directory structure) that might live alongside these
or in a dedicated module.
```python
async def compute_file_hash(path: str | os.PathLike) -> str:
    path = pathlib.Path(path)
    hasher = blake3.blake3()
    async with aiofiles.open(path, "rb") as f:
        while chunk := await f.read(_HASH_CHUNK_SIZE):
            hasher.update(chunk)
    return f"blake3:{hasher.hexdigest()}"
```
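
To make the "Merkle tree over directory structure" idea concrete, here is a minimal pure-Python sketch of a SWHID `dir` computation. It is an illustration only, not the `asfswhid` API (which is not shown in this issue) and not a substitute for `swhid-rs`. It assumes the Git-compatible object layout that SWHIDs use for contents and directories, and the function names `content_id`, `directory_id`, and `directory_swhid` are hypothetical.

```python
import hashlib
import os
import stat


def _git_object_id(kind: str, payload: bytes) -> bytes:
    """Return the raw SHA-1 of a Git-style '<kind> <len>\\0' header plus payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()


def content_id(data: bytes) -> bytes:
    """Identifier of a file's bytes (a Git 'blob' object)."""
    return _git_object_id("blob", data)


def directory_id(root: str) -> bytes:
    """Identifier of a directory: a hash over its sorted (mode, name, child id) entries."""
    entries = []
    with os.scandir(root) as it:
        for entry in it:
            name = os.fsencode(entry.name)
            st = entry.stat(follow_symlinks=False)
            if stat.S_ISLNK(st.st_mode):
                # A symlink is hashed as a blob containing its target path.
                mode = b"120000"
                oid = content_id(os.fsencode(os.readlink(entry.path)))
            elif stat.S_ISDIR(st.st_mode):
                mode = b"40000"
                oid = directory_id(entry.path)
            elif stat.S_ISREG(st.st_mode):
                mode = b"100755" if st.st_mode & stat.S_IXUSR else b"100644"
                with open(entry.path, "rb") as f:
                    oid = content_id(f.read())
            else:
                continue  # sockets, FIFOs, and devices are skipped in this sketch
            # Git tree ordering: directories sort as if their name ended in "/".
            sort_key = name + b"/" if mode == b"40000" else name
            entries.append((sort_key, mode, name, oid))
    entries.sort(key=lambda e: e[0])
    payload = b"".join(mode + b" " + name + b"\0" + oid for _, mode, name, oid in entries)
    return _git_object_id("tree", payload)


def directory_swhid(root: str) -> str:
    """Format the directory identifier as a core SWHID string."""
    return f"swh:1:dir:{directory_id(root).hex()}"
```

Because each directory's identifier depends on the identifiers of its children, a change to any file propagates to the root. The sketch reads whole files into memory and trusts the file modes on disk, both of which a production check would likely need to revisit.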
### Where new code would go
- `atr/tasks/checks/swhid.py` — new file
  A new check module following the pattern of `targz.py`, `zipformat.py`, and
  `compare.py` that computes SWHID `dir` identifiers for extracted archives and
  compares them across formats within the same release.
- `atr/swhid.py` — new file
  A thin wrapper module that imports from the external `asfswhid` package
  and provides ATR-specific helpers for computing directory SWHIDs from
  extracted archive paths.
### Proposed approach
The integration cannot proceed until the external dependency chain is
resolved: (1) the upstream `swhid-rs` crate merges the gitoxide PR
(https://github.com/swhid/swhid-rs/pull/44), (2) the Python wrapper
(`swhid-py`) is moved to the Apache org as `tooling-swhid-py` and published as
`asfswhid`, and (3) the LEGAL-728 review confirms licensing is clear.
Once the package is available, integration into ATR would involve: adding
`asfswhid` as a dependency, creating a new check module (e.g.,
`atr/tasks/checks/swhid.py`) that computes SWHID `dir` identifiers on extracted
archive trees using `resolve_archive_dir`, and then comparing identifiers
across archives within the same release revision to verify cross-format
equivalence. The check would follow the established `Recorder` pattern used by
all other checks. The SWHID value itself could also be stored in the attestable
data for downstream consumption.
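
The cross-format comparison step could look roughly like the sketch below. It assumes the SWHID strings have already been computed for each archive in a release revision (by `asfswhid` or otherwise) and only shows the grouping logic. The function names and the example archive names are hypothetical, and the integration with `Recorder` is deliberately left out because its reporting API is not shown here.

```python
from collections import defaultdict


def group_by_swhid(swhids_by_archive: dict[str, str]) -> dict[str, list[str]]:
    """Group archive paths by the directory SWHID of their extracted tree."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path, swhid in sorted(swhids_by_archive.items()):
        groups[swhid].append(path)
    return dict(groups)


def archives_equivalent(swhids_by_archive: dict[str, str]) -> bool:
    """True when every archive in the revision maps to the same directory SWHID."""
    return len(group_by_swhid(swhids_by_archive)) <= 1


# Hypothetical inputs: two formats of the same source release.
example = {
    "apache-example-1.0-src.tar.gz": "swh:1:dir:" + "a" * 40,
    "apache-example-1.0-src.zip": "swh:1:dir:" + "a" * 40,
}
print(archives_equivalent(example))
```

Returning the groups rather than a bare boolean lets a check report which archives disagree. Note that a directory identifier covers the names of its entries, so the comparison is only meaningful if every archive is hashed from the same level of its extracted tree.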
### Open questions
- When will the upstream swhid-rs maintainer merge the gitoxide PR
(https://github.com/swhid/swhid-rs/pull/44)?
- What is the status of LEGAL-728 regarding licensing clearance for the Rust
crate dependency chain?
- Has the `tooling-swhid-py` repo been created under the Apache GitHub org
yet?
- Should SWHID identifiers be stored in the attestable data model for
downstream consumers, or only used for internal cross-archive comparison checks?
- How should the check handle archives where contents legitimately differ
(e.g., line ending differences due to .gitattributes)?
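
The line-ending question can be illustrated with a small sketch. It assumes the Git-compatible blob hashing that SWHID `cnt` identifiers use, and the helper `differs_only_by_line_endings` is a hypothetical diagnostic, not a proposed policy: it shows that converting LF to CRLF changes a file's identifier, and therefore the identifier of every directory above it.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """SWHID for file contents, hashed like a Git blob."""
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def differs_only_by_line_endings(a: bytes, b: bytes) -> bool:
    """True when two byte strings differ, but match once CRLF is mapped to LF."""
    return a != b and a.replace(b"\r\n", b"\n") == b.replace(b"\r\n", b"\n")


lf_text = b"line one\nline two\n"
crlf_text = lf_text.replace(b"\n", b"\r\n")

# The identifiers differ even though the text is logically the same.
print(content_swhid(lf_text) == content_swhid(crlf_text))
# A diagnostic like this could let a check explain the mismatch.
print(differs_only_by_line_endings(lf_text, crlf_text))
```

A hash computed over normalized bytes would no longer be a standard SWHID, so a diagnostic of this kind is better suited to explaining a mismatch than to hiding it.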
_The agent reviewed this issue and is not proposing patches in this run.
Review the existing-code citations and open questions above before deciding
next steps._
### Files examined
- `atr/hashes.py`
- `atr/tasks/checks/compare.py`
- `atr/archives.py`
- `atr/tasks/checks/__init__.py`
- `atr/tasks/checks/targz.py`
- `atr/tasks/checks/zipformat.py`
- `atr/tasks/checks/hashing.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]