asf-tooling commented on issue #1154: URL: https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4404059690
<!-- gofannon-issue-triage-bot v2 -->

**Automated triage** — analyzed at `main@837830e8`

**Type:** `new_feature` • **Classification:** `actionable` • **Confidence:** `medium`

**Application domain(s):** `compliance_verification`

### Summary

This issue requests adding SWHID (Software Heritage IDentifier) computation for release archives to enable cross-format comparison (tar.gz vs zip) and git-to-archive verification. The team has already decided to wrap the Rust reference implementation (`swhid-rs`) rather than use GPL-licensed `swh.model` or the third-party `miniswhid`. @andrewmusselman is developing the Python wrapper at https://github.com/andrewmusselman/swhid-py (to be published as `asfswhid` under `tooling-swhid-py`). Work is currently blocked waiting on the upstream `swhid-rs` maintainer to merge the `gitoxide` (MIT-licensed) addition (PR https://github.com/swhid/swhid-rs/pull/44). No integration work can proceed in this repo until the external package is published.

### Where this lives in the code today

#### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-94)

_currently does this_

Existing check that compares source archive content against GitHub checkout — SWHID could complement or extend this comparison with a format-agnostic Merkle hash.

```python
async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
    recorder = await args.recorder(CHECK_VERSION)
    is_source = await recorder.primary_path_is_source()
    if not is_source:
        log.info(
            "Skipping compare.source_trees because the input is not a source artifact",
            project=args.project_key,
            version=args.version_key,
            revision=args.revision_number,
            path=args.primary_rel_path,
        )
        return None
```

#### `atr/tasks/checks/__init__.py` — `FunctionArguments` (lines 47-55)

_extension point_

Standard argument dataclass that new SWHID check functions would receive, following the same pattern as other checks.
```python
@dataclasses.dataclass
class FunctionArguments:
    recorder: Callable[[str | None], Awaitable[Recorder]]
    asf_uid: str
    project_key: safe.ProjectKey
    version_key: safe.VersionKey
    revision_number: safe.RevisionNumber
    primary_rel_path: safe.RelPath | None
    extra_args: dict[str, Any]
```

#### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-361)

_extension point_

Utility to locate extracted archive directories — a SWHID check would use this to access the extracted tree for computing directory identifiers.

```python
async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
    """Resolve the extracted archive directory for the primary archive."""
    if args.primary_rel_path is None:
        return None
    release_key = sql.release_key(str(args.project_key), str(args.version_key))
    revision_seq = int(str(args.revision_number))
    async with db.session() as data:
        content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
    if content_hash is None:
        abs_path = file_paths.revision_path_for_file(
            args.project_key, args.version_key, args.revision_number, str(args.primary_rel_path)
        )
        if await aiofiles.os.path.isfile(abs_path):
            content_hash = await hashes.compute_file_hash(abs_path)
    if content_hash is None:
        return None
    archive_key = hashes.filesystem_archives_key(content_hash)
    archive_dir = file_paths.get_archives_dir() / str(args.project_key) / str(args.version_key) / archive_key
    if await aiofiles.os.path.isdir(archive_dir):
        return archive_dir
    return None
```

#### `atr/hashes.py` — `compute_file_hash` (lines 36-42)

_currently does this_

Existing hash infrastructure pattern — SWHID computation would follow a similar async streaming approach but delegate to the external `asfswhid` package for the Merkle tree algorithm.
```python
async def compute_file_hash(path: str | os.PathLike) -> str:
    path = pathlib.Path(path)
    hasher = blake3.blake3()
    async with aiofiles.open(path, "rb") as f:
        while chunk := await f.read(_HASH_CHUNK_SIZE):
            hasher.update(chunk)
    return f"blake3:{hasher.hexdigest()}"
```

#### `atr/tasks/checks/targz.py` — `root_directory` (lines 42-73)

_extension point_

Identifies the root directory inside an extracted archive — SWHID computation would need this same root-finding logic to compute the dir identifier over the correct subtree.

```python
def root_directory(archive_dir: safe.StatePath) -> tuple[str, bytes | None]:
    """Find root directory and read package/package.json from the extracted tree."""
    # The ._ prefix is a metadata convention
    entries = sorted(e for e in os.listdir(archive_dir) if not e.startswith("._"))
    if not entries:
        raise RootDirectoryError("No root directory found in archive")
    if len(entries) > 1:
        raise RootDirectoryError(f"Multiple root directories found: {entries[0]}, {entries[1]}")

    root = entries[0]
    root_path = archive_dir / root
    try:
        root_stat = root_path.path.lstat()
    except OSError as e:
        raise RootDirectoryError(f"Unable to inspect root entry '{root}': {e}") from e
    if not stat.S_ISDIR(root_stat.st_mode):
        raise RootDirectoryError(f"Root entry is not a directory: {root}")

    package_json: bytes | None = None
    if root == "package":
        package_json_path = archive_dir / "package" / "package.json"
        with contextlib.suppress(FileNotFoundError, OSError):
            package_json_stat = package_json_path.path.lstat()
            # We do this to avoid allowing package.json to be a symlink
            if stat.S_ISREG(package_json_stat.st_mode):
                size = package_json_stat.st_size
                if (size > 0) and (size <= util.NPM_PACKAGE_JSON_MAX_SIZE):
                    package_json = package_json_path.path.read_bytes()

    return root, package_json
```

### Where new code would go

- `atr/tasks/checks/swhid.py` — new file
  New check module that computes SWHID dir identifiers for extracted archive trees, following the same pattern as targz.py and
zipformat.py. Would depend on the `asfswhid` package once published.

### Proposed approach

The integration into ATR will happen in two phases. First, the external `asfswhid` package (wrapping `swhid-rs` with `gitoxide` for MIT-clean licensing) must be published — this is blocked on upstream merge of https://github.com/swhid/swhid-rs/pull/44. Once available, a new check module (`atr/tasks/checks/swhid.py`) should be created that: (1) resolves the extracted archive directory via `checks.resolve_archive_dir`, (2) finds the root directory inside it (similar to `targz.root_directory`), (3) computes the SWHID `dir` identifier over that tree using the `asfswhid` library, and (4) records the result.

For cross-format comparison, the check could look up other archives in the same release/revision and compare their SWHID `dir` identifiers. The result would be stored as check data, making it visible to voters.

Since the external dependency doesn't exist yet as a published package, no concrete diff should be proposed at this time. The team should track the upstream merge and `tooling-swhid-py` publication, then integrate once available.

### Open questions

- When will the upstream swhid-rs maintainer merge PR #44 (gitoxide support)?
- Has LEGAL-728 been resolved confirming the licensing is acceptable for ASF use?
- Should the SWHID dir identifier be stored as a CheckResult data field, or should it be persisted separately in the attestable data model for long-term reference?
- Should cross-format comparison (tar.gz vs zip SWHID match) be a separate check function or part of a single SWHID check that examines all archives in a revision?
- Will the asfswhid package support computing directory identifiers from an already-extracted filesystem tree, or only from archive files directly?
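
### Illustrative sketch: directory identifiers over an extracted tree

To make steps (3) and the cross-format comparison from the proposed approach concrete, here is a minimal pure-Python sketch. It is a stand-in for illustration only and is not the `asfswhid` API, which is unpublished and whose interface is not known here. It relies on SWHID `dir` identifiers being designed to be compatible with Git tree hashes. The names `swhid_dir` and `group_by_swhid` are hypothetical, and the treatment of executable bits, empty directories, and special files is an assumption that would need checking against `swhid-rs`.

```python
import hashlib
import os
import stat
from pathlib import Path


def _git_object_sha1(kind: str, payload: bytes) -> bytes:
    """Hash a payload the way Git hashes an object: '<kind> <len>\\0' + payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()


def _entry(path: Path) -> tuple[bytes, bytes, bytes]:
    """Return (mode, name, sha1) for one directory entry, never following symlinks."""
    st = path.lstat()
    name = os.fsencode(path.name)
    if stat.S_ISLNK(st.st_mode):
        # A symlink is hashed as a blob whose content is the link target
        target = os.fsencode(os.readlink(path))
        return b"120000", name, _git_object_sha1("blob", target)
    if stat.S_ISDIR(st.st_mode):
        return b"40000", name, _tree_sha1(path)
    if stat.S_ISREG(st.st_mode):
        # Assumption: the owner execute bit decides between 100755 and 100644
        mode = b"100755" if st.st_mode & 0o100 else b"100644"
        return mode, name, _git_object_sha1("blob", path.read_bytes())
    raise ValueError(f"Unsupported file type: {path}")


def _tree_sha1(directory: Path) -> bytes:
    """Compute the Git-style tree hash of a directory, recursively."""
    entries = [_entry(child) for child in directory.iterdir()]
    # Git orders tree entries by name, treating directory names as if they ended in '/'
    entries.sort(key=lambda e: e[1] + (b"/" if e[0] == b"40000" else b""))
    payload = b"".join(mode + b" " + name + b"\0" + sha for mode, name, sha in entries)
    return _git_object_sha1("tree", payload)


def swhid_dir(directory: Path) -> str:
    """Return a 'swh:1:dir:<hex>' identifier for an extracted directory tree."""
    return f"swh:1:dir:{_tree_sha1(directory).hex()}"


def group_by_swhid(roots: dict[str, Path]) -> dict[str, list[str]]:
    """Group archive names by the identifier of their extracted root directory."""
    groups: dict[str, list[str]] = {}
    for archive_name, root in roots.items():
        groups.setdefault(swhid_dir(root), []).append(archive_name)
    return groups
```

If `group_by_swhid` returns more than one key for archives that are supposed to carry the same source tree, the formats disagree. Because the identifier includes file modes, differences in how tar.gz and zip extraction preserve permissions may produce mismatches even when file contents are identical, which is worth considering when deciding how a real check should report results.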
### Files examined

- `atr/hashes.py`
- `atr/tasks/checks/compare.py`
- `atr/archives.py`
- `atr/tasks/checks/__init__.py`
- `atr/tasks/checks/targz.py`
- `atr/tasks/checks/zipformat.py`
- `atr/tasks/checks/hashing.py`

---

*Draft from a triage agent. A human reviewer should validate before merging any change. The agent did not run tests or verify diffs apply.*

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
