asf-tooling commented on issue #1154:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4404059690

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@837830e8`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `compliance_verification`
   
   ### Summary
   This issue requests adding SWHID (Software Heritage IDentifier) computation 
for release archives to enable cross-format comparison (tar.gz vs zip) and 
git-to-archive verification. The team has already decided to wrap the Rust 
reference implementation (`swhid-rs`) rather than use GPL-licensed `swh.model` 
or the third-party `miniswhid`. @andrewmusselman is developing the Python 
wrapper at https://github.com/andrewmusselman/swhid-py (to be published as 
`asfswhid` under `tooling-swhid-py`). Work is currently blocked waiting on the 
upstream `swhid-rs` maintainer to merge the `gitoxide` (MIT-licensed) addition 
(PR https://github.com/swhid/swhid-rs/pull/44). No integration work can proceed 
in this repo until the external package is published.
   
   ### Where this lives in the code today
   
   #### `atr/tasks/checks/compare.py` — `source_trees` (lines 83-94)
   _currently does this_
   Existing check that compares source archive content against GitHub checkout 
— SWHID could complement or extend this comparison with a format-agnostic 
Merkle hash.
   
   ```python
   async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
       recorder = await args.recorder(CHECK_VERSION)
       is_source = await recorder.primary_path_is_source()
       if not is_source:
           log.info(
               "Skipping compare.source_trees because the input is not a source artifact",
               project=args.project_key,
               version=args.version_key,
               revision=args.revision_number,
               path=args.primary_rel_path,
           )
           return None
   ```
   
   #### `atr/tasks/checks/__init__.py` — `FunctionArguments` (lines 47-55)
   _extension point_
   Standard argument dataclass that new SWHID check functions would receive, 
following the same pattern as other checks.
   
   ```python
   @dataclasses.dataclass
   class FunctionArguments:
       recorder: Callable[[str | None], Awaitable[Recorder]]
       asf_uid: str
       project_key: safe.ProjectKey
       version_key: safe.VersionKey
       revision_number: safe.RevisionNumber
       primary_rel_path: safe.RelPath | None
       extra_args: dict[str, Any]
   ```
   
   #### `atr/tasks/checks/__init__.py` — `resolve_archive_dir` (lines 341-361)
   _extension point_
   Utility to locate extracted archive directories — a SWHID check would use 
this to access the extracted tree for computing directory identifiers.
   
   ```python
   async def resolve_archive_dir(args: FunctionArguments) -> safe.StatePath | None:
       """Resolve the extracted archive directory for the primary archive."""
       if args.primary_rel_path is None:
           return None
       release_key = sql.release_key(str(args.project_key), str(args.version_key))
       revision_seq = int(str(args.revision_number))
       async with db.session() as data:
           content_hash = await data.release_file_hash_at(release_key, str(args.primary_rel_path), revision_seq)
       if content_hash is None:
           abs_path = file_paths.revision_path_for_file(
               args.project_key, args.version_key, args.revision_number, str(args.primary_rel_path)
           )
           if await aiofiles.os.path.isfile(abs_path):
               content_hash = await hashes.compute_file_hash(abs_path)
       if content_hash is None:
           return None
       archive_key = hashes.filesystem_archives_key(content_hash)
       archive_dir = file_paths.get_archives_dir() / str(args.project_key) / str(args.version_key) / archive_key
       if await aiofiles.os.path.isdir(archive_dir):
           return archive_dir
       return None
   ```
   
   #### `atr/hashes.py` — `compute_file_hash` (lines 36-42)
   _currently does this_
   Existing hash infrastructure pattern — SWHID computation would follow a 
similar async streaming approach but delegate to the external `asfswhid` 
package for the Merkle tree algorithm.
   
   ```python
   async def compute_file_hash(path: str | os.PathLike) -> str:
       path = pathlib.Path(path)
       hasher = blake3.blake3()
       async with aiofiles.open(path, "rb") as f:
           while chunk := await f.read(_HASH_CHUNK_SIZE):
               hasher.update(chunk)
       return f"blake3:{hasher.hexdigest()}"
   ```
   
   #### `atr/tasks/checks/targz.py` — `root_directory` (lines 42-73)
   _extension point_
   Identifies the root directory inside an extracted archive — SWHID 
computation would need this same root-finding logic to compute the dir 
identifier over the correct subtree.
   
   ```python
   def root_directory(archive_dir: safe.StatePath) -> tuple[str, bytes | None]:
       """Find root directory and read package/package.json from the extracted tree."""
       # The ._ prefix is a metadata convention
       entries = sorted(e for e in os.listdir(archive_dir) if not e.startswith("._"))
   
       if not entries:
           raise RootDirectoryError("No root directory found in archive")
       if len(entries) > 1:
           raise RootDirectoryError(f"Multiple root directories found: {entries[0]}, {entries[1]}")
   
       root = entries[0]
       root_path = archive_dir / root
       try:
           root_stat = root_path.path.lstat()
       except OSError as e:
           raise RootDirectoryError(f"Unable to inspect root entry '{root}': {e}") from e
       if not stat.S_ISDIR(root_stat.st_mode):
           raise RootDirectoryError(f"Root entry is not a directory: {root}")
   
       package_json: bytes | None = None
   
       if root == "package":
           package_json_path = archive_dir / "package" / "package.json"
           with contextlib.suppress(FileNotFoundError, OSError):
               package_json_stat = package_json_path.path.lstat()
               # We do this to avoid allowing package.json to be a symlink
               if stat.S_ISREG(package_json_stat.st_mode):
                   size = package_json_stat.st_size
                   if (size > 0) and (size <= util.NPM_PACKAGE_JSON_MAX_SIZE):
                       package_json = package_json_path.path.read_bytes()
   
       return root, package_json
   ```
   
   ### Where new code would go
   - `atr/tasks/checks/swhid.py` — new file
     New check module that computes SWHID `dir` identifiers for extracted
     archive trees, following the same pattern as `targz.py` and
     `zipformat.py`. Would depend on the `asfswhid` package once published.
   
   ### Proposed approach
   Integration into ATR would happen in two phases. First, the external
   `asfswhid` package (wrapping `swhid-rs`, with `gitoxide` for MIT-clean
   licensing) must be published; this is blocked on the upstream merge of
   https://github.com/swhid/swhid-rs/pull/44. Once the package is available, a
   new check module (`atr/tasks/checks/swhid.py`) should be created that:
   (1) resolves the extracted archive directory via
   `checks.resolve_archive_dir`, (2) finds the root directory inside it
   (similar to `targz.root_directory`), (3) computes the SWHID `dir`
   identifier over that tree using the `asfswhid` library, and (4) records the
   result as check data, making it visible to voters. For cross-format
   comparison, the check could look up the other archives in the same
   release/revision and compare their SWHID `dir` identifiers.
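
To make step (3) concrete, here is a minimal standard-library sketch of what a SWHID `dir` identifier is. It relies on the fact that SWHID v1 `cnt` and `dir` identifiers use the same SHA-1 construction as git blob and tree objects. It is not the `asfswhid` or `swhid-rs` API (neither is published in the form this issue needs), and the function names are hypothetical. How the real wrapper treats empty directories, permission bits, and unusual file types is an assumption here; archive formats that record permissions differently could yield different identifiers for otherwise identical content.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


def git_object_sha1(kind: str, payload: bytes) -> bytes:
    # git-style object hash: sha1 over "<kind> <length>", a NUL byte, then payload.
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def entry_sort_key(entry: tuple[bytes, bytes, bytes]) -> bytes:
    # Entries sort bytewise by name; directories compare as if named "<name>/".
    mode, name, _digest = entry
    return name + b"/" if mode == b"40000" else name


def dir_digest(root: Path) -> bytes:
    """Merkle hash of a directory tree (hypothetical helper, illustration only)."""
    entries = []
    with os.scandir(root) as it:
        for child in it:
            name = os.fsencode(child.name)
            st = child.stat(follow_symlinks=False)
            if stat.S_ISLNK(st.st_mode):
                # A symlink is hashed as a blob holding its target path.
                target = os.fsencode(os.readlink(child.path))
                entries.append((b"120000", name, git_object_sha1("blob", target)))
            elif stat.S_ISDIR(st.st_mode):
                entries.append((b"40000", name, dir_digest(Path(child.path))))
            elif stat.S_ISREG(st.st_mode):
                mode = b"100755" if st.st_mode & 0o100 else b"100644"
                # Reads the whole file for brevity; a real check would stream it.
                data = Path(child.path).read_bytes()
                entries.append((mode, name, git_object_sha1("blob", data)))
            # Other file types (sockets, devices) are skipped in this sketch.
    payload = b"".join(
        mode + b" " + name + b"\0" + digest
        for mode, name, digest in sorted(entries, key=entry_sort_key)
    )
    return git_object_sha1("tree", payload)


def swhid_dir(root: Path) -> str:
    return f"swh:1:dir:{dir_digest(root).hex()}"


# Two trees with identical content, standing in for the extracted tar.gz and zip.
with tempfile.TemporaryDirectory() as tmp:
    trees = [Path(tmp, "from_targz"), Path(tmp, "from_zip")]
    for tree in trees:
        (tree / "src").mkdir(parents=True)
        (tree / "README").write_bytes(b"hello\n")
        (tree / "src" / "main.py").write_bytes(b"print('hi')\n")
    id_targz, id_zip = (swhid_dir(t) for t in trees)
    empty = Path(tmp, "empty")
    empty.mkdir()
    id_empty = swhid_dir(empty)

print(id_targz == id_zip)  # True: the identifier depends only on tree content
print(id_empty)
```

Because the identifier is derived from file names, modes, and contents rather than from the archive container, two archives that extract to the same tree produce the same value, which is what makes the tar.gz vs zip comparison possible.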
   
   Since the external dependency doesn't exist yet as a published package, no 
concrete diff should be proposed at this time. The team should track the 
upstream merge and `tooling-swhid-py` publication, then integrate once 
available.
   
   ### Open questions
   - When will the upstream `swhid-rs` maintainer merge PR #44 (gitoxide support)?
   - Has LEGAL-728 been resolved confirming the licensing is acceptable for ASF 
use?
   - Should the SWHID dir identifier be stored as a CheckResult data field, or 
should it be persisted separately in the attestable data model for long-term 
reference?
   - Should cross-format comparison (tar.gz vs zip SWHID match) be a separate 
check function or part of a single SWHID check that examines all archives in a 
revision?
   - Will the `asfswhid` package support computing directory identifiers from
     an already-extracted filesystem tree, or only from archive files directly?
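
On the question of where cross-format comparison should live, one option is to keep it as a small pure function that is independent of how the identifiers were computed, so it works for either check layout. The sketch below is only an illustration: the function name, the archive file names, and the identifier values are made up, and nothing here reflects existing ATR code.

```python
from collections import defaultdict


def group_by_swhid(identifiers: dict[str, str]) -> dict[str, list[str]]:
    """Group archive names by their SWHID dir identifier (hypothetical helper)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for archive, swhid in identifiers.items():
        groups[swhid].append(archive)
    return {swhid: sorted(names) for swhid, names in groups.items()}


# Placeholder identifiers, not real hashes.
example = {
    "apache-foo-1.0-src.tar.gz": "swh:1:dir:" + "a" * 40,
    "apache-foo-1.0-src.zip": "swh:1:dir:" + "a" * 40,
    "apache-foo-1.0-src.tar.bz2": "swh:1:dir:" + "b" * 40,
}

groups = group_by_swhid(example)
# More than one group means at least one archive extracts to a different tree.
archives_agree = len(groups) <= 1

for swhid, names in sorted(groups.items()):
    print(swhid, names)
print("all archives agree:", archives_agree)
```

A check that examines one archive at a time could record its identifier, and a separate per-revision step could call a helper like this once all identifiers are available.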
   
   ### Files examined
   - `atr/hashes.py`
   - `atr/tasks/checks/compare.py`
   - `atr/archives.py`
   - `atr/tasks/checks/__init__.py`
   - `atr/tasks/checks/targz.py`
   - `atr/tasks/checks/zipformat.py`
   - `atr/tasks/checks/hashing.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

