asf-tooling commented on issue #1154:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1154#issuecomment-4403364137

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@837830e8`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `artifact_validation`, `attestable_tracking`
   
   ### Summary
   The issue requests computing SWHID (Software Heritage Identifier) directory 
identifiers for release archives during validation. This would enable two use 
cases: (1) cross-format archive comparison, where a tar.gz and a zip that 
produce the same SWHID directory identifier are confirmed to contain the same 
file tree, and (2) comparison between a Git commit tree and a source archive. 
The codebase already has infrastructure for archive extraction, tree 
comparison, and attestable data recording that could be extended. The SWHID 
computation would likely be implemented as a new check in `atr/tasks/checks/`, 
with the identifier stored in attestable data or check results.
   
   ### Where this lives in the code today
   
   #### `atr/tasks/checks/compare.py` — `source_trees` (lines 65-149)
   _extension point_
   This existing check already compares source archives against GitHub 
checkouts using rsync; a SWHID check would complement this by providing a 
cryptographic Merkle hash comparison.
   
   ```python
   async def source_trees(args: checks.FunctionArguments) -> results.Results | None:  # noqa: C901
       recorder = await args.recorder(CHECK_VERSION)
       is_source = await recorder.primary_path_is_source()
       if not is_source:
           log.info(
               "Skipping compare.source_trees because the input is not a source artifact",
               ...
           )
           return None
   
       payload = await attestable.github_tp_payload_read(args.project_key, args.version_key, args.revision_number)
       checkout_dir: str | None = None
       archive_dir: str | None = None
       if payload is not None:
   ```
   
   #### `atr/archives.py` — `extract` (lines 35-65)
   _extension point_
   Archive extraction is already handled; SWHID computation would operate on 
the extracted directory tree.
   
   ```python
   def extract(
       archive_path: safe.StatePath,
       extract_dir: str,
       max_size: int,
       chunk_size: int = 4096,
       track_files: bool | set[str] = False,
   ) -> tuple[int, list[str]]:
       # chunk_size retained for signature stability; exarch manages buffering internally.
       del chunk_size
       log.info(f"Extracting {archive_path} to {extract_dir}")
   
       cfg = _build_extraction_config(max_size)
       extracted_paths: list[str] = []
       ...
       try:
           report = exarch.extract_archive(str(archive_path), extract_dir, cfg)
       except exarch.QuotaExceededError as exc:
           raise ExtractionError(f"Extraction exceeded size limit of {max_size} bytes: {exc}") from exc
   ```
   
   #### `atr/hashes.py` — `compute_file_hash` (lines 39-46)
   _extension point_
   Existing hashing infrastructure; SWHID computation uses SHA-1 in Git's 
object format, so a new function would be needed alongside these.
   
   ```python
   async def compute_file_hash(path: str | os.PathLike) -> str:
       path = pathlib.Path(path)
       hasher = blake3.blake3()
       async with aiofiles.open(path, "rb") as f:
           while chunk := await f.read(_HASH_CHUNK_SIZE):
               hasher.update(chunk)
       return f"blake3:{hasher.hexdigest()}"
   ```
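
As a rough illustration of the new function mentioned above, the following sketch streams a file through SHA-1 using Git's blob object format instead of reading it fully into memory. The name `compute_git_blob_id` and the chunk size are hypothetical and not taken from `atr/hashes.py`; the sketch is synchronous and assumes the file does not change between the size lookup and the read.

```python
from __future__ import annotations

import hashlib
import os

# Hypothetical chunk size for this sketch; not the value used in atr/hashes.py.
_CHUNK_SIZE = 1 << 20


def compute_git_blob_id(path: str | os.PathLike[str]) -> str:
    """Return the hex SHA-1 of the Git blob object for the file at path."""
    # The blob header needs the content length up front, so stat first.
    size = os.stat(path).st_size
    hasher = hashlib.sha1(usedforsecurity=False)
    hasher.update(b"blob " + str(size).encode("ascii") + b"\x00")
    with open(path, "rb") as f:
        while chunk := f.read(_CHUNK_SIZE):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The result should match `git hash-object <file>` for the same bytes, which gives a simple way to cross-check an implementation.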
   
   #### `atr/models/attestable.py` — `PathEntryV2` (lines 72-80)
   _extension point_
   The attestable path entry model could be extended (in a V3 or via check 
results) to include a SWHID directory identifier for source archives.
   
   ```python
   class PathEntryV2(schema.Strict):
       content_hash: str
       classification: str
       provenance: ProvenanceV2 | None = None
   ```
   
   #### `atr/tasks/checks/__init__.py` — `Recorder` (lines 48-283)
   _extension point_
   The check recording infrastructure would be used by a new SWHID check to 
record computed identifiers as check results.
   
   ```python
   class Recorder:
       checker: str
       checker_version: str | None
       ...
       async def success(
           self,
           message: str,
           data: Any,
           primary_rel_path: safe.RelPath | None = None,
           member_rel_path: str | None = None,
       ) -> sql.CheckResult:
   ```
   
   #### `atr/attestable.py` — `write_checks_data` (lines 246-265)
   _extension point_
   SWHID identifiers could be stored as check data associated with each source 
archive path.
   
   ```python
   async def write_checks_data(
       project_key: safe.ProjectKey,
       version_key: safe.VersionKey,
       revision_number: safe.RevisionNumber,
       rel_path: str,
       checks: dict[str, str],
   ) -> None:
       log.info(f"Writing checks for {project_key}/{version_key}/{revision_number}/{rel_path}: {checks}")
   
       def modify(content: str) -> str:
           try:
               current = models.AttestableChecksV2.model_validate_json(content).checks
           except pydantic.ValidationError:
               current = {}
           if rel_path not in current:
               current[rel_path] = checks
           else:
               current[rel_path].update(checks)
           result = models.AttestableChecksV2(checks=current)
           return result.model_dump_json(indent=2)
   
       await _atomic_modify_readonly(attestable_checks_path(project_key, version_key, revision_number).path, modify)
   ```
   
   ### Where new code would go
   - `atr/swhid.py` — new file
     A new module implementing SWHID directory identifier computation following 
the Git tree object hashing algorithm (SHA-1 Merkle tree over sorted directory 
entries with mode, name, and blob/tree hashes).
   - `atr/tasks/checks/swhid.py` — new file
     A new check module that computes SWHID dir identifiers for source archives 
and optionally compares them across formats or against a Git commit tree.
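
The cross-format comparison mentioned above could be as small as grouping archives by their computed identifier. This is only a sketch: `group_by_swhid` and `formats_agree` are hypothetical helpers, and the input mapping (archive path to SWHID string) is an assumed shape, not an existing ATR structure.

```python
from __future__ import annotations

from collections import defaultdict


def group_by_swhid(ids_by_archive: dict[str, str]) -> dict[str, list[str]]:
    """Group archive paths by the SWHID directory identifier computed for each."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for archive, swhid in sorted(ids_by_archive.items()):
        groups[swhid].append(archive)
    return dict(groups)


def formats_agree(ids_by_archive: dict[str, str]) -> bool:
    """Return True when every archive produced the same identifier."""
    return len(set(ids_by_archive.values())) <= 1
```

More than one group for a single revision would indicate that the archives do not contain the same tree, and the group contents show which formats diverge.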
   
   ### Proposed approach
   The implementation would involve two main components:
   
   1. **SWHID computation module** (`atr/swhid.py`): Implements the SWHID 
directory identifier algorithm, which mirrors Git's tree object hashing. For 
each file (blob), compute `sha1("blob <size>\0" + content)`. For each directory 
(tree), sort entries by name (with directories having a trailing `/` for sort 
purposes per Git convention), format as `<mode> <name>\0<20-byte-sha1>`, then 
compute `sha1("tree <size>\0" + entries)`. The top-level directory hash becomes 
the SWHID `swh:1:dir:<hex-sha1>`.
   
   2. **SWHID check** (`atr/tasks/checks/swhid.py`): A new automated check that 
runs on source archives (similar to how `compare.source_trees` runs). It 
extracts the archive (using the existing archive cache from 
`resolve_archive_dir`), computes the SWHID dir identifier over the root 
directory, and records it as a check result. When multiple source archives 
exist for the same revision (e.g., .tar.gz and .zip), it can compare their 
SWHID identifiers. The check would use the existing `Recorder` infrastructure 
to store results. A dependency on an external library like `swh.model` could be 
considered, but a minimal pure-Python implementation (just SHA-1 over directory 
trees) would be straightforward and avoid heavyweight dependencies.
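
The hashing steps in item 1 can be sketched over in-memory entries as follows. This is a minimal illustration under stated assumptions, not the proposed module: `Entry`, `git_sort_key`, `blob_id`, and `tree_id` are hypothetical names, and the sort key compares raw name bytes with a trailing `/` appended for directories, which is the Git convention described above.

```python
from __future__ import annotations

import hashlib
from typing import NamedTuple

MODE_DIR = b"40000"  # Git mode for a sub-tree entry


class Entry(NamedTuple):
    name: bytes       # entry name as raw bytes
    mode: bytes       # e.g. b"100644", b"100755", b"120000", b"40000"
    object_id: bytes  # raw 20-byte SHA-1 of the blob or sub-tree


def git_sort_key(entry: Entry) -> bytes:
    """Sort by name bytes, with directories compared as if they ended in '/'."""
    return entry.name + b"/" if entry.mode == MODE_DIR else entry.name


def _object_id(kind: bytes, body: bytes) -> bytes:
    # Git object header: "<kind> <size>\0", followed by the body.
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body, usedforsecurity=False).digest()


def blob_id(content: bytes) -> bytes:
    return _object_id(b"blob", content)


def tree_id(entries: list[Entry]) -> bytes:
    # Each entry is serialized as "<mode> <name>\0<20-byte-sha1>".
    body = b"".join(
        e.mode + b" " + e.name + b"\x00" + e.object_id
        for e in sorted(entries, key=git_sort_key)
    )
    return _object_id(b"tree", body)
```

With this key, a directory `foo` sorts after the files `foo-bar` and `foo.txt`, whereas a plain sort by name would place it first. The identifier would then be `"swh:1:dir:" + tree_id(entries).hex()`.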
   
   ### Suggested patches
   
   #### `atr/swhid.py`
   New module implementing SWHID directory identifier computation using Git's 
tree object hashing algorithm.
   
   ````diff
   --- /dev/null
   +++ b/atr/swhid.py
   @@ -0,0 +1,80 @@
   +# Licensed to the Apache Software Foundation (ASF) under one
   +# or more contributor license agreements.  See the NOTICE file
   +# distributed with this work for additional information
   +# regarding copyright ownership.  The ASF licenses this file
   +# to you under the Apache License, Version 2.0 (the
   +# "License"); you may not use this file except in compliance
   +# with the License.  You may obtain a copy of the License at
   +#
   +#   http://www.apache.org/licenses/LICENSE-2.0
   +#
   +# Unless required by applicable law or agreed to in writing,
   +# software distributed under the License is distributed on an
   +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   +# KIND, either express or implied.  See the License for the
   +# specific language governing permissions and limitations
   +# under the License.
   +
   +r"""Compute SWHID (Software Heritage Identifier) directory identifiers.
   +
   +SWHIDs use the same Merkle hash algorithm as Git tree objects:
   +- blob: sha1("blob <size>\0" + content)
   +- tree: sha1("tree <size>\0" + sorted_entries)
   +
   +See https://www.swhid.org/ and ECMA-428.
   +"""
   +
   +from __future__ import annotations
   +
   +import hashlib
   +import os
   +import pathlib
   +from typing import Final
   +
   +_BLOB_PREFIX: Final[bytes] = b"blob "
   +_TREE_PREFIX: Final[bytes] = b"tree "
   +_NULL: Final[bytes] = b"\x00"
   +# Git file modes
   +_MODE_FILE: Final[bytes] = b"100644"
   +_MODE_EXECUTABLE: Final[bytes] = b"100755"
   +_MODE_SYMLINK: Final[bytes] = b"120000"
   +_MODE_DIR: Final[bytes] = b"40000"
   +
   +
   +def compute_blob_id(content: bytes) -> bytes:
   +    """Compute the Git blob object ID (raw 20-byte SHA-1)."""
   +    header = _BLOB_PREFIX + str(len(content)).encode("ascii") + _NULL
   +    return hashlib.sha1(header + content).digest()  # noqa: S324
   +
   +
   +def compute_directory_id(directory: pathlib.Path) -> str:
   +    """Compute the SWHID directory identifier for a filesystem tree.
   +
   +    Returns the identifier in qualified form: swh:1:dir:<hex-sha1>
   +    """
   +    raw_hash = _compute_tree_hash(directory)
   +    return f"swh:1:dir:{raw_hash.hex()}"
   +
   +
   +def _compute_tree_hash(directory: pathlib.Path) -> bytes:
   +    """Recursively compute the Git tree object hash for a directory."""
   +    entries: list[tuple[bytes, bytes, bytes]] = []  # (name, mode, hash)
   +    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
   +        name = entry.name.encode("utf-8")
   +        entry_path = pathlib.Path(entry.path)
   +        if entry.is_symlink():
   +            target = os.readlink(entry.path)
   +            blob_hash = compute_blob_id(target.encode("utf-8") if isinstance(target, str) else target)
   +            entries.append((name, _MODE_SYMLINK, blob_hash))
   +        elif entry.is_dir(follow_symlinks=False):
   +            tree_hash = _compute_tree_hash(entry_path)
   +            entries.append((name, _MODE_DIR, tree_hash))
   +        elif entry.is_file(follow_symlinks=False):
   +            content = entry_path.read_bytes()
   +            blob_hash = compute_blob_id(content)
   +            mode = _MODE_EXECUTABLE if os.access(entry.path, os.X_OK) else _MODE_FILE
   +            entries.append((name, mode, blob_hash))
   +    # Build the tree object content
   +    tree_content = b""
   +    for name, mode, obj_hash in entries:
   +        tree_content += mode + b" " + name + _NULL + obj_hash
   +    header = _TREE_PREFIX + str(len(tree_content)).encode("ascii") + _NULL
   +    return hashlib.sha1(header + tree_content).digest()  # noqa: S324
   ````
   
   #### `atr/tasks/checks/swhid.py`
   New check module that computes SWHID identifiers for source archives and 
compares across formats.
   
   ````diff
   --- /dev/null
   +++ b/atr/tasks/checks/swhid.py
   @@ -0,0 +1,72 @@
   +# Licensed to the Apache Software Foundation (ASF) under one
   +# or more contributor license agreements.  See the NOTICE file
   +# distributed with this work for additional information
   +# regarding copyright ownership.  The ASF licenses this file
   +# to you under the Apache License, Version 2.0 (the
   +# "License"); you may not use this file except in compliance
   +# with the License.  You may obtain a copy of the License at
   +#
   +#   http://www.apache.org/licenses/LICENSE-2.0
   +#
   +# Unless required by applicable law or agreed to in writing,
   +# software distributed under the License is distributed on an
   +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   +# KIND, either express or implied.  See the License for the
   +# specific language governing permissions and limitations
   +# under the License.
   +
   +"""Check that computes SWHID directory identifiers for source archives."""
   +
   +from __future__ import annotations
   +
   +import asyncio
   +import pathlib
   +from typing import Final
   +
   +import aiofiles.os
   +
   +import atr.log as log
   +import atr.models.results as results
   +import atr.models.safe as safe
   +import atr.swhid as swhid
   +import atr.tasks.checks as checks
   +
   +INPUT_POLICY_KEYS: Final[list[str]] = []
   +INPUT_EXTRA_ARGS: Final[list[str]] = []
   +CHECK_VERSION: Final[str] = "1"
   +
   +
   +async def directory_id(args: checks.FunctionArguments) -> results.Results | None:
   +    """Compute the SWHID directory identifier for a source archive."""
   +    recorder = await args.recorder(CHECK_VERSION)
   +    is_source = await recorder.primary_path_is_source()
   +    if not is_source:
   +        log.info(
   +            "Skipping swhid.directory_id because the input is not a source artifact",
   +            project=args.project_key,
   +            version=args.version_key,
   +            revision=args.revision_number,
   +            path=args.primary_rel_path,
   +        )
   +        return None
   +
   +    extracted_dir = await checks.resolve_archive_dir(args)
   +    if extracted_dir is None:
   +        await recorder.failure(
   +            "Extracted archive tree is not available for SWHID computation",
   +            {"rel_path": str(args.primary_rel_path)},
   +        )
   +        return None
   +
   +    # Find the single root directory inside the extracted archive
   +    entries = await aiofiles.os.listdir(extracted_dir)
   +    directories = [
   +        e for e in entries if not e.startswith("._") and (extracted_dir / e).path.is_dir()
   +    ]
   +    if len(directories) != 1:
   +        await recorder.warning(
   +            "Cannot compute SWHID: expected single root directory in archive",
   +            {"directories_found": len(directories)},
   +        )
   +        return None
   +
   +    root_dir = (extracted_dir / directories[0]).path
   +    swhid_id = await asyncio.to_thread(swhid.compute_directory_id, root_dir)
   +
   +    await recorder.success(
   +        "Computed SWHID directory identifier for source archive",
   +        {"swhid": swhid_id, "root_directory": directories[0]},
   +    )
   +    return None
   ````
   
   ### Open questions
   - Should the SWHID identifier be stored in the attestable data model 
(requiring a V3 or optional field on PathEntryV2) or only in check results?
   - Should cross-format comparison (matching SWHID between .tar.gz and .zip of 
the same release) be a separate check or part of the same check?
   - Git sorts tree entries bytewise, treating each directory as if its name 
had a trailing '/'. The proposed implementation uses a plain `sorted()` by 
name, which orders a directory `foo` before a file `foo.txt` where Git orders 
it after; the sort key needs to be corrected and verified against Git/SWH.
   - Should the project use the `swh.model` Python package (from Software 
Heritage) for authoritative SWHID computation, or implement it minimally to 
avoid the dependency?
   - How should executable bit detection work across different archive formats, 
given that zip files may not preserve Unix permissions?
   - The issue mentions `.gitattributes` handling (export-ignore, text 
eol=crlf) — should this be handled in the comparison logic or just documented 
as known divergence cases?
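
On the executable-bit question, one possible approach is to read the Unix mode from the zip metadata rather than from the extracted file. The sketch below uses `zipfile.ZipInfo.create_system` and `external_attr` from the standard library; the helper name `git_mode_for_zip_member` is hypothetical, and treating only the owner-execute bit as significant is an assumption borrowed from Git's behavior that would need checking against the SWHID tooling.

```python
from __future__ import annotations

import stat
import zipfile

_CREATE_SYSTEM_UNIX = 3  # ZipInfo.create_system value for archives made on Unix


def git_mode_for_zip_member(info: zipfile.ZipInfo) -> bytes | None:
    """Map a zip member to a Git tree mode, or None if the zip carries no Unix mode."""
    if info.is_dir():
        return b"40000"
    if info.create_system != _CREATE_SYSTEM_UNIX:
        return None  # e.g. created on Windows: permissions are not recorded
    unix_mode = info.external_attr >> 16  # high 16 bits hold st_mode on Unix
    if unix_mode == 0:
        return None
    if stat.S_ISLNK(unix_mode):
        return b"120000"
    # Assumption: only the owner-execute bit decides between 100755 and 100644.
    return b"100755" if unix_mode & 0o100 else b"100644"
```

A `None` result marks the cases where a zip cannot be expected to reproduce the modes of the matching tar.gz, so the check could report them as a known divergence rather than a mismatch.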
   
   ### Files examined
   - `atr/attestable.py`
   - `atr/hashes.py`
   - `atr/models/attestable.py`
   - `atr/archives.py`
   - `atr/tasks/checks/compare.py`
   - `tests/unit/test_attestable.py`
   - `atr/tasks/checks/__init__.py`
   - `tests/unit/test_checks_compare.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

