asf-tooling commented on issue #479:
URL:
https://github.com/apache/tooling-trusted-releases/issues/479#issuecomment-4410211129
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@2da7807a`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `announcement_publishing`,
`project_committee_management`, `release_lifecycle`
### Summary
The issue requests watching SVN dist commits (via svnpubsub) to automatically
catalog new sub-projects, new releases, and archived releases. The existing
`atr/svn/commits.py` already subscribes to PubSub notifications for
`/svn/dist/dev` and `/svn/dist/release` and runs `svn update` on changed
paths, but it does not analyze the changes to detect new projects, new
releases, or archival. @dave2wave suggested using svnpubsub, which aligns
with the existing code pattern. The missing piece is path-analysis logic, in or
alongside `_process_payload`, that identifies structural changes and catalogs
them in the database.
### Where this lives in the code today
#### `atr/svn/commits.py` — `handle` (lines 34-46)
_needs modification_
Entry point for PubSub commit notifications; currently only updates the
working copy but does not analyze changes.
```python
# TODO: Check that these prefixes are correct
_WATCHED_PREFIXES: Final[tuple[str, ...]] = (
    "/svn/dist/dev",
    "/svn/dist/release",
)


async def handle(payload: dict, working_copy_root: pathlib.Path) -> None:
    pubsub_path = str(payload.get("pubsub_path", ""))
    # Ignore commits outside dist/dev or dist/release
    if pubsub_path.startswith(_WATCHED_PREFIXES):
        log.debug(f"PubSub payload: {payload}")
        await _process_payload(payload, working_copy_root)
```
#### `atr/svn/commits.py` — `_process_payload` (lines 42-64)
_needs modification_
Core processing of PubSub payloads; needs to be extended to analyze paths
and catalog new sub-projects, releases, and archival events.
```python
async def _process_payload(payload: dict, working_copy_root: pathlib.Path) -> None:
    """
    Update each changed file in the local working copy.

    Payload format that we listen to:
        {
          "commit": {
             "changed": ["/path/inside/repo/foo.txt", ...]
          },
          ...
        }
    """
    changed: Sequence[str] = payload.get("commit", {}).get("changed", [])
    for repo_path in changed:
        prefix = next((p for p in _WATCHED_PREFIXES if repo_path.startswith(p)), "")
        if not prefix:
            continue
        local_path = working_copy_root / repo_path[len(prefix) :].lstrip("/")
        try:
            await svn.update(local_path)
            log.info(f"svn updated {local_path}")
        except Exception as exc:
            log.warning(f"failed svn update {local_path}: {exc}")
```
#### `atr/storage/writers/project.py` — `CommitteeMember.create` (lines 141-153)
_extension point_
Existing project creation logic that could be invoked when new sub-projects
are detected from SVN commits.
```python
async def create(self, committee_key: safe.CommitteeKey, display_name: str, label: str) -> None:
    super_project = None
    # TODO: Do we need to do any additional validation on the string value?
    # Get the base project to derive from
    # We're allowing derivation from a retired project here
    # TODO: Should we disallow this instead?
    committee_projects = await self.__data.project(
        committee_key=str(committee_key), _committee=True, _release_policy=True
    ).all()
    for committee_project in committee_projects:
        if label.startswith(str(committee_project.key) + "-"):
            if (super_project is None) or (len(str(super_project.key)) < len(str(committee_project.key))):
                super_project = committee_project
```
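The loop above selects, among the committee's projects, the one whose key is the longest prefix of the new label. A standalone sketch of that selection rule, using plain strings instead of ATR's project objects, could look like the following. `pick_super_project` is a hypothetical helper for illustration only, not an ATR function.

```python
from __future__ import annotations


def pick_super_project(label: str, project_keys: list[str]) -> str | None:
    """Return the longest project key that prefixes `label` (followed by "-")."""
    # Keep only keys such that the label looks like "<key>-<something>".
    candidates = [key for key in project_keys if label.startswith(key + "-")]
    # The longest matching key is the most specific parent; None if no match.
    return max(candidates, key=len, default=None)
```

An SVN watcher that discovers a directory name such as a new sub-project label could apply the same rule to decide which existing project the new one derives from.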
#### `atr/datasources/apache.py` — `_update_projects` (lines 453-463)
_extension point_
Shows how projects are created/updated from upstream data sources; the SVN
watcher would need a similar pattern for cataloging discovered projects.
```python
async def _update_projects(data: db.Session, projects: ProjectsData) -> tuple[int, int]:
    added_count = 0
    updated_count = 0
    # Add projects and associate them with the right PMC
    for project_key, project_status in projects.items():
        # FIXME: this is a quick workaround for inconsistent data wrt webservices PMC / projects
        #        the PMC seems to be identified by the key ws, but the associated projects use webservices
        if project_key.startswith("webservices-"):
            project_key = project_key.replace("webservices-", "ws-")
            project_status.pmc = "ws"
```
#### `atr/svn/__init__.py` — `update` (lines 165-167)
_currently does this_
SVN update command used by commits.py to synchronize local working copy.
```python
async def update(path: pathlib.Path) -> str:
    log.debug(f"running svn update for '{path}'")
    return await _run_svn_command("update", str(path), "--parents")
```
### Where new code would go
- `atr/svn/commits.py` — after symbol _process_payload
Add analysis functions that parse changed paths to identify new
sub-projects, new releases, and archived releases from the dist directory
structure.
- `atr/svn/catalog.py` — new file
A dedicated module for the cataloging logic: parsing SVN dist paths into
structured release/project events and persisting them to the database.
### Proposed approach
The implementation should either extend `atr/svn/commits.py` or introduce a new
module, `atr/svn/catalog.py`, that analyzes the changed paths in each PubSub
payload. The SVN dist directory structure follows the pattern
`dist/release/<tlp>/<optional-subproject>/<version>/...` for releases, with a
similar layout for dev. By parsing the path components after the watched
prefix, the code can detect: (1) new sub-projects, when a new directory appears
at the second level under a TLP; (2) new releases, when version-level
directories appear; and (3) archived releases, when paths are removed from
`dist/release/` (SVN commits include both additions and deletions).
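
The path parsing described above can be sketched as follows. This is only an illustration: `DistPath`, `parse_dist_path`, and the version regular expression are hypothetical names that do not exist in ATR, and the rule that a version component starts with a digit (optionally after `v`) is an assumption that real dist layouts may not follow.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: a version looks like "3.4.0", "v2.1" or "1.0.0-rc1".
_VERSION_RE = re.compile(r"^v?\d+(\.\d+)*([.-][0-9A-Za-z]+)*$")

# Watched prefixes (from the existing code), mapped to a short area name.
_PREFIXES = {"/svn/dist/release/": "release", "/svn/dist/dev/": "dev"}


@dataclass(frozen=True)
class DistPath:
    area: str  # "release" or "dev"
    tlp: str
    subproject: str | None
    version: str | None


def parse_dist_path(repo_path: str) -> DistPath | None:
    """Split a repository path into TLP, optional sub-project, and version."""
    for prefix, area in _PREFIXES.items():
        if repo_path.startswith(prefix):
            parts = [p for p in repo_path[len(prefix):].split("/") if p]
            break
    else:
        return None  # outside the watched prefixes
    if not parts:
        return None
    tlp, rest = parts[0], parts[1:]
    # <tlp>/<version>/...: the sub-project level is absent.
    if rest and _VERSION_RE.match(rest[0]):
        return DistPath(area, tlp, None, rest[0])
    # <tlp>/<subproject>/<version>/...
    # Limitation: a plain file directly under the TLP is indistinguishable
    # from a sub-project directory when only the path string is available.
    subproject = rest[0] if rest else None
    version = rest[1] if len(rest) > 1 and _VERSION_RE.match(rest[1]) else None
    return DistPath(area, tlp, subproject, version)
```

Under these assumptions, a path such as `/svn/dist/release/hadoop/hadoop-common/3.4.0/...` would parse to TLP `hadoop`, sub-project `hadoop-common`, and version `3.4.0`, while a version directory directly under the TLP would leave the sub-project empty instead of being mistaken for one.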
The catalog logic would use `atr/db` to check whether projects and releases
already exist and to create records as needed. Whether the PubSub payload
reports the type of each change (add vs delete) is not yet confirmed (see Open
questions). If it lists only the changed paths, the code may need to check the
SVN diff or log (using `atr/svn.get_diff` or `atr/svn.get_log`) to distinguish
additions from deletions. For new sub-projects, it should create a project
record associated with the appropriate committee. For new releases, it should
create a release record in an appropriate phase. For archived releases, it
should update the release status. The feature aligns with @dave2wave's
suggestion to use svnpubsub, which the code already subscribes to.
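
Because the payload format is still unconfirmed, the add/delete handling above can be sketched as a small normalizer that tolerates several shapes. This is an illustration under assumptions: `normalize_changed` is a hypothetical name, and the three accepted shapes (a flat list, a `{path: action}` mapping, or a `{path: {"flags": ...}}` mapping) are guesses about what svnpubsub might send, not a confirmed format.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping, Sequence


def normalize_changed(changed: object) -> Iterator[tuple[str, str | None]]:
    """Yield (path, action) pairs; action is None when it cannot be determined."""
    if isinstance(changed, Mapping):
        for path, info in changed.items():
            # Assumption: the value is either an action string such as "A",
            # or a mapping that carries the action under a "flags" key.
            flags = info.get("flags") if isinstance(info, Mapping) else info
            stripped = flags.strip() if isinstance(flags, str) else ""
            # Use the first non-space character as the action code.
            yield str(path), (stripped[0] if stripped else None)
    elif isinstance(changed, Sequence) and not isinstance(changed, (str, bytes)):
        # A flat list carries no action information at all.
        for path in changed:
            yield str(path), None
```

Returning `None` for the unknown case lets the caller fall back to an SVN log or diff lookup instead of treating every listed path as a modification.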
### Suggested patches
#### `atr/svn/commits.py`
Extend `handle` to call a new cataloging function that analyzes the commit
paths after `_process_payload` has updated the working copy.
````diff
--- a/atr/svn/commits.py
+++ b/atr/svn/commits.py
@@ -1,10 +1,12 @@
 import pathlib
-from typing import TYPE_CHECKING, Final
+from typing import TYPE_CHECKING, Final, NamedTuple
 import atr.log as log
 import atr.svn as svn
 if TYPE_CHECKING:
     from collections.abc import Sequence
+import atr.svn.catalog as catalog
+
 # TODO: Check that these prefixes are correct
 _WATCHED_PREFIXES: Final[tuple[str, ...]] = (
     "/svn/dist/dev",
@@ -32,6 +34,8 @@
 async def handle(payload: dict, working_copy_root: pathlib.Path) -> None:
     pubsub_path = str(payload.get("pubsub_path", ""))
     # Ignore commits outside dist/dev or dist/release
     if pubsub_path.startswith(_WATCHED_PREFIXES):
         log.debug(f"PubSub payload: {payload}")
         await _process_payload(payload, working_copy_root)
+        # Catalog structural changes (new sub-projects, releases, archival)
+        await catalog.analyze_commit(payload)
 async def _process_payload(payload: dict, working_copy_root: pathlib.Path) -> None:
````
#### `atr/svn/catalog.py`
New module to analyze SVN dist commits and catalog new sub-projects,
releases, and archived releases.
````diff
--- /dev/null
+++ b/atr/svn/catalog.py
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Catalog structural changes from SVN dist commits.
+
+Watches commit payloads from svnpubsub to detect:
+1. New sub-projects (new directory at TLP/<subproject> level)
+2. New releases (new version directory under a project)
+3. Archived releases (removals from dist/release)
+"""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass, field
+from typing import Final
+
+import atr.db as db
+import atr.log as log
+import atr.models.sql as sql
+
+_RELEASE_PREFIX: Final[str] = "/svn/dist/release/"
+_DEV_PREFIX: Final[str] = "/svn/dist/dev/"
+
+# Pattern: <tlp>/<subproject>/<version>/...
+# e.g. "hadoop/hadoop-common/3.4.0/hadoop-common-3.4.0-src.tar.gz"
+_PATH_PATTERN: Final[re.Pattern[str]] = re.compile(
+    r"^(?P<tlp>[^/]+)/(?P<subproject>[^/]+)/(?P<version>[^/]+)"
+)
+
+
+@dataclass
+class DistChange:
+    """Represents a structural change detected in SVN dist."""
+
+    tlp: str
+    subproject: str | None = None
+    version: str | None = None
+    is_removal: bool = False
+
+
+@dataclass
+class CommitAnalysis:
+    """Results of analyzing a commit for structural changes."""
+
+    new_subprojects: list[DistChange] = field(default_factory=list)
+    new_releases: list[DistChange] = field(default_factory=list)
+    archived_releases: list[DistChange] = field(default_factory=list)
+
+
+async def analyze_commit(payload: dict) -> CommitAnalysis:
+    """Analyze an SVN commit payload for structural dist changes."""
+    # TODO: The payload format for added vs deleted paths needs to be confirmed.
+    # svnpubsub may provide "changed" as a dict with A/U/D keys, or as a flat list.
+    commit_info = payload.get("commit", {})
+    changed = commit_info.get("changed", {})
+
+    analysis = CommitAnalysis()
+
+    # Handle both dict format ({path: action}) and list format
+    if isinstance(changed, dict):
+        paths_with_actions = list(changed.items())
+    elif isinstance(changed, list):
+        # If flat list, we cannot distinguish additions from deletions
+        # TODO: May need to use svn log/diff to determine action type
+        paths_with_actions = [(p, "U") for p in changed]
+    else:
+        return analysis
+
+    for repo_path, action in paths_with_actions:
+        _classify_path_change(repo_path, action, analysis)
+
+    if analysis.new_subprojects or analysis.new_releases or analysis.archived_releases:
+        await _persist_changes(analysis)
+
+    return analysis
+
+
+def _classify_path_change(repo_path: str, action: str, analysis: CommitAnalysis) -> None:
+    """Classify a single path change into the appropriate category."""
+    is_release = repo_path.startswith(_RELEASE_PREFIX)
+    is_dev = repo_path.startswith(_DEV_PREFIX)
+    if not (is_release or is_dev):
+        return
+
+    prefix = _RELEASE_PREFIX if is_release else _DEV_PREFIX
+    relative = repo_path[len(prefix):]
+    match = _PATH_PATTERN.match(relative)
+    if not match:
+        return
+
+    tlp = match.group("tlp")
+    subproject = match.group("subproject")
+    version = match.group("version")
+    is_removal = (action == "D")
+
+    change = DistChange(tlp=tlp, subproject=subproject, version=version, is_removal=is_removal)
+
+    if is_removal and is_release:
+        analysis.archived_releases.append(change)
+    elif not is_removal and version:
+        analysis.new_releases.append(change)
+    # TODO: Check if subproject is previously unknown to catalog it as new
+
+
+async def _persist_changes(analysis: CommitAnalysis) -> None:
+    """Persist detected changes to the database."""
+    # TODO: Implement database persistence
+    # For new sub-projects: create project records associated with the TLP committee
+    # For new releases: create release records (or update existing ones)
+    # For archived releases: update release phase/status
+    for change in analysis.new_subprojects:
+        log.info(f"Detected new sub-project: {change.tlp}/{change.subproject}")
+
+    for change in analysis.new_releases:
+        log.info(f"Detected new release: {change.tlp}/{change.subproject} v{change.version}")
+
+    for change in analysis.archived_releases:
+        log.info(f"Detected archived release: {change.tlp}/{change.subproject} v{change.version}")
````
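
One point the patch above leaves open: a single release usually touches many files under the same version directory, so classifying each path separately can record the same release several times. A minimal sketch of collapsing per-path results into unique events follows. `ReleaseEvent` and `unique_events` are hypothetical names; the sketch uses its own frozen dataclass because a plain `@dataclass` such as `DistChange` is not hashable by default.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseEvent:
    """Hashable stand-in for a per-path classification result."""

    tlp: str
    subproject: str | None
    version: str | None
    is_removal: bool = False


def unique_events(events: Iterable[ReleaseEvent]) -> list[ReleaseEvent]:
    """Drop duplicate events while keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(events))
```

Deduplicating before persistence would keep the database writes and log lines at one per release rather than one per file.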
### Open questions
- What is the exact format of the svnpubsub payload for changed paths? Does
it distinguish additions (A), deletions (D), and modifications (U), or is it
a flat list requiring additional svn log/diff calls?
- Should detected releases be created in a specific phase (e.g., RELEASE)
since they're already published in SVN dist, or should they be cataloged
separately as 'legacy releases'?
- How should the system determine if a path component is a 'sub-project' vs
a 'version' — is there a naming convention or should it use heuristics (e.g.,
version-like patterns)?
- What database model should be used to track 'archived releases' — should
it set sql.ProjectStatus.RETIRED on the project, or mark individual releases
with a specific status?
- Issue references #478 — what methods from that issue should be adapted?
The referenced issue is not visible here.
### Files examined
- `atr/tasks/svn.py`
- `atr/svn/commits.py`
- `atr/svn/__init__.py`
- `atr/storage/writers/project.py`
- `atr/storage/writers/release.py`
- `atr/datasources/apache.py`
- `atr/storage/writers/revision.py`
- `atr/attestable.py`
### Related issues
This issue appears related to: #478.
_Both address cataloging and watching SVN dist releases for projects_
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]