asf-tooling commented on issue #364:
URL:
https://github.com/apache/tooling-trusted-releases/issues/364#issuecomment-4410363272
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@2da7807a`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `project_committee_management`,
`shared_infrastructure`
### Summary
The issue requests recording what happens when committees are automatically
synced from remote data sources (Whimsy, projects.apache.org). @sbp noted this
can be recorded in the audit log. @developer-ravi-03 volunteered to implement
it via PR but no response was given. The `update_metadata()` function in
`atr/datasources/apache.py` already returns aggregate counts but doesn't record
specific changes (which committees were added/modified, what fields changed).
The enhancement would add detailed change tracking to the metadata sync process.
### Where this lives in the code today
#### `atr/datasources/apache.py` — `update_metadata` (lines 298-334)
_needs modification_
This is the main entry point for metadata syncing. It returns only aggregate
counts but doesn't record details about what specifically changed.
```python
async def update_metadata() -> tuple[int, int]:
"""Update metadata from remote data sources."""
ldap_projects = await get_ldap_projects_data()
projects = await get_projects_data()
podlings_data = await get_current_podlings_data()
committees = await get_active_committee_data()
ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in
ldap_projects.projects}
committees_by_name: Mapping[str, Committee] = {c.name: c for c in
committees.committees}
added_count = 0
updated_count = 0
async with db.session() as data:
async with data.begin():
added, updated = await _update_committees(data, ldap_projects,
committees_by_name)
added_count += added
updated_count += updated
added, updated = await _update_podlings(data, podlings_data,
ldap_projects_by_name)
added_count += added
updated_count += updated
added, updated = await _update_projects(data, projects)
added_count += added
updated_count += updated
added, updated = await _update_tooling(data)
added_count += added
updated_count += updated
added, updated = await _process_undiscovered(data)
added_count += added
updated_count += updated
return added_count, updated_count
```
#### `atr/datasources/apache.py` — `_update_committees` (lines 374-406)
_needs modification_
Updates committees from LDAP data. Currently only tracks counts, not details
of which committees changed or what fields were modified.
```python
async def _update_committees(
data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name:
Mapping[str, Committee]
) -> tuple[int, int]:
added_count = 0
updated_count = 0
# First create PMC committees
for project in ldap_projects.projects:
name = project.name
# Skip non-PMC committees
if project.pmc is not True:
continue
# Get or create PMC
committee = await data.committee(key=name).get()
if not committee:
committee = sql.Committee(key=name)
data.add(committee)
added_count += 1
else:
updated_count += 1
committee.committee_members = project.owners
committee.committers = project.members
# We create PMCs for now
committee.is_podling = False
committee_info = committees_by_name.get(name)
if committee_info:
committee.name = committee_info.display_name
updated_count += 1
return added_count, updated_count
```
#### `atr/tasks/__init__.py` — `metadata_update` (lines 243-271)
_currently does this_
Queues the metadata update as a background task. The task result should
ultimately contain the audit details.
```python
async def metadata_update(
asf_uid: str,
caller_data: db.Session | None = None,
schedule: datetime.datetime | None = None,
schedule_next: bool = False,
) -> sql.Task:
"""Queue a metadata update task."""
task_args = args.Update(asf_uid=asf_uid, next_schedule_seconds=0)
if schedule_next:
task_args.next_schedule_seconds = _DAILY
async with db.ensure_session(caller_data) as data:
task = sql.Task(
status=sql.TaskStatus.QUEUED,
task_type=sql.TaskType.METADATA_UPDATE,
task_args=task_args.model_dump(),
asf_uid=asf_uid,
revision_number=None,
primary_rel_path=None,
project_key=None,
version_key=None,
)
if schedule:
task.scheduled = schedule
await data.begin_immediate()
await _clear_existing_scheduled(data,
sql.TaskType.METADATA_UPDATE, asf_uid)
data.add(task)
await data.commit()
await data.flush()
return task
```
#### `atr/tasks/maintenance.py` — `run` (lines 40-65)
_extension point_
Shows the pattern for background tasks returning structured results. The
metadata update task should follow the same pattern but include audit details.
```python
@checks.with_model(args.MaintenanceArgs)
async def run(task_args: args.MaintenanceArgs) -> results.Results | None:
"""Run maintenance."""
log.info(f"Starting maintenance (user: {task_args.asf_uid})")
try:
# We schedule first to keep the start time approximately consistent
# (and knowing this task will finish before it's re-scheduled)
await tasks.schedule_next(task_args.asf_uid,
task_args.next_schedule_seconds, tasks.run_maintenance)
await _expired_pats_maintenance()
await _session_data_maintenance()
await _storage_maintenance()
await _workflow_ssh_keys_maintenance()
log.info(
"Storage maintenance completed successfully",
)
return results.Maintenance(
kind="maintenance",
)
except Exception as e:
error_msg = f"Unexpected error during maintenance: {e!s}"
log.exception("Maintenance failed with unexpected error")
raise MaintenanceError(error_msg) from e
```
### Where new code would go
- `atr/datasources/apache.py` — after symbol update_metadata
Add a dataclass or typed structure to hold detailed change records (e.g.,
MetadataUpdateEvent with committee_key, change_type, details, timestamp)
- `atr/models/results.py` — new result type
Add a MetadataUpdate result type that includes the list of changes made
during the sync, following the existing Results pattern
### Proposed approach
The implementation should capture what specifically changed during each
metadata sync and record it in the task result (which is already persisted in
the database). The approach has two parts:
1. **Collect change details in `update_metadata()`**: Modify
`_update_committees()`, `_update_podlings()`, and `_update_projects()` to
return structured change records (e.g., 'committee X added', 'committee Y
members changed from [a,b] to [a,b,c]'). Change the return type of
`update_metadata()` from `tuple[int, int]` to include these details.
2. **Record changes in the task result**: The `metadata.update` task handler
(in `atr/tasks/metadata.py`) should store these change details in the task's
`result` field using a new `results.MetadataUpdate` result type. This makes the
audit trail queryable through the existing task infrastructure. Additionally,
log each change at info level so it appears in the application logs.
### Suggested patches
#### `atr/datasources/apache.py`
Add change tracking to _update_committees and surface it from update_metadata
````diff
--- a/atr/datasources/apache.py
+++ b/atr/datasources/apache.py
@@ -20,6 +20,7 @@
from __future__ import annotations
+import dataclasses
import datetime
from typing import TYPE_CHECKING, Annotated, Any, Final
@@ -38,6 +39,18 @@
import atr.models.sql as sql
import atr.util as util
+
[email protected]
+class MetadataChangeRecord:
+ """A single change made during metadata sync."""
+
+ timestamp: datetime.datetime
+ entity_type: str # "committee", "podling", "project", "tooling"
+ entity_key: str
+ change_type: str # "added", "updated"
+ details: str # Human-readable description of what changed
+
+
_WHIMSY_COMMITTEE_INFO_URL: Final[str] =
"https://whimsy.apache.org/public/committee-info.json"
_WHIMSY_COMMITTEE_RETIRED_URL: Final[str] =
"https://whimsy.apache.org/public/committee-retired.json"
_WHIMSY_PROJECTS_URL: Final[str] =
"https://whimsy.apache.org/public/public_ldap_projects.json"
@@ -283,7 +296,7 @@
-async def update_metadata() -> tuple[int, int]:
+async def update_metadata() -> tuple[int, int, list[MetadataChangeRecord]]:
"""Update metadata from remote data sources."""
ldap_projects = await get_ldap_projects_data()
@@ -295,27 +308,31 @@
added_count = 0
updated_count = 0
+ changes: list[MetadataChangeRecord] = []
async with db.session() as data:
async with data.begin():
- added, updated = await _update_committees(data, ldap_projects,
committees_by_name)
+ added, updated, committee_changes = await
_update_committees(data, ldap_projects, committees_by_name)
added_count += added
updated_count += updated
+ changes.extend(committee_changes)
- added, updated = await _update_podlings(data, podlings_data,
ldap_projects_by_name)
+ added, updated, podling_changes = await _update_podlings(data,
podlings_data, ldap_projects_by_name)
added_count += added
updated_count += updated
+ changes.extend(podling_changes)
- added, updated = await _update_projects(data, projects)
+ added, updated, project_changes = await _update_projects(data,
projects)
added_count += added
updated_count += updated
+ changes.extend(project_changes)
added, updated = await _update_tooling(data)
added_count += added
updated_count += updated
added, updated = await _process_undiscovered(data)
added_count += added
updated_count += updated
- return added_count, updated_count
+ return added_count, updated_count, changes
````
#### `atr/datasources/apache.py`
Update _update_committees to track and return detailed change records
````diff
--- a/atr/datasources/apache.py
+++ b/atr/datasources/apache.py
@@ -355,9 +355,10 @@
async def _update_committees(
data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name:
Mapping[str, Committee]
-) -> tuple[int, int]:
+) -> tuple[int, int, list[MetadataChangeRecord]]:
added_count = 0
updated_count = 0
+ changes: list[MetadataChangeRecord] = []
+ now = datetime.datetime.now(datetime.UTC)
# First create PMC committees
for project in ldap_projects.projects:
@@ -369,8 +370,21 @@
committee = await data.committee(key=name).get()
if not committee:
committee = sql.Committee(key=name)
data.add(committee)
added_count += 1
+ changes.append(MetadataChangeRecord(
+ timestamp=now,
+ entity_type="committee",
+ entity_key=name,
+ change_type="added",
+ details=f"New committee '{name}' discovered from LDAP",
+ ))
else:
+ # Track membership changes
+ old_members = set(committee.committee_members)
+ new_members = set(project.owners)
+ if old_members != new_members:
+ added_members = new_members - old_members
+ removed_members = old_members - new_members
+ detail_parts = []
+ if added_members:
+ detail_parts.append(f"members added: {',
'.join(sorted(added_members))}")
+ if removed_members:
+ detail_parts.append(f"members removed: {',
'.join(sorted(removed_members))}")
+ changes.append(MetadataChangeRecord(
+ timestamp=now,
+ entity_type="committee",
+ entity_key=name,
+ change_type="updated",
+ details=f"Committee members changed: {';
'.join(detail_parts)}",
+ ))
updated_count += 1
committee.committee_members = project.owners
@@ -383,7 +397,7 @@
updated_count += 1
- return added_count, updated_count
+ return added_count, updated_count, changes
````
### Open questions
- What does `atr/tasks/metadata.py` look like? It's the task handler that
calls `update_metadata()` and would need to be updated to handle the new return
value and store change records in the task result.
- Does a `results.MetadataUpdate` result type already exist in
`atr/models/results.py`, or does it need to be created to persist the change
records?
- Should `_update_podlings()` and `_update_projects()` also be updated to
return change records (similar to the `_update_committees` pattern shown)?
- Is there a dedicated audit log system beyond task results, or should the
task result field be the primary audit mechanism?
- PR #449 by @developer-ravi-03 was mentioned — is that PR related to this
issue or was it a separate contribution?
### Files examined
- `atr/datasources/apache.py`
- `atr/db/__init__.py`
- `atr/models/sql.py`
- `atr/tasks/__init__.py`
- `atr/tasks/maintenance.py`
- `tests/unit/datasources/test_apache.py`
- `tests/unit/datasources/testdata/committees.json`
- `atr/db/interaction.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]