asf-tooling commented on issue #364:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/364#issuecomment-4410363272

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `project_committee_management`, 
`shared_infrastructure`
   
   ### Summary
   The issue requests recording what happens when committees are automatically 
synced from remote data sources (Whimsy, projects.apache.org). @sbp noted this 
can be recorded in the audit log. @developer-ravi-03 volunteered to implement 
it via PR but no response was given. The `update_metadata()` function in 
`atr/datasources/apache.py` already returns aggregate counts but doesn't record 
specific changes (which committees were added/modified, what fields changed). 
The enhancement would add detailed change tracking to the metadata sync process.
   
   ### Where this lives in the code today
   
   #### `atr/datasources/apache.py` — `update_metadata` (lines 298-334)
   _needs modification_
   This is the main entry point for metadata syncing. It returns only aggregate 
counts but doesn't record details about what specifically changed.
   
   ```python
   async def update_metadata() -> tuple[int, int]:
       """Update metadata from remote data sources."""
   
       ldap_projects = await get_ldap_projects_data()
       projects = await get_projects_data()
       podlings_data = await get_current_podlings_data()
       committees = await get_active_committee_data()
   
       ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in 
ldap_projects.projects}
       committees_by_name: Mapping[str, Committee] = {c.name: c for c in 
committees.committees}
   
       added_count = 0
       updated_count = 0
   
       async with db.session() as data:
           async with data.begin():
               added, updated = await _update_committees(data, ldap_projects, 
committees_by_name)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_podlings(data, podlings_data, 
ldap_projects_by_name)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_projects(data, projects)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_tooling(data)
               added_count += added
               updated_count += updated
   
               added, updated = await _process_undiscovered(data)
               added_count += added
               updated_count += updated
   
       return added_count, updated_count
   ```
   
   #### `atr/datasources/apache.py` — `_update_committees` (lines 374-406)
   _needs modification_
   Updates committees from LDAP data. Currently only tracks counts, not details 
of which committees changed or what fields were modified.
   
   ```python
   async def _update_committees(
       data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name: 
Mapping[str, Committee]
   ) -> tuple[int, int]:
       added_count = 0
       updated_count = 0
   
       # First create PMC committees
       for project in ldap_projects.projects:
           name = project.name
           # Skip non-PMC committees
           if project.pmc is not True:
               continue
   
           # Get or create PMC
           committee = await data.committee(key=name).get()
           if not committee:
               committee = sql.Committee(key=name)
               data.add(committee)
               added_count += 1
           else:
               updated_count += 1
   
           committee.committee_members = project.owners
           committee.committers = project.members
           # We create PMCs for now
           committee.is_podling = False
           committee_info = committees_by_name.get(name)
           if committee_info:
               committee.name = committee_info.display_name
   
           updated_count += 1
   
       return added_count, updated_count
   ```
   
   #### `atr/tasks/__init__.py` — `metadata_update` (lines 243-271)
   _currently does this_
   Queues the metadata update as a background task. The task result should 
ultimately contain the audit details.
   
   ```python
   async def metadata_update(
       asf_uid: str,
       caller_data: db.Session | None = None,
       schedule: datetime.datetime | None = None,
       schedule_next: bool = False,
   ) -> sql.Task:
       """Queue a metadata update task."""
       task_args = args.Update(asf_uid=asf_uid, next_schedule_seconds=0)
       if schedule_next:
           task_args.next_schedule_seconds = _DAILY
       async with db.ensure_session(caller_data) as data:
           task = sql.Task(
               status=sql.TaskStatus.QUEUED,
               task_type=sql.TaskType.METADATA_UPDATE,
               task_args=task_args.model_dump(),
               asf_uid=asf_uid,
               revision_number=None,
               primary_rel_path=None,
               project_key=None,
               version_key=None,
           )
           if schedule:
               task.scheduled = schedule
               await data.begin_immediate()
               await _clear_existing_scheduled(data, 
sql.TaskType.METADATA_UPDATE, asf_uid)
           data.add(task)
           await data.commit()
           await data.flush()
           return task
   ```
   
   #### `atr/tasks/maintenance.py` — `run` (lines 40-65)
   _extension point_
   Shows the pattern for background tasks returning structured results. The 
metadata update task should follow the same pattern but include audit details.
   
   ```python
   @checks.with_model(args.MaintenanceArgs)
   async def run(task_args: args.MaintenanceArgs) -> results.Results | None:
       """Run maintenance."""
       log.info(f"Starting maintenance (user: {task_args.asf_uid})")
   
       try:
           # We schedule first to keep the start time approximately consistent
           # (and knowing this task will finish before it's re-scheduled)
           await tasks.schedule_next(task_args.asf_uid, 
task_args.next_schedule_seconds, tasks.run_maintenance)
   
           await _expired_pats_maintenance()
           await _session_data_maintenance()
           await _storage_maintenance()
           await _workflow_ssh_keys_maintenance()
   
           log.info(
               "Storage maintenance completed successfully",
           )
   
           return results.Maintenance(
               kind="maintenance",
           )
       except Exception as e:
           error_msg = f"Unexpected error during maintenance: {e!s}"
           log.exception("Maintenance failed with unexpected error")
           raise MaintenanceError(error_msg) from e
   ```
   
   ### Where new code would go
   - `atr/datasources/apache.py` — after symbol update_metadata
     Add a dataclass or typed structure to hold detailed change records (e.g., 
MetadataUpdateEvent with committee_key, change_type, details, timestamp)
   - `atr/models/results.py` — new result type
     Add a MetadataUpdate result type that includes the list of changes made 
during the sync, following the existing Results pattern
   
   ### Proposed approach
   The implementation should capture what specifically changed during each 
metadata sync and record it in the task result (which is already persisted in 
the database). The approach has two parts:
   
   1. **Collect change details in `update_metadata()`**: Modify 
`_update_committees()`, `_update_podlings()`, and `_update_projects()` to 
return structured change records (e.g., 'committee X added', 'committee Y 
members changed from [a,b] to [a,b,c]'). Change the return type of 
`update_metadata()` from `tuple[int, int]` to include these details.
   
   2. **Record changes in the task result**: The `metadata.update` task handler 
(in `atr/tasks/metadata.py`) should store these change details in the task's 
`result` field using a new `results.MetadataUpdate` result type. This makes the 
audit trail queryable through the existing task infrastructure. Additionally, 
log each change at info level so it appears in the application logs.
   
   ### Suggested patches
   
   #### `atr/datasources/apache.py`
   Add change tracking to _update_committees and surface it from update_metadata
   
   ````diff
   --- a/atr/datasources/apache.py
   +++ b/atr/datasources/apache.py
   @@ -20,6 +20,7 @@
    
    from __future__ import annotations
    
   +import dataclasses
    import datetime
    from typing import TYPE_CHECKING, Annotated, Any, Final
    
   @@ -38,6 +39,18 @@
    import atr.models.sql as sql
    import atr.util as util
    
   +
   [email protected]
   +class MetadataChangeRecord:
   +    """A single change made during metadata sync."""
   +
   +    timestamp: datetime.datetime
   +    entity_type: str  # "committee", "podling", "project", "tooling"
   +    entity_key: str
   +    change_type: str  # "added", "updated"
   +    details: str  # Human-readable description of what changed
   +
   +
    _WHIMSY_COMMITTEE_INFO_URL: Final[str] = 
"https://whimsy.apache.org/public/committee-info.json";
    _WHIMSY_COMMITTEE_RETIRED_URL: Final[str] = 
"https://whimsy.apache.org/public/committee-retired.json";
    _WHIMSY_PROJECTS_URL: Final[str] = 
"https://whimsy.apache.org/public/public_ldap_projects.json";
   @@ -283,7 +296,7 @@
    
    
   -async def update_metadata() -> tuple[int, int]:
   +async def update_metadata() -> tuple[int, int, list[MetadataChangeRecord]]:
        """Update metadata from remote data sources."""
    
        ldap_projects = await get_ldap_projects_data()
   @@ -295,27 +308,31 @@
    
        added_count = 0
        updated_count = 0
   +    changes: list[MetadataChangeRecord] = []
    
        async with db.session() as data:
            async with data.begin():
   -            added, updated = await _update_committees(data, ldap_projects, 
committees_by_name)
   +            added, updated, committee_changes = await 
_update_committees(data, ldap_projects, committees_by_name)
                added_count += added
                updated_count += updated
   +            changes.extend(committee_changes)
    
   -            added, updated = await _update_podlings(data, podlings_data, 
ldap_projects_by_name)
   +            added, updated, podling_changes = await _update_podlings(data, 
podlings_data, ldap_projects_by_name)
                added_count += added
                updated_count += updated
   +            changes.extend(podling_changes)
    
   -            added, updated = await _update_projects(data, projects)
   +            added, updated, project_changes = await _update_projects(data, 
projects)
                added_count += added
                updated_count += updated
   +            changes.extend(project_changes)
    
                added, updated = await _update_tooling(data)
                added_count += added
                updated_count += updated
    
                added, updated = await _process_undiscovered(data)
                added_count += added
                updated_count += updated
    
   -    return added_count, updated_count
   +    return added_count, updated_count, changes
   ````
   
   #### `atr/datasources/apache.py`
   Update _update_committees to track and return detailed change records
   
   ````diff
   --- a/atr/datasources/apache.py
   +++ b/atr/datasources/apache.py
   @@ -355,9 +355,10 @@
    async def _update_committees(
        data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name: 
Mapping[str, Committee]
   -) -> tuple[int, int]:
   +) -> tuple[int, int, list[MetadataChangeRecord]]:
        added_count = 0
        updated_count = 0
   +    changes: list[MetadataChangeRecord] = []
   +    now = datetime.datetime.now(datetime.UTC)
    
        # First create PMC committees
        for project in ldap_projects.projects:
   @@ -369,8 +370,21 @@
            committee = await data.committee(key=name).get()
            if not committee:
                committee = sql.Committee(key=name)
                data.add(committee)
                added_count += 1
   +            changes.append(MetadataChangeRecord(
   +                timestamp=now,
   +                entity_type="committee",
   +                entity_key=name,
   +                change_type="added",
   +                details=f"New committee '{name}' discovered from LDAP",
   +            ))
            else:
   +            # Track membership changes
   +            old_members = set(committee.committee_members)
   +            new_members = set(project.owners)
   +            if old_members != new_members:
   +                added_members = new_members - old_members
   +                removed_members = old_members - new_members
   +                detail_parts = []
   +                if added_members:
   +                    detail_parts.append(f"members added: {', 
'.join(sorted(added_members))}")
   +                if removed_members:
   +                    detail_parts.append(f"members removed: {', 
'.join(sorted(removed_members))}")
   +                changes.append(MetadataChangeRecord(
   +                    timestamp=now,
   +                    entity_type="committee",
   +                    entity_key=name,
   +                    change_type="updated",
   +                    details=f"Committee members changed: {'; 
'.join(detail_parts)}",
   +                ))
                updated_count += 1
    
            committee.committee_members = project.owners
   @@ -383,7 +397,7 @@
    
            updated_count += 1
    
   -    return added_count, updated_count
   +    return added_count, updated_count, changes
   ````
   
   ### Open questions
   - What does `atr/tasks/metadata.py` look like? It's the task handler that 
calls `update_metadata()` and would need to be updated to handle the new return 
value and store change records in the task result.
   - Does a `results.MetadataUpdate` result type already exist in 
`atr/models/results.py`, or does it need to be created to persist the change 
records?
   - Should `_update_podlings()` and `_update_projects()` also be updated to 
return change records (similar to the `_update_committees` pattern shown)?
   - Is there a dedicated audit log system beyond task results, or should the 
task result field be the primary audit mechanism?
   - PR #449 by @developer-ravi-03 was mentioned — is that PR related to this 
issue or was it a separate contribution?
   
   ### Files examined
   - `atr/datasources/apache.py`
   - `atr/db/__init__.py`
   - `atr/models/sql.py`
   - `atr/tasks/__init__.py`
   - `atr/tasks/maintenance.py`
   - `tests/unit/datasources/test_apache.py`
   - `tests/unit/datasources/testdata/committees.json`
   - `atr/db/interaction.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to