asf-tooling commented on issue #468:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/468#issuecomment-4410266754

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `refactor`  •  **Classification:** `actionable`  •  
**Confidence:** `high`
   **Application domain(s):** `project_committee_management`, 
`shared_infrastructure`
   
   ### Summary
   The issue requests splitting the single `update_metadata()` function in 
`atr/datasources/apache.py` into two separate update operations: (1) an 
automatic update that only uses Whimsy data (committee info, retired 
committees, LDAP projects), and (2) a manual admin-triggered update that uses 
projects.apache.org data (projects, podlings, groups). The current code 
combines all data sources into one monolithic function. All referenced code 
still exists and is unchanged.
   
   ### Where this lives in the code today
   
   #### `atr/datasources/apache.py` — `update_metadata` (lines 298-334)
   _needs modification_
   This is the monolithic function that needs to be split into Whimsy-only 
(auto) and projects.apache.org (manual admin) updates.
   
   ```python
   async def update_metadata() -> tuple[int, int]:
       """Update metadata from remote data sources."""
   
       ldap_projects = await get_ldap_projects_data()
       projects = await get_projects_data()
       podlings_data = await get_current_podlings_data()
       committees = await get_active_committee_data()
   
       ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in 
ldap_projects.projects}
       committees_by_name: Mapping[str, Committee] = {c.name: c for c in 
committees.committees}
   
       added_count = 0
       updated_count = 0
   
       async with db.session() as data:
           async with data.begin():
               added, updated = await _update_committees(data, ldap_projects, 
committees_by_name)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_podlings(data, podlings_data, 
ldap_projects_by_name)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_projects(data, projects)
               added_count += added
               updated_count += updated
   
               added, updated = await _update_tooling(data)
               added_count += added
               updated_count += updated
   
               added, updated = await _process_undiscovered(data)
               added_count += added
               updated_count += updated
   
       return added_count, updated_count
   ```
   
   #### `atr/datasources/apache.py` — `_update_committees` (lines 374-406)
   _currently does this_
   Uses Whimsy data only (LDAPProjectsData + CommitteeData), should remain in 
the automatic update.
   
   ```python
   async def _update_committees(
       data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name: 
Mapping[str, Committee]
   ) -> tuple[int, int]:
       added_count = 0
       updated_count = 0
   
       # First create PMC committees
       for project in ldap_projects.projects:
           name = project.name
           # Skip non-PMC committees
           if project.pmc is not True:
               continue
   
           # Get or create PMC
           committee = await data.committee(key=name).get()
           if not committee:
               committee = sql.Committee(key=name)
               data.add(committee)
               added_count += 1
           else:
               updated_count += 1
   
           committee.committee_members = project.owners
           committee.committers = project.members
           # We create PMCs for now
           committee.is_podling = False
           committee_info = committees_by_name.get(name)
           if committee_info:
               committee.name = committee_info.display_name
   
           updated_count += 1
   
       return added_count, updated_count
   ```
   
   #### `atr/datasources/apache.py` — `_update_podlings` (lines 409-448)
   _needs modification_
   Uses PodlingsData from projects.apache.org mixed with LDAP data from Whimsy; 
should be moved to the manual admin update per the issue.
   
   ```python
   async def _update_podlings(
       data: db.Session, podlings_data: PodlingsData, ldap_projects_by_name: 
Mapping[str, LDAPProject]
   ) -> tuple[int, int]:
       added_count = 0
       updated_count = 0
   
       # Then add PPMCs and their associated project (podlings)
       for podling_name, podling_data in podlings_data:
           # Get or create PPMC
           ppmc = await data.committee(key=podling_name).get()
           if not ppmc:
               ppmc = sql.Committee(key=podling_name, is_podling=True)
               data.add(ppmc)
               added_count += 1
           else:
               updated_count += 1
   
           # We create a PPMC
           ppmc.is_podling = True
           ppmc.name = 
podling_data.name.removesuffix("(Incubating)").removeprefix("Apache").strip()
           podling_project = ldap_projects_by_name.get(podling_name)
           if podling_project is not None:
               ppmc.committee_members = podling_project.owners
               ppmc.committers = podling_project.members
           else:
               log.warning(f"could not find ldap data for podling 
{podling_name}")
   
           podling = await data.project(key=podling_name).get()
           if not podling:
               # Create the associated podling project
               podling = sql.Project(key=podling_name, name=podling_data.name, 
committee=ppmc)
               data.add(podling)
               added_count += 1
           else:
               updated_count += 1
   
           podling.name = podling_data.name.removesuffix(" (Incubating)")
           podling.committee = ppmc
   
       return added_count, updated_count
   ```
   
   #### `atr/datasources/apache.py` — `_update_projects` (lines 453-460)
   _needs modification_
   Uses ProjectsData from projects.apache.org; should be moved to the manual 
admin update.
   
   ```python
   async def _update_projects(data: db.Session, projects: ProjectsData) -> 
tuple[int, int]:
       added_count = 0
       updated_count = 0
   
       # Add projects and associate them with the right PMC
       for project_key, project_status in projects.items():
           # FIXME: this is a quick workaround for inconsistent data wrt 
webservices PMC / projects
           ...
   ```
   
   #### `atr/datasources/apache.py` — `_update_tooling` (lines 506-535)
   _currently does this_
   Uses LDAP data, should remain in the automatic update.
   
   ```python
   async def _update_tooling(data: db.Session) -> tuple[int, int]:
       added_count = 0
       updated_count = 0
   
       # Tooling is not a committee
       # We add a special entry for Tooling, pretending to be a PMC, for 
debugging and testing
       tooling_committee = await data.committee(key="tooling").get()
       if not tooling_committee:
           tooling_committee = sql.Committee(key="tooling", name="Tooling")
           data.add(tooling_committee)
           tooling_project = sql.Project(key="tooling", name="Apache Tooling", 
committee=tooling_committee)
           data.add(tooling_project)
           added_count += 1
       else:
           updated_count += 1
   
       additional = config.get().TOOLING_USERS_ADDITIONAL
       if additional:
           extra = set(additional.split(","))
       else:
           extra = set()
   
       # Update Tooling PMC data
       tooling_users = list(await ldap.fetch_tooling_users(extra))
       tooling_committee.committee_members = tooling_users
       tooling_committee.committers = tooling_users
       tooling_committee.release_managers = tooling_users
       tooling_committee.is_podling = False
   
       return added_count, updated_count
   ```
   
   #### `atr/models/sql.py` — `TaskType` (lines 203-215)
   _needs modification_
   The METADATA_UPDATE task type may need a companion METADATA_UPDATE_PROJECTS 
for the manual admin trigger, or the existing task could be parameterized.
   
   ```python
   class TaskType(enum.StrEnum):
       COMPARE_SOURCE_TREES = "compare_source_trees"
       DISTRIBUTION_STATUS = "distribution_status"
       DISTRIBUTION_WORKFLOW = "distribution_workflow"
       HASHING_CHECK = "hashing_check"
       KEYS_IMPORT_FILE = "keys_import_file"
       LICENSE_FILES = "license_files"
       LICENSE_HEADERS = "license_headers"
       MAINTENANCE = "maintenance"
       MESSAGE_SEND = "message_send"
       METADATA_UPDATE = "metadata_update"
       PATHS_CHECK = "paths_check"
       ...
   ```
   
   ### Where new code would go
   - `atr/datasources/apache.py` — after symbol update_metadata
     Add a new `update_project_metadata()` function for the manual 
admin-triggered update of projects.apache.org data, alongside the refactored 
`update_committee_metadata()` for the auto-update.
   - `atr/models/sql.py` — after symbol TaskType.METADATA_UPDATE
     Potentially add a new METADATA_UPDATE_PROJECTS task type for the manual 
admin trigger.
   
   ### Proposed approach
   Split `update_metadata()` into two functions: `update_committee_metadata()` 
(automatic, Whimsy-only) and `update_project_metadata()` (manual admin). The 
committee update fetches from Whimsy URLs (`_WHIMSY_COMMITTEE_INFO_URL`, 
`_WHIMSY_PROJECTS_URL`) and calls `_update_committees` and `_update_tooling`. 
The project update fetches from projects.apache.org URLs 
(`_PROJECTS_PROJECTS_URL`, `_PROJECTS_PODLINGS_URL`) and calls 
`_update_podlings` and `_update_projects`. Keep the original 
`update_metadata()` as a convenience that calls both (for backward 
compatibility), but callers that currently trigger the automatic update should 
be changed to call only `update_committee_metadata()`. A new `TaskType` entry 
(e.g., `METADATA_UPDATE_PROJECTS`) should be added for the admin-triggered 
project update. The admin route that triggers this new task type would need to 
be added (likely in `atr/admin/`), as referenced in issue #443.
   
   ### Suggested patches
   
   #### `atr/datasources/apache.py`
   Split update_metadata into update_committee_metadata (auto) and 
update_project_metadata (manual admin), keeping the original as a combined 
convenience function.
   
   ````diff
   --- a/atr/datasources/apache.py
   +++ b/atr/datasources/apache.py
   @@ -283,8 +283,8 @@
        return RetiredCommitteeData.model_validate(data)
    
    
   -async def update_metadata() -> tuple[int, int]:
   -    """Update metadata from remote data sources."""
   +async def update_committee_metadata() -> tuple[int, int]:
   +    """Update committee metadata from Whimsy data sources (automatic)."""
    
        ldap_projects = await get_ldap_projects_data()
   -    projects = await get_projects_data()
   -    podlings_data = await get_current_podlings_data()
        committees = await get_active_committee_data()
    
   -    ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in 
ldap_projects.projects}
        committees_by_name: Mapping[str, Committee] = {c.name: c for c in 
committees.committees}
    
        added_count = 0
   @@ -296,12 +293,6 @@
                added_count += added
                updated_count += updated
    
   -            added, updated = await _update_podlings(data, podlings_data, 
ldap_projects_by_name)
   -            added_count += added
   -            updated_count += updated
   -
   -            added, updated = await _update_projects(data, projects)
   -            added_count += added
   -            updated_count += updated
   -
                added, updated = await _update_tooling(data)
                added_count += added
                updated_count += updated
   @@ -313,6 +304,36 @@
        return added_count, updated_count
    
    
   +async def update_project_metadata() -> tuple[int, int]:
   +    """Update project metadata from projects.apache.org data sources 
(manual admin)."""
   +
   +    ldap_projects = await get_ldap_projects_data()
   +    projects = await get_projects_data()
   +    podlings_data = await get_current_podlings_data()
   +
   +    ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in 
ldap_projects.projects}
   +
   +    added_count = 0
   +    updated_count = 0
   +
   +    async with db.session() as data:
   +        async with data.begin():
   +            added, updated = await _update_podlings(data, podlings_data, 
ldap_projects_by_name)
   +            added_count += added
   +            updated_count += updated
   +
   +            added, updated = await _update_projects(data, projects)
   +            added_count += added
   +            updated_count += updated
   +
   +    return added_count, updated_count
   +
   +
   +async def update_metadata() -> tuple[int, int]:
   +    """Update all metadata from remote data sources (both Whimsy and 
projects.apache.org)."""
   +    added_c, updated_c = await update_committee_metadata()
   +    added_p, updated_p = await update_project_metadata()
   +    return added_c + added_p, updated_c + updated_p
   +
   +
    async def _process_undiscovered(data: db.Session) -> tuple[int, int]:
        added_count = 0
        updated_count = 0
   ````
   
   #### `atr/models/sql.py`
   Add a new METADATA_UPDATE_PROJECTS task type for the manual admin-triggered 
project metadata update.
   
   ````diff
   --- a/atr/models/sql.py
   +++ b/atr/models/sql.py
   @@ -192,6 +192,7 @@
        MESSAGE_SEND = "message_send"
        METADATA_UPDATE = "metadata_update"
   +    METADATA_UPDATE_PROJECTS = "metadata_update_projects"
        PATHS_CHECK = "paths_check"
        QUARANTINE_VALIDATE = "quarantine_validate"
   ````
   
   ### Open questions
   - Where is the task dispatcher that maps TaskType.METADATA_UPDATE to the 
actual `update_metadata()` call? The automatic scheduler should be updated to 
call `update_committee_metadata()` instead.
   - Issue #443 is referenced for the admin-triggered manual update - need to 
check if an admin route already exists or needs to be created for triggering 
METADATA_UPDATE_PROJECTS.
   - Should `_process_undiscovered` remain in `update_committee_metadata()` or 
move to `update_project_metadata()` or both? It checks for committees without 
projects, so logically it depends on projects being populated.
   - The `_update_podlings` function uses both Whimsy LDAP data and 
projects.apache.org podlings data - confirming it belongs in the manual admin 
update as proposed.
   - A database migration may be needed if the new TaskType enum value is 
stored in the database.
   
   ### Files examined
   - `atr/datasources/apache.py`
   - `atr/models/sql.py`
   - `tests/unit/datasources/test_apache.py`
   - `atr/db/__init__.py`
   - `atr/storage/__init__.py`
   - `atr/config.py`
   - `atr/principal.py`
   - `atr/server.py`
   
   ### Related issues
   This issue appears related to: #465, #462.
   
   _All address data model issues with committee and project metadata schema_
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to