asf-tooling commented on issue #468:
URL:
https://github.com/apache/tooling-trusted-releases/issues/468#issuecomment-4410266754
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@2da7807a`
**Type:** `refactor` • **Classification:** `actionable` •
**Confidence:** `high`
**Application domain(s):** `project_committee_management`,
`shared_infrastructure`
### Summary
The issue requests splitting the single `update_metadata()` function in
`atr/datasources/apache.py` into two separate update operations: (1) an
automatic update that only uses Whimsy data (committee info, retired
committees, LDAP projects), and (2) a manual admin-triggered update that uses
projects.apache.org data (projects, podlings, groups). The current code
combines all data sources into one monolithic function. All referenced code
still exists and is unchanged.
### Where this lives in the code today
#### `atr/datasources/apache.py` — `update_metadata` (lines 298-334)
_needs modification_
This is the monolithic function that needs to be split into Whimsy-only
(auto) and projects.apache.org (manual admin) updates.
```python
async def update_metadata() -> tuple[int, int]:
"""Update metadata from remote data sources."""
ldap_projects = await get_ldap_projects_data()
projects = await get_projects_data()
podlings_data = await get_current_podlings_data()
committees = await get_active_committee_data()
ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in
ldap_projects.projects}
committees_by_name: Mapping[str, Committee] = {c.name: c for c in
committees.committees}
added_count = 0
updated_count = 0
async with db.session() as data:
async with data.begin():
added, updated = await _update_committees(data, ldap_projects,
committees_by_name)
added_count += added
updated_count += updated
added, updated = await _update_podlings(data, podlings_data,
ldap_projects_by_name)
added_count += added
updated_count += updated
added, updated = await _update_projects(data, projects)
added_count += added
updated_count += updated
added, updated = await _update_tooling(data)
added_count += added
updated_count += updated
added, updated = await _process_undiscovered(data)
added_count += added
updated_count += updated
return added_count, updated_count
```
#### `atr/datasources/apache.py` — `_update_committees` (lines 374-406)
_currently does this_
Uses Whimsy data only (LDAPProjectsData + CommitteeData), should remain in
the automatic update.
```python
async def _update_committees(
data: db.Session, ldap_projects: LDAPProjectsData, committees_by_name:
Mapping[str, Committee]
) -> tuple[int, int]:
added_count = 0
updated_count = 0
# First create PMC committees
for project in ldap_projects.projects:
name = project.name
# Skip non-PMC committees
if project.pmc is not True:
continue
# Get or create PMC
committee = await data.committee(key=name).get()
if not committee:
committee = sql.Committee(key=name)
data.add(committee)
added_count += 1
else:
updated_count += 1
committee.committee_members = project.owners
committee.committers = project.members
# We create PMCs for now
committee.is_podling = False
committee_info = committees_by_name.get(name)
if committee_info:
committee.name = committee_info.display_name
updated_count += 1
return added_count, updated_count
```
#### `atr/datasources/apache.py` — `_update_podlings` (lines 409-448)
_needs modification_
Uses PodlingsData from projects.apache.org mixed with LDAP data from Whimsy;
should be moved to the manual admin update per the issue.
```python
async def _update_podlings(
data: db.Session, podlings_data: PodlingsData, ldap_projects_by_name:
Mapping[str, LDAPProject]
) -> tuple[int, int]:
added_count = 0
updated_count = 0
# Then add PPMCs and their associated project (podlings)
for podling_name, podling_data in podlings_data:
# Get or create PPMC
ppmc = await data.committee(key=podling_name).get()
if not ppmc:
ppmc = sql.Committee(key=podling_name, is_podling=True)
data.add(ppmc)
added_count += 1
else:
updated_count += 1
# We create a PPMC
ppmc.is_podling = True
ppmc.name =
podling_data.name.removesuffix("(Incubating)").removeprefix("Apache").strip()
podling_project = ldap_projects_by_name.get(podling_name)
if podling_project is not None:
ppmc.committee_members = podling_project.owners
ppmc.committers = podling_project.members
else:
log.warning(f"could not find ldap data for podling
{podling_name}")
podling = await data.project(key=podling_name).get()
if not podling:
# Create the associated podling project
podling = sql.Project(key=podling_name, name=podling_data.name,
committee=ppmc)
data.add(podling)
added_count += 1
else:
updated_count += 1
podling.name = podling_data.name.removesuffix(" (Incubating)")
podling.committee = ppmc
return added_count, updated_count
```
#### `atr/datasources/apache.py` — `_update_projects` (lines 453-460)
_needs modification_
Uses ProjectsData from projects.apache.org; should be moved to the manual
admin update.
```python
async def _update_projects(data: db.Session, projects: ProjectsData) ->
tuple[int, int]:
added_count = 0
updated_count = 0
# Add projects and associate them with the right PMC
for project_key, project_status in projects.items():
# FIXME: this is a quick workaround for inconsistent data wrt
webservices PMC / projects
...
```
#### `atr/datasources/apache.py` — `_update_tooling` (lines 506-535)
_currently does this_
Uses LDAP data, should remain in the automatic update.
```python
async def _update_tooling(data: db.Session) -> tuple[int, int]:
added_count = 0
updated_count = 0
# Tooling is not a committee
# We add a special entry for Tooling, pretending to be a PMC, for
debugging and testing
tooling_committee = await data.committee(key="tooling").get()
if not tooling_committee:
tooling_committee = sql.Committee(key="tooling", name="Tooling")
data.add(tooling_committee)
tooling_project = sql.Project(key="tooling", name="Apache Tooling",
committee=tooling_committee)
data.add(tooling_project)
added_count += 1
else:
updated_count += 1
additional = config.get().TOOLING_USERS_ADDITIONAL
if additional:
extra = set(additional.split(","))
else:
extra = set()
# Update Tooling PMC data
tooling_users = list(await ldap.fetch_tooling_users(extra))
tooling_committee.committee_members = tooling_users
tooling_committee.committers = tooling_users
tooling_committee.release_managers = tooling_users
tooling_committee.is_podling = False
return added_count, updated_count
```
#### `atr/models/sql.py` — `TaskType` (lines 203-215)
_needs modification_
The METADATA_UPDATE task type may need a companion METADATA_UPDATE_PROJECTS
for the manual admin trigger, or the existing task could be parameterized.
```python
class TaskType(enum.StrEnum):
COMPARE_SOURCE_TREES = "compare_source_trees"
DISTRIBUTION_STATUS = "distribution_status"
DISTRIBUTION_WORKFLOW = "distribution_workflow"
HASHING_CHECK = "hashing_check"
KEYS_IMPORT_FILE = "keys_import_file"
LICENSE_FILES = "license_files"
LICENSE_HEADERS = "license_headers"
MAINTENANCE = "maintenance"
MESSAGE_SEND = "message_send"
METADATA_UPDATE = "metadata_update"
PATHS_CHECK = "paths_check"
...
```
### Where new code would go
- `atr/datasources/apache.py` — after symbol update_metadata
Add a new `update_project_metadata()` function for the manual
admin-triggered update of projects.apache.org data, alongside the refactored
`update_committee_metadata()` for the auto-update.
- `atr/models/sql.py` — after symbol TaskType.METADATA_UPDATE
Potentially add a new METADATA_UPDATE_PROJECTS task type for the manual
admin trigger.
### Proposed approach
Split `update_metadata()` into two functions: `update_committee_metadata()`
(automatic, Whimsy-only) and `update_project_metadata()` (manual admin). The
committee update fetches from Whimsy URLs (`_WHIMSY_COMMITTEE_INFO_URL`,
`_WHIMSY_PROJECTS_URL`) and calls `_update_committees` and `_update_tooling`.
The project update fetches from projects.apache.org URLs
(`_PROJECTS_PROJECTS_URL`, `_PROJECTS_PODLINGS_URL`) and calls
`_update_podlings` and `_update_projects`. Keep the original
`update_metadata()` as a convenience that calls both (for backward
compatibility), but callers that currently trigger the automatic update should
be changed to call only `update_committee_metadata()`. A new `TaskType` entry
(e.g., `METADATA_UPDATE_PROJECTS`) should be added for the admin-triggered
project update. The admin route that triggers this new task type would need to
be added (likely in `atr/admin/`), as referenced in issue #443.
### Suggested patches
#### `atr/datasources/apache.py`
Split update_metadata into update_committee_metadata (auto) and
update_project_metadata (manual admin), keeping the original as a combined
convenience function.
````diff
--- a/atr/datasources/apache.py
+++ b/atr/datasources/apache.py
@@ -283,8 +283,8 @@
return RetiredCommitteeData.model_validate(data)
-async def update_metadata() -> tuple[int, int]:
- """Update metadata from remote data sources."""
+async def update_committee_metadata() -> tuple[int, int]:
+ """Update committee metadata from Whimsy data sources (automatic)."""
ldap_projects = await get_ldap_projects_data()
- projects = await get_projects_data()
- podlings_data = await get_current_podlings_data()
committees = await get_active_committee_data()
- ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in
ldap_projects.projects}
committees_by_name: Mapping[str, Committee] = {c.name: c for c in
committees.committees}
added_count = 0
@@ -296,12 +293,6 @@
added_count += added
updated_count += updated
- added, updated = await _update_podlings(data, podlings_data,
ldap_projects_by_name)
- added_count += added
- updated_count += updated
-
- added, updated = await _update_projects(data, projects)
- added_count += added
- updated_count += updated
-
added, updated = await _update_tooling(data)
added_count += added
updated_count += updated
@@ -313,6 +304,36 @@
return added_count, updated_count
+async def update_project_metadata() -> tuple[int, int]:
+ """Update project metadata from projects.apache.org data sources
(manual admin)."""
+
+ ldap_projects = await get_ldap_projects_data()
+ projects = await get_projects_data()
+ podlings_data = await get_current_podlings_data()
+
+ ldap_projects_by_name: Mapping[str, LDAPProject] = {p.name: p for p in
ldap_projects.projects}
+
+ added_count = 0
+ updated_count = 0
+
+ async with db.session() as data:
+ async with data.begin():
+ added, updated = await _update_podlings(data, podlings_data,
ldap_projects_by_name)
+ added_count += added
+ updated_count += updated
+
+ added, updated = await _update_projects(data, projects)
+ added_count += added
+ updated_count += updated
+
+ return added_count, updated_count
+
+
+async def update_metadata() -> tuple[int, int]:
+ """Update all metadata from remote data sources (both Whimsy and
projects.apache.org)."""
+ added_c, updated_c = await update_committee_metadata()
+ added_p, updated_p = await update_project_metadata()
+ return added_c + added_p, updated_c + updated_p
+
+
async def _process_undiscovered(data: db.Session) -> tuple[int, int]:
added_count = 0
updated_count = 0
````
#### `atr/models/sql.py`
Add a new METADATA_UPDATE_PROJECTS task type for the manual admin-triggered
project metadata update.
````diff
--- a/atr/models/sql.py
+++ b/atr/models/sql.py
@@ -192,6 +192,7 @@
MESSAGE_SEND = "message_send"
METADATA_UPDATE = "metadata_update"
+ METADATA_UPDATE_PROJECTS = "metadata_update_projects"
PATHS_CHECK = "paths_check"
QUARANTINE_VALIDATE = "quarantine_validate"
````
### Open questions
- Where is the task dispatcher that maps TaskType.METADATA_UPDATE to the
actual `update_metadata()` call? The automatic scheduler should be updated to
call `update_committee_metadata()` instead.
- Issue #443 is referenced for the admin-triggered manual update - need to
check if an admin route already exists or needs to be created for triggering
METADATA_UPDATE_PROJECTS.
- Should `_process_undiscovered` remain in `update_committee_metadata()` or
move to `update_project_metadata()` or both? It checks for committees without
projects, so logically it depends on projects being populated.
- The `_update_podlings` function uses both Whimsy LDAP data and
projects.apache.org podlings data - confirming it belongs in the manual admin
update as proposed.
- A database migration may be needed if the new TaskType enum value is
stored in the database.
### Files examined
- `atr/datasources/apache.py`
- `atr/models/sql.py`
- `tests/unit/datasources/test_apache.py`
- `atr/db/__init__.py`
- `atr/storage/__init__.py`
- `atr/config.py`
- `atr/principal.py`
- `atr/server.py`
### Related issues
This issue appears related to: #465, #462.
_All address data model issues with committee and project metadata schema_
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]