asf-tooling commented on issue #478: URL: https://github.com/apache/tooling-trusted-releases/issues/478#issuecomment-4410221541
<!-- gofannon-issue-triage-bot v2 -->

**Automated triage** - analyzed at `main@2da7807a`

**Type:** `new_feature` • **Classification:** `actionable` • **Confidence:** `medium`
**Application domain(s):** `project_committee_management`, `release_lifecycle`, `shared_infrastructure`

### Summary

Issue #478 requests building a system to catalog each PMC's projects and release history from four data sources: SVN dist repo, DOAP files, reporter.apache.org, and [email protected] mboxes. Foundational code already exists: the `atr.datasources.apache` module provides models for parsing ASF committee/project data (including DOAP-derived releases), and `atr.analysis` provides release path parsing. However, the actual cataloging workflow (combining these sources, storing release history per PMC, and providing a manual per-PMC update process) does not yet exist. No prior discussion or comments on this issue exist to guide implementation decisions.

### Where this lives in the code today

#### `tests/unit/datasources/test_apache.py`: `test_projects_data_model` (lines 72-76)

_currently does this_

Shows that ProjectsData already parses DOAP-derived project data including release info (revision, date, name), which is data source #2 in the issue.

```python
def test_projects_data_model():
    projects = ProjectsData.model_validate(_load_test_data("projects"))
    assert len(projects) == 1
    assert projects.get("accumulo") is not None
```

#### `tests/unit/datasources/test_apache.py`: `test_committee_data_model` (lines 32-44)

_currently does this_

Shows CommitteeData already models PMC membership which is needed to associate releases with PMCs.
```python
def test_committee_data_model():
    committees = CommitteeData.model_validate(_load_test_data("committees"))
    assert committees is not None
    assert committees.pmc_count == 1
    tooling = committees.committees[0]
    assert tooling.name == "tooling"
    assert len(tooling.roster) == 3
    assert "tn" in map(lambda x: x.id, tooling.roster)
    assert len(tooling.chair) == 1
    assert "wave" in map(lambda x: x.id, tooling.chair)
```

### Where new code would go

- `atr/catalog/__init__.py` (new file)
  New module to implement the per-PMC release cataloging logic, combining data from multiple sources (SVN paths via atr.analysis, DOAP via ProjectsData, reporter.apache.org API, announce mbox parsing).
- `atr/models/catalog.py` (new file)
  SQL models for storing cataloged release history per PMC, linking discovered releases to projects and tracking which data source confirmed each release.
- `atr/datasources/reporter.py` (new file)
  Client for fetching self-reported release data from reporter.apache.org (data source #3).
- `atr/datasources/announce.py` (new file)
  Parser for release announcements from [email protected] mbox archives via lists.apache.org (data source #4).
- `scripts/catalog_pmc.py` (new file)
  Manual CLI script to trigger catalog reconstruction for a single PMC, as the issue specifies the update process should be manual and per-PMC.

### Proposed approach

The implementation should create a cataloging subsystem that aggregates release data from the four specified sources into a unified per-PMC release history. The existing `atr.datasources.apache.ProjectsData` model already handles DOAP-sourced releases (source #2), and `atr.analysis` handles SVN dist path parsing (source #1). New modules are needed for reporter.apache.org (source #3) and [email protected] mbox parsing (source #4). A new SQL model (`CatalogedRelease` or similar) should store discovered releases with provenance tracking (which source(s) confirmed each release).
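
The provenance tracking described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Source` enum, the plain-dataclass `CatalogedRelease`, and the `merge_sightings` helper are hypothetical names, not existing ATR code, and a real implementation would presumably be a SQL model rather than an in-memory dataclass. The input tuples in the usage note are made-up example data.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    # The four data sources named in the issue (enum names are hypothetical).
    SVN_DIST = "svn_dist"
    DOAP = "doap"
    REPORTER = "reporter"
    ANNOUNCE = "announce"


@dataclass
class CatalogedRelease:
    # Hypothetical in-memory stand-in for the proposed SQL model.
    committee: str
    project: str
    version: str
    # Provenance: every source in which this release was seen.
    sources: set[Source] = field(default_factory=set)


def merge_sightings(sightings):
    """Fold (committee, project, version, source) tuples into one record per release."""
    catalog: dict[tuple[str, str, str], CatalogedRelease] = {}
    for committee, project, version, source in sightings:
        key = (committee, project, version)
        if key not in catalog:
            catalog[key] = CatalogedRelease(committee, project, version)
        # Record that this source confirmed the release.
        catalog[key].sources.add(source)
    # Dicts preserve insertion order, so releases come back in first-seen order.
    return list(catalog.values())
```

For example, feeding it `("tooling", "example", "1.0.0", Source.DOAP)` and `("tooling", "example", "1.0.0", Source.SVN_DIST)` would yield a single record whose `sources` contains both entries. Version strings are compared verbatim here; normalizing them across sources is a separate step.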
A CLI script (`scripts/catalog_pmc.py`) should orchestrate the manual per-PMC workflow: fetch data from all available sources for a given PMC, reconcile/deduplicate releases using version pattern matching, and persist the results. The reconciliation logic should use 'permissible version patterns' (as mentioned in the issue) to normalize version strings across sources. Since this is explicitly manual and per-PMC, no background worker integration is needed initially.

### Open questions

- What is the full implementation of `atr/analysis.py` (referenced by `scripts/release_path_parse.py`)? Understanding its current capabilities is needed before extending it.
- What is the full implementation of `atr/datasources/apache.py`? The test shows model classes but the actual source would reveal how data is fetched and what release fields are already modeled.
- Does the existing database schema (`atr/models/sql.py`) already have any table for tracking historical releases, or would an entirely new table be needed?
- What format does reporter.apache.org expose its release dataset in? Is there an API or static JSON?
- What are the 'permissible version patterns' referenced in the issue: should these be configurable per PMC or is there a standard ASF version regex?
- Should the cataloged data be exposed in the web UI, or is this purely a data-collection feature for now?

### Files examined

- `tests/unit/datasources/test_apache.py`
- `tests/unit/datasources/testdata/committees.json`
- `tests/unit/datasources/testdata/groups.json`
- `tests/unit/datasources/testdata/ldap_projects.json`
- `tests/unit/datasources/testdata/podlings.json`
- `tests/unit/datasources/testdata/projects.json`
- `tests/unit/datasources/testdata/retired_committees.json`
- `scripts/release_path_parse.py`

### Related issues

This issue appears related to: #479. _Both address cataloging and watching SVN dist releases for projects_

---

*Draft from a triage agent. A human reviewer should validate before merging any change.
The agent did not run tests or verify diffs apply.*

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
