asf-tooling commented on issue #478:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/478#issuecomment-4410221541

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `project_committee_management`, 
`release_lifecycle`, `shared_infrastructure`
   
   ### Summary
   Issue #478 requests building a system to catalog each PMC's projects and 
release history from four data sources: SVN dist repo, DOAP files, 
reporter.apache.org, and the announce mailing list mboxes. Foundational code already 
exists: the `atr.datasources.apache` module provides models for parsing ASF 
committee/project data (including DOAP-derived releases), and `atr.analysis` 
provides release path parsing. However, the actual cataloging 
workflow—combining these sources, storing release history per PMC, and 
providing a manual per-PMC update process—does not yet exist. No prior 
discussion or comments on this issue exist to guide implementation decisions.
   
   ### Where this lives in the code today
   
   #### `tests/unit/datasources/test_apache.py` — `test_projects_data_model` 
(lines 72-76)
   _currently does this_
   Shows that `ProjectsData` already parses DOAP-derived project data, including 
release info (revision, date, name); DOAP is data source #2 in the issue.
   
   ```python
   def test_projects_data_model():
       projects = ProjectsData.model_validate(_load_test_data("projects"))
   
       assert len(projects) == 1
       assert projects.get("accumulo") is not None
   ```
   
   #### `tests/unit/datasources/test_apache.py` — `test_committee_data_model` 
(lines 32-44)
   _currently does this_
   Shows that `CommitteeData` already models PMC membership, which is needed to 
associate releases with PMCs.
   
   ```python
   def test_committee_data_model():
       committees = CommitteeData.model_validate(_load_test_data("committees"))
   
       assert committees is not None
       assert committees.pmc_count == 1
   
       tooling = committees.committees[0]
       assert tooling.name == "tooling"
       assert len(tooling.roster) == 3
       assert "tn" in map(lambda x: x.id, tooling.roster)
   
       assert len(tooling.chair) == 1
       assert "wave" in map(lambda x: x.id, tooling.chair)
   ```
   
   ### Where new code would go
   - `atr/catalog/__init__.py` — new file
     New module to implement the per-PMC release cataloging logic, combining 
data from multiple sources (SVN paths via atr.analysis, DOAP via ProjectsData, 
reporter.apache.org API, announce mbox parsing).
   - `atr/models/catalog.py` — new file
     SQL models for storing cataloged release history per PMC, linking 
discovered releases to projects and tracking which data source confirmed each 
release.
   - `atr/datasources/reporter.py` — new file
     Client for fetching self-reported release data from reporter.apache.org 
(data source #3).
   - `atr/datasources/announce.py` — new file
     Parser for release announcements from the announce list mbox archives 
via lists.apache.org (data source #4).
   - `scripts/catalog_pmc.py` — new file
     Manual CLI script to trigger catalog reconstruction for a single PMC, as 
the issue specifies the update process should be manual and per-PMC.
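
   A minimal sketch of what the manual per-PMC entry point might look like. The
   flag names (`--source`, `--dry-run`) and the function names are hypothetical;
   nothing here exists in the repository, and the fetch, reconcile, and persist
   steps are left as a placeholder.

```python
import argparse
from typing import List, Optional

# Hypothetical short names for the four data sources listed in the issue.
ALL_SOURCES = ["svn", "doap", "reporter", "announce"]


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for a hypothetical per-PMC catalog script."""
    parser = argparse.ArgumentParser(
        prog="catalog_pmc",
        description="Rebuild the release catalog for a single PMC (sketch).",
    )
    # One PMC per invocation keeps the process manual and per-PMC.
    parser.add_argument("pmc", help="committee name to catalog, e.g. 'tooling'")
    # Assumed flag: restrict the run to a subset of the data sources.
    parser.add_argument(
        "--source",
        action="append",
        choices=ALL_SOURCES,
        help="data source to consult (repeatable; default: all)",
    )
    # Assumed flag: report what would be done without persisting anything.
    parser.add_argument("--dry-run", action="store_true")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    sources = args.source or ALL_SOURCES
    # Placeholder: the real fetch, reconcile, and persist steps do not exist yet.
    print(f"would catalog {args.pmc!r} from {', '.join(sources)} (dry_run={args.dry_run})")
    return 0
```

   Example invocation under these assumptions: `main(["tooling", "--source",
   "doap", "--dry-run"])`.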
   
   ### Proposed approach
   The implementation should create a cataloging subsystem that aggregates 
release data from the four specified sources into a unified per-PMC release 
history. The existing `atr.datasources.apache.ProjectsData` model already 
handles DOAP-sourced releases (source #2), and `atr.analysis` handles SVN dist 
path parsing (source #1). New modules are needed for reporter.apache.org 
(source #3) and announce list mbox parsing (source #4).
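
   As an illustration of source #4, the sketch below pulls a project name and
   version out of an announcement's subject line. The subject shape
   (`[ANNOUNCE] Apache <Project> <version> ...`) and the function name
   `parse_announcement` are assumptions; real announcements vary, so a
   production parser would need a broader set of patterns.

```python
import re
from email import message_from_string
from typing import Optional, Tuple

# Assumed subject shape, e.g. "[ANNOUNCE] Apache Accumulo 2.1.3 released".
# This pattern is illustrative only and will miss many real subjects.
_SUBJECT_RE = re.compile(
    r"^\[ANNOUNCE\]\s+(?:Apache\s+)?(?P<project>.+?)\s+v?(?P<version>\d+(?:\.\d+)+\S*)",
    re.IGNORECASE,
)


def parse_announcement(raw_message: str) -> Optional[Tuple[str, str]]:
    """Return (project, version) from a raw RFC 822 message, or None if no match."""
    msg = message_from_string(raw_message)
    # Collapse folded header whitespace before matching.
    subject = " ".join((msg.get("Subject") or "").split())
    match = _SUBJECT_RE.match(subject)
    if match is None:
        return None
    return match.group("project"), match.group("version")
```

   Messages whose subject does not match are skipped rather than guessed at,
   which keeps false positives out of the catalog at the cost of missed
   releases.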
   
   A new SQL model (`CatalogedRelease` or similar) should store discovered 
releases with provenance tracking (which source(s) confirmed each release). A 
CLI script (`scripts/catalog_pmc.py`) should orchestrate the manual per-PMC 
workflow: fetch data from all available sources for a given PMC, 
reconcile/deduplicate releases using version pattern matching, and persist the 
results. The reconciliation logic should use 'permissible version patterns' (as 
mentioned in the issue) to normalize version strings across sources. Since this 
is explicitly manual and per-PMC, no background worker integration is needed 
initially.
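
   The reconciliation step could be sketched as follows. This is an in-memory
   stand-in for the proposed SQL model, not the model itself: the
   `CatalogedRelease` fields, the `reconcile` function, and the version regex
   are all assumptions, since the issue's 'permissible version patterns' are
   not yet defined.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed version pattern: dotted numerics with an optional suffix such as
# "-rc1" or "-incubating". The issue may define different patterns.
_VERSION_RE = re.compile(
    r"^v?(\d+(?:\.\d+)*)(?:[-_.]?([a-z][a-z0-9]*))?$", re.IGNORECASE
)


def normalize_version(raw: str) -> Optional[str]:
    """Return a canonical version string, or None if the input is not permissible."""
    match = _VERSION_RE.match(raw.strip())
    if match is None:
        return None
    numeric, suffix = match.groups()
    return f"{numeric}-{suffix.lower()}" if suffix else numeric


@dataclass
class CatalogedRelease:
    project: str
    version: str
    # Provenance: which data sources reported this release.
    sources: Set[str] = field(default_factory=set)


def reconcile(
    observations: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], CatalogedRelease]:
    """Merge (source, project, raw_version) observations into one catalog."""
    catalog: Dict[Tuple[str, str], CatalogedRelease] = {}
    for source, project, raw_version in observations:
        version = normalize_version(raw_version)
        if version is None:
            continue  # not a permissible version; skip rather than guess
        key = (project.lower(), version)
        release = catalog.setdefault(key, CatalogedRelease(key[0], version))
        release.sources.add(source)
    return catalog
```

   Keying on the normalized `(project, version)` pair is what deduplicates
   observations such as `2.1.0` from SVN and `v2.1.0` from DOAP, while the
   `sources` set records which sources confirmed each release.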
   
   ### Open questions
   - What is the full implementation of `atr/analysis.py` (referenced by 
scripts/release_path_parse.py)? Understanding its current capabilities is 
needed before extending it.
   - What is the full implementation of `atr/datasources/apache.py`? The test 
shows model classes but the actual source would reveal how data is fetched and 
what release fields are already modeled.
   - Does the existing database schema (atr/models/sql.py) already have any 
table for tracking historical releases, or would an entirely new table be 
needed?
   - What format does reporter.apache.org expose its release dataset in? Is 
there an API or static JSON?
   - What are the 'permissible version patterns' referenced in the issue—should 
these be configurable per PMC or is there a standard ASF version regex?
   - Should the cataloged data be exposed in the web UI, or is this purely a 
data-collection feature for now?
   
   ### Files examined
   - `tests/unit/datasources/test_apache.py`
   - `tests/unit/datasources/testdata/committees.json`
   - `tests/unit/datasources/testdata/groups.json`
   - `tests/unit/datasources/testdata/ldap_projects.json`
   - `tests/unit/datasources/testdata/podlings.json`
   - `tests/unit/datasources/testdata/projects.json`
   - `tests/unit/datasources/testdata/retired_committees.json`
   - `scripts/release_path_parse.py`
   
   ### Related issues
   This issue appears related to: #479.
   
   _Both address cataloging and watching SVN dist releases for projects_
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

