asf-tooling commented on issue #443:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/443#issuecomment-4410311950

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `refactor`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `project_committee_management`, 
`web_api_infrastructure`
   
   ### Summary
   The issue requests DOAP metadata import from projects.apache.org and a 
metadata download endpoint. Per @dave2wave's comment, both already exist: the 
import is in `atr/datasources/apache.py`, and a metadata endpoint is available. 
The remaining actionable work, identified by @sebbASF, is that multi-valued 
items (categories, programming languages) are stored and served as 
comma-separated strings rather than JSON arrays, which is 'not ideal for 
further processing' and 'might be ambiguous in some circumstances'. The fix 
requires changing how these fields are stored in the DB and serialized in API 
responses.
   
   ### Where this lives in the code today
   
   #### `atr/datasources/apache.py` — `ProjectStatus` (lines 192-212)
   _currently does this_
   The source data model correctly parses multi-valued items as lists — the 
issue is only in how they're persisted and serialized.
   
   ```python
   class ProjectStatus(schema.Strict):
       category: list[str] = schema.factory(list)
       created: str | None = None
       description: str | None = None
       programming_language: list[str] = schema.Field(alias="programming-language", default_factory=list)
       doap: str | None = None
       homepage: str
       name: str
       pmc: str | None
       shortdesc: str | None = None
       repository: list[str | dict] = schema.factory(list)
       release: list[Release] = schema.factory(list)
       ...
       @pydantic.field_validator("category", "programming_language", mode="before")
       @classmethod
       def _coerce_to_list(cls, v: object) -> list[str]:
           if isinstance(v, list):
               return [str(x) for x in v]
           if isinstance(v, str):
               return [v] if v else []
           return []
   ```
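
To make the validator's behaviour concrete, here is a standalone copy of the `_coerce_to_list` logic shown above, lifted out of the model so it runs without `pydantic` or the project's `schema` module. The function name `coerce_to_list` and the sample inputs are illustrative only.

```python
def coerce_to_list(v: object) -> list[str]:
    """Standalone copy of the `_coerce_to_list` logic from `ProjectStatus`."""
    if isinstance(v, list):
        return [str(x) for x in v]
    if isinstance(v, str):
        return [v] if v else []
    return []


single = coerce_to_list("Java")            # a bare string becomes a one-item list
mixed = coerce_to_list(["Java", 11])       # list items are converted to strings
empty = coerce_to_list("")                 # an empty string becomes an empty list
missing = coerce_to_list(None)             # any other type becomes an empty list
unsplit = coerce_to_list("Java, Python")   # commas inside a string are not split
```

Note that a string containing commas is kept as a single item, so the list form is only as good as the upstream data.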
   
   #### `atr/datasources/apache.py` — `get_projects_data` (lines 277-284)
   _currently does this_
   Already fetches DOAP-derived JSON from projects.apache.org — the import that 
the issue body requested is already in place.
   
   ```python
   async def get_projects_data() -> ProjectsData:
       """Returns the list of projects."""
   
       async with util.create_secure_session() as session:
           async with session.get(_PROJECTS_PROJECTS_URL) as response:
               response.raise_for_status()
               data = await response.json()
       return ProjectsData.model_validate(data)
   ```
   
   ### Where new code would go
   - `atr/models/sql.py` — Project model class
     The Project model likely stores `category` and `programming_languages` as 
`str | None`. These should be changed to JSON list columns (e.g., `list[str]` 
backed by a JSON column or a SQLModel Column with a JSON type) so they are 
natively stored and served as arrays.
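
As a rough illustration of the storage idea only, the sketch below uses the standard-library `sqlite3` and `json` modules instead of SQLModel. The table and column names are hypothetical and do not reflect the real `sql.Project` schema, which was not examined. A JSON column type in an ORM would be expected to do a comparable serialization step for you.

```python
import json
import sqlite3

# In-memory database with a hypothetical table; not the real ATR schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project (project_key TEXT PRIMARY KEY, category TEXT NOT NULL)"
)


def save_category(project_key: str, category: list[str]) -> None:
    """Serialize the list to JSON text before writing it."""
    conn.execute(
        "INSERT OR REPLACE INTO project (project_key, category) VALUES (?, ?)",
        (project_key, json.dumps(category)),
    )


def load_category(project_key: str) -> list[str]:
    """Read the JSON text back into a list; unknown keys yield an empty list."""
    row = conn.execute(
        "SELECT category FROM project WHERE project_key = ?", (project_key,)
    ).fetchone()
    return json.loads(row[0]) if row else []


save_category("example", ["network-server", "big-data, analytics"])
print(load_category("example"))
```

Because the value is stored as a JSON array, an item that itself contains a comma survives the round trip unchanged.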
   
   ### Proposed approach
   The core change is to store `category` and `programming_languages` on the 
`sql.Project` model as proper JSON list columns rather than comma-separated 
strings. This involves: (1) updating the `sql.Project` model to use list fields 
backed by JSON columns, (2) updating `_update_projects` in 
`atr/datasources/apache.py` to assign the lists directly instead of joining 
them, (3) updating UI code in `atr/get/projects.py` and `atr/post/projects.py` 
to use the list directly instead of splitting/joining, and (4) adding a 
database migration if required.
   
   Since I haven't seen the full `sql.Project` model definition, I'll propose 
changes only to the files I have read. The `sql.Project` model change and any 
needed migration must be verified against the actual schema. If the existing 
`/api/project/get/<project_key>` endpoint serializes these model fields 
directly, it should then serve them as JSON arrays, satisfying @sebbASF's 
requirement.
   
   ### Suggested patches
   
   #### `atr/datasources/apache.py`
   Store category and programming_languages as lists directly instead of 
joining into comma-separated strings.
   
   ````diff
   --- a/atr/datasources/apache.py
   +++ b/atr/datasources/apache.py
   @@ -397,8 +397,8 @@
            # Pass the project name through the validator
            safe.ProjectKey(project_model.key)
            project_model.name = str(project_status.name)
   -        project_model.category = ", ".join(project_status.category) or None
   +        project_model.category = project_status.category  # TODO: confirm sql.Project.category accepts list[str]
            project_model.description = project_status.description
   -        project_model.programming_languages = ", ".join(project_status.programming_language) or None
   +        project_model.programming_languages = project_status.programming_language  # TODO: confirm sql.Project.programming_languages accepts list[str]
    
        return added_count, updated_count
   ````
   
   ### Open questions
   - What is the actual column type for `category` and `programming_languages` 
in `sql.Project`? A schema migration will be needed if they are currently TEXT 
columns.
   - Does the existing `/api/project/get/<project_key>` endpoint serve as the 
'metadata download endpoint' @dave2wave mentioned, or is there a separate 
endpoint not shown in the provided files?
   - Issue #455 is mentioned as tied to this — what is its status and does it 
overlap with this change?
   - Are there other consumers of `project.category` or 
`project.programming_languages` beyond the files shown that would need updating?
   
   ### Files examined
   - `atr/datasources/apache.py`
   - `atr/blueprints/api.py`
   - `atr/tasks/metadata.py`
   - `atr/models/api.py`
   - `atr/blueprints/common.py`
   - `atr/get/projects.py`
   - `atr/api/__init__.py`
   - `atr/post/projects.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

