asf-tooling commented on issue #443:
URL:
https://github.com/apache/tooling-trusted-releases/issues/443#issuecomment-4410311950
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@2da7807a`
**Type:** `refactor` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `project_committee_management`,
`web_api_infrastructure`
### Summary
The issue requests DOAP metadata import from projects.apache.org and a metadata
download endpoint. Per @dave2wave's comment, both already exist: the import is
in `atr/datasources/apache.py`, and a metadata endpoint is available. The
remaining actionable work, identified by @sebbASF, is that multi-valued items
(categories, programming languages) are stored and served as comma-separated
strings rather than JSON arrays, which is 'not ideal for further processing'
and 'might be ambiguous in some circumstances'. The fix requires changing how
these fields are stored in the database and serialized in API responses.
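To make the ambiguity concrete, here is a minimal sketch using only the standard library. The category values are hypothetical and chosen so that one of them contains a comma; nothing here is taken from the ATR code base.

```python
import json

# Hypothetical values; the second item contains a comma on purpose.
categories = ["big-data", "library, java"]

# Current representation: join with ", " on write, split on read.
joined = ", ".join(categories)
round_tripped = joined.split(", ")

# Proposed representation: a JSON array keeps item boundaries explicit.
encoded = json.dumps(categories)
decoded = json.loads(encoded)

print(round_tripped)  # the two original items come back as three
print(decoded)        # the JSON round trip preserves the original list
```

The joined string gives a reader no way to tell a separator from a comma inside a value, which is the case a JSON array avoids.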
### Where this lives in the code today
#### `atr/datasources/apache.py` — `ProjectStatus` (lines 192-212)
_currently does this_
The source data model correctly parses multi-valued items as lists — the
issue is only in how they're persisted and serialized.
```python
class ProjectStatus(schema.Strict):
    category: list[str] = schema.factory(list)
    created: str | None = None
    description: str | None = None
    programming_language: list[str] = schema.Field(alias="programming-language", default_factory=list)
    doap: str | None = None
    homepage: str
    name: str
    pmc: str | None
    shortdesc: str | None = None
    repository: list[str | dict] = schema.factory(list)
    release: list[Release] = schema.factory(list)
    ...

    @pydantic.field_validator("category", "programming_language", mode="before")
    @classmethod
    def _coerce_to_list(cls, v: object) -> list[str]:
        if isinstance(v, list):
            return [str(x) for x in v]
        if isinstance(v, str):
            return [v] if v else []
        return []
```
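The coercion rules in `_coerce_to_list` can be exercised in isolation. The sketch below copies the validator body into a plain function (the name `coerce_to_list` and the sample inputs are illustrative, and the pydantic wiring is omitted) to show what each input shape becomes.

```python
def coerce_to_list(v: object) -> list[str]:
    """Standalone copy of the validator logic, without the pydantic wiring."""
    if isinstance(v, list):
        return [str(x) for x in v]
    if isinstance(v, str):
        # A non-empty string becomes a single item; it is not split on commas.
        return [v] if v else []
    # None and any other type collapse to an empty list.
    return []


examples = {
    "list": coerce_to_list(["Java", "Python"]),
    "string": coerce_to_list("Java"),
    "comma string": coerce_to_list("Java, Python"),
    "empty string": coerce_to_list(""),
    "missing": coerce_to_list(None),
}
```

Note that a string containing commas is kept as one item, so the validator itself never introduces the comma-splitting ambiguity.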
#### `atr/datasources/apache.py` — `get_projects_data` (lines 277-284)
_currently does this_
Already fetches DOAP-derived JSON from projects.apache.org — the import that
the issue body requested is already in place.
```python
async def get_projects_data() -> ProjectsData:
    """Returns the list of projects."""
    async with util.create_secure_session() as session:
        async with session.get(_PROJECTS_PROJECTS_URL) as response:
            response.raise_for_status()
            data = await response.json()
    return ProjectsData.model_validate(data)
```
### Where new code would go
- `atr/models/sql.py` — Project model class
  The Project model likely stores `category` and `programming_languages` as
  `str | None`. These should become `list[str]` fields backed by a JSON column
  (for example, a SQLModel column declared with a JSON type) so that they are
  stored and served as arrays.
### Proposed approach
The core change is to store `category` and `programming_languages` on the
`sql.Project` model as proper JSON list columns rather than comma-separated
strings. This involves:
1. Updating the `sql.Project` model to use list fields backed by JSON columns.
2. Updating `_update_projects` in `atr/datasources/apache.py` to assign the
   lists directly instead of joining them.
3. Updating UI code in `atr/get/projects.py` and `atr/post/projects.py` to use
   the list directly instead of splitting/joining.
4. Adding a database migration if required.
Since I haven't seen the full `sql.Project` model definition, I'll propose
changes only to the files I have read. The `sql.Project` model change and any
needed migration must be verified against the actual schema. Assuming the
existing `/api/project/get/<project_key>` endpoint serializes the model fields
directly, it should then serve these as JSON arrays, satisfying @sebbASF's
requirement.
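If the columns are currently TEXT, existing rows would need a one-off data conversion. The following is a rough sketch using only the standard-library `sqlite3` module. The table layout, the `project_key` column, and the helper `to_json_array` are hypothetical stand-ins, not the real ATR schema or its migration tooling, and writing `[]` for empty values is an assumption (the current code stores `None`).

```python
import json
import sqlite3


def to_json_array(value):
    """Convert a legacy comma-separated string (or NULL) to JSON array text."""
    if not value:
        return "[]"
    items = [part.strip() for part in value.split(",")]
    return json.dumps([item for item in items if item])


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in table; only the two affected columns are modelled.
conn.execute(
    "CREATE TABLE project ("
    "project_key TEXT PRIMARY KEY, category TEXT, programming_languages TEXT)"
)
conn.execute(
    "INSERT INTO project VALUES ('example', 'library, network-server', 'Java, Python')"
)
conn.execute("INSERT INTO project VALUES ('empty', NULL, NULL)")

# Read every row first, then rewrite both columns as JSON array text.
rows = conn.execute(
    "SELECT project_key, category, programming_languages FROM project"
).fetchall()
for project_key, category, languages in rows:
    conn.execute(
        "UPDATE project SET category = ?, programming_languages = ? "
        "WHERE project_key = ?",
        (to_json_array(category), to_json_array(languages), project_key),
    )
conn.commit()

migrated = {
    project_key: (json.loads(category), json.loads(languages))
    for project_key, category, languages in conn.execute(
        "SELECT project_key, category, programming_languages FROM project"
    )
}
```

Splitting legacy strings on commas is lossy for any value that itself contained a comma, so re-importing from projects.apache.org after the schema change may be a safer way to repopulate the columns.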
### Suggested patches
#### `atr/datasources/apache.py`
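Store `category` and `programming_languages` as lists directly instead of
joining them into comma-separated strings. Note that the current code stores
`None` when a list is empty (`or None`); with this patch an empty list would be
stored as-is.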
````diff
--- a/atr/datasources/apache.py
+++ b/atr/datasources/apache.py
@@ -397,8 +397,8 @@
 # Pass the project name through the validator
 safe.ProjectKey(project_model.key)
 project_model.name = str(project_status.name)
- project_model.category = ", ".join(project_status.category) or None
+ project_model.category = project_status.category  # TODO: confirm sql.Project.category accepts list[str]
 project_model.description = project_status.description
- project_model.programming_languages = ", ".join(project_status.programming_language) or None
+ project_model.programming_languages = project_status.programming_language  # TODO: confirm sql.Project.programming_languages accepts list[str]
 return added_count, updated_count
````
### Open questions
- What is the actual column type for `category` and `programming_languages`
in `sql.Project`? A schema migration will be needed if they are currently TEXT
columns.
- Does the existing `/api/project/get/<project_key>` endpoint serve as the
'metadata download endpoint' @dave2wave mentioned, or is there a separate
endpoint not shown in the provided files?
- Issue #455 is mentioned as tied to this — what is its status and does it
overlap with this change?
- Are there other consumers of `project.category` or
`project.programming_languages` beyond the files shown that would need updating?
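While other consumers are being located, one option is a tolerant reader that accepts both representations. The helper below is a hypothetical sketch (the name `as_list` does not exist in the code base) that normalizes either a list or a legacy comma-separated string.

```python
def as_list(value: object) -> list[str]:
    """Return a list of strings from either the new or the legacy form."""
    if value is None:
        return []
    if isinstance(value, list):
        return [str(item) for item in value]
    if isinstance(value, str):
        # Legacy form: comma-separated, with optional spaces around items.
        return [part.strip() for part in value.split(",") if part.strip()]
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```

A helper like this would only be a transitional measure; once every writer stores lists, callers can use the field directly.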
### Files examined
- `atr/datasources/apache.py`
- `atr/blueprints/api.py`
- `atr/tasks/metadata.py`
- `atr/models/api.py`
- `atr/blueprints/common.py`
- `atr/get/projects.py`
- `atr/api/__init__.py`
- `atr/post/projects.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]