Re: [I] Unbounded Response Sizes on Multiple List Endpoints (tooling-trusted-releases)

via GitHub Fri, 08 May 2026 14:07:08 -0700


asf-tooling commented on issue #1062:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1062#issuecomment-4409859432


   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@2da7807a`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `high`
   **Application domain(s):** `web_api_infrastructure`, `admin_operations`, 
`shared_infrastructure`
   
   ### Summary
   Multiple API endpoints return unbounded result sets without pagination. The 
issue is confirmed by a TODO comment in checks_list() and by the absence of 
limit/offset on several endpoints. The codebase already has an established 
pagination pattern (ReleasesListQuery, SshKeysListQuery, TasksListQuery 
dataclasses with limit/offset) and a pagination_args_validate utility, so this 
is a matter of applying the existing pattern to the affected endpoints.
   
   ### Where this lives in the code today
   
   #### `atr/api/__init__.py` — `checks_list` (lines 67-98)
   _needs modification_
   This endpoint explicitly acknowledges the need for pagination via the TODO 
comment and returns all check results unbounded.
   
   ```python
   @api.typed
   @quart_schema.validate_response(models.api.ChecksListResults, 200)
   async def checks_list(
       _checks_list: Literal["checks/list"],
       project_key: safe.ProjectKey,
       version_key: safe.VersionKey,
   ) -> DictResponse:
       """
       URL: GET /checks/list/<project_key>/<version_key>
   
       List checks by project and version.
   
       Checks are only conducted during the compose a draft phase. This endpoint
       only returns the checks for the most recent draft revision. Once a 
release
       has been promoted to the vote phase or beyond, the checks returned are
       still those for the compose phase.
   
       Warning: the check results include results for archive members, so there
       may potentially be thousands or results or more.
       """
       # TODO: We should perhaps paginate this
       async with db.session() as data:
           release_key = sql.release_key(str(project_key), str(version_key))
           release = await 
data.release(key=release_key).demand(exceptions.NotFound(f"Release 
{release_key} not found"))
           check_results = await interaction.checks_for(release, 
caller_data=data)
   
       return models.api.ChecksListResults(
           endpoint="/checks/list",
           checks=check_results,
           checks_revision=release.unwrap_revision_number,
           current_phase=release.phase,
       ).model_dump(mode="json"), 200
   ```
   
   #### `atr/models/validation.py` — `pagination_args_validate` (lines 70-86)
   _currently does this_
   Pagination validation utility already exists and enforces max limit of 1000 
and max offset of 1000000.
   
   ```python
   def pagination_args_validate(query_args: Any) -> None:
       # Users could request any amount using limit=N with arbitrarily high N
       # We therefore limit the maximum limit to 1000
       if hasattr(query_args, "limit"):
           limit = query_args.limit
           if limit > 1000:
               raise ValueError("Maximum limit of 1000 exceeded")
           elif limit < 1:
               raise ValueError("Minimum limit less than 1 is nonsense")
       # Users could request any amount using offset=N with arbitrarily high N
       # We therefore limit the maximum offset to 1000000
       if hasattr(query_args, "offset"):
           offset = query_args.offset
           if offset > 1000000:
               raise ValueError("Maximum offset of 1000000 exceeded")
           elif offset < 0:
               raise ValueError("Minimum offset less than 0 is nonsense")
   ```
   
   #### `atr/models/api.py` — `ReleasesListQuery` (lines 550-554)
   _currently does this_
   Established pattern for pagination query parameters using a dataclass parsed 
from query string.
   
   ```python
   @dataclasses.dataclass
   class ReleasesListQuery:
       offset: int = 0
       limit: int = 20
       phase: str | None = None
   ```
   
   #### `atr/models/api.py` — `ReleasesListResults` (lines 557-560)
   _currently does this_
   Established pattern for paginated API response including total count.
   
   ```python
   class ReleasesListResults(schema.Strict):
       endpoint: Literal["/releases/list"] = schema.alias("endpoint")
       data: Sequence[sql.Release]
       count: int
   ```
   
   #### `atr/models/api.py` — `ChecksListResults` (lines 36-45)
   _needs modification_
   Response model needs additional fields for pagination metadata (count, 
limit, offset).
   
   ```python
   class ChecksListResults(schema.Strict):
       endpoint: Literal["/checks/list"] = schema.alias("endpoint")
       checks: Sequence[sql.CheckResult]
       checks_revision: str = schema.example("00005")
       current_phase: sql.ReleasePhase = 
schema.example(sql.ReleasePhase.RELEASE_CANDIDATE)
   
       @pydantic.field_validator("current_phase", mode="before")
       @classmethod
       def current_phase_to_enum(cls, v):
           return sql.ReleasePhase(v) if isinstance(v, str) else v
   ```
   
   #### `atr/admin/__init__.py` — `consistency` (lines 145-173)
   _needs modification_
   Admin consistency endpoint loads all releases and walks the entire 
filesystem without limits.
   
   ```python
   @admin.typed
   async def consistency(_session: web.Committer, _consistency: 
Literal["consistency"]) -> web.TextResponse:
       """
       URL: GET /consistency
   
       Check for consistency between the database and the filesystem.
       """
       # Get all releases from the database
       async with db.session() as data:
           releases = await data.release().all()
       database_dirs = []
       for release in releases:
           path = paths.release_directory_version(release)
           database_dirs.append(str(path))
       if len(set(database_dirs)) != len(database_dirs):
           raise base.ASFQuartException("Duplicate release directories in 
database", errorcode=500)
   
       # Get all releases from the filesystem
       filesystem_dirs = await _get_filesystem_dirs()
   
       # Pair them up where possible
       paired_dirs = []
       for database_dir in database_dirs[:]:
           for filesystem_dir in filesystem_dirs[:]:
               if database_dir == filesystem_dir:
                   paired_dirs.append(database_dir)
                   database_dirs.remove(database_dir)
                   filesystem_dirs.remove(filesystem_dir)
                   break
   ```
   
   #### `atr/blueprints/common.py` — `build_api_path` (lines 122-136)
   _currently does this_
   The API path builder already supports dataclass parameters parsed from query 
strings, which is how pagination query params are passed.
   
   ```python
   def build_api_path(
       func: Callable[..., Any],
   ) -> tuple[
       str,
       list[tuple[str, type]],
       dict[str, str],
       tuple[str, type[pydantic.BaseModel]] | None,
       tuple[str, type] | None,
       tuple[str, type] | None,
       list[str],
   ]:
       """Inspect a function's type hints to build a URL path for an API route.
   
       Accepts URL path params for data, Literal strings for plain URL text, 
dataclasses for GET query params
       and Pydantic model params for POST bodies
   ```
   
   ### Where new code would go
   - `atr/models/api.py` — after symbol ReleasesListQuery
     Add ChecksListQuery dataclass following the established pattern for 
pagination query parameters.
   
   ### Proposed approach
   Apply the existing pagination pattern (dataclass query params + validation + 
count in response) to the checks_list endpoint as the highest-priority fix. The 
codebase already has ReleasesListQuery, SshKeysListQuery, and TasksListQuery as 
precedents. Add a ChecksListQuery dataclass with limit/offset defaults, call 
pagination_args_validate on it, then slice the results from 
interaction.checks_for() and return the total count. The ChecksListResults 
model gets additional optional fields (count, limit, offset) so existing 
clients aren't broken.
   
   For the admin endpoints (_data_browse and consistency), these are lower 
priority since they're admin-only and already behind authentication. A simple 
hard limit on _data_browse records and a note about consistency being 
inherently unbounded (it's a diagnostic tool) would suffice for now.
   
   ### Suggested patches
   
   #### `atr/models/api.py`
   Add ChecksListQuery dataclass for pagination params and extend 
ChecksListResults with pagination metadata.
   
   ````diff
   --- a/atr/models/api.py
   +++ b/atr/models/api.py
   @@ -30,11 +30,19 @@ class ResultsTypeError(TypeError):
        pass
    
    
   [email protected]
   +class ChecksListQuery:
   +    offset: int = 0
   +    limit: int = 100
   +
   +
    class ChecksListResults(schema.Strict):
        endpoint: Literal["/checks/list"] = schema.alias("endpoint")
        checks: Sequence[sql.CheckResult]
        checks_revision: str = schema.example("00005")
        current_phase: sql.ReleasePhase = 
schema.example(sql.ReleasePhase.RELEASE_CANDIDATE)
   +    count: int = schema.example(42)
   +    limit: int = schema.example(100)
   +    offset: int = schema.example(0)
    
        @pydantic.field_validator("current_phase", mode="before")
        @classmethod
   ````
   
   #### `atr/api/__init__.py`
   Add pagination support to checks_list using the new ChecksListQuery 
dataclass and validation.
   
   ````diff
   --- a/atr/api/__init__.py
   +++ b/atr/api/__init__.py
   @@ -63,6 +63,7 @@ async def checks_list(
        _checks_list: Literal["checks/list"],
        project_key: safe.ProjectKey,
        version_key: safe.VersionKey,
   +    query: models.api.ChecksListQuery = models.api.ChecksListQuery(),
    ) -> DictResponse:
        """
        URL: GET /checks/list/<project_key>/<version_key>
   @@ -76,16 +77,21 @@ async def checks_list(
        Warning: the check results include results for archive members, so there
        may potentially be thousands or results or more.
        """
   -    # TODO: We should perhaps paginate this
   +    validation.pagination_args_validate(query)
        async with db.session() as data:
            release_key = sql.release_key(str(project_key), str(version_key))
            release = await 
data.release(key=release_key).demand(exceptions.NotFound(f"Release 
{release_key} not found"))
            check_results = await interaction.checks_for(release, 
caller_data=data)
    
   +    total_count = len(check_results)
   +    paginated_checks = check_results[query.offset : query.offset + 
query.limit]
   +
        return models.api.ChecksListResults(
            endpoint="/checks/list",
   -        checks=check_results,
   +        checks=paginated_checks,
            checks_revision=release.unwrap_revision_number,
            current_phase=release.phase,
   +        count=total_count,
   +        limit=query.limit,
   +        offset=query.offset,
        ).model_dump(mode="json"), 200
   ````
   
   #### `atr/api/__init__.py`
   Add pagination support to checks_list_revision using the same pattern.
   
   ````diff
   --- a/atr/api/__init__.py
   +++ b/atr/api/__init__.py
   @@ -90,6 +90,7 @@ async def checks_list_revision(
        project_key: safe.ProjectKey,
        version_key: safe.VersionKey,
        revision: safe.RevisionNumber,
   +    query: models.api.ChecksListQuery = models.api.ChecksListQuery(),
    ) -> DictResponse:
        """
        URL: GET /checks/list/<project_key>/<version_key>/<revision>
   @@ -102,6 +103,7 @@ async def checks_list_revision(
        Warning: the check results include results for archive members, so there
        may potentially be thousands or results or more.
        """
   +    validation.pagination_args_validate(query)
        async with db.session() as data:
            release_key = sql.release_key(str(project_key), str(version_key))
            release_result = await data.release(key=release_key).demand(
   @@ -114,9 +116,14 @@ async def checks_list_revision(
    
            check_results = await interaction.checks_for(release_result, 
revision=revision, caller_data=data)
    
   +    total_count = len(check_results)
   +    paginated_checks = check_results[query.offset : query.offset + 
query.limit]
   +
        return models.api.ChecksListResults(
            endpoint="/checks/list",
   -        checks=check_results,
   +        checks=paginated_checks,
            checks_revision=str(revision),
            current_phase=release_result.phase,
   +        count=total_count,
   +        limit=query.limit,
   +        offset=query.offset,
        ).model_dump(mode="json"), 200
   ````
   
   ### Open questions
   - The interaction.checks_for() function (in atr/db/interaction.py, not 
provided) fetches all results in memory before slicing. For truly large result 
sets, database-level LIMIT/OFFSET would be more efficient. Need to check if 
interaction.checks_for can accept limit/offset parameters.
   - The release_paths() endpoint is referenced in the issue but the source 
wasn't provided in the truncated file. Need to see its implementation to add 
pagination.
   - The _data_browse() admin function implementation was not visible in the 
truncated source. Its pagination would follow the same pattern but needs the 
actual code.
   - Adding count/limit/offset to ChecksListResults is additive but changes the 
response schema. Need to confirm this won't break existing API consumers 
expecting exactly the old shape.
   - The consistency admin endpoint is inherently a diagnostic tool that needs 
to see all data. A hard cap there may hide real inconsistencies. Perhaps a 
streaming response or summary-only mode would be more appropriate.
   
   ### Files examined
   - `atr/api/__init__.py`
   - `atr/admin/__init__.py`
   - `atr/db/__init__.py`
   - `atr/models/validation.py`
   - `atr/models/api.py`
   - `atr/blueprints/api.py`
   - `atr/blueprints/common.py`
   - `atr/models/sql.py`
   
   ### Related issues
   This issue appears related to: #1072.
   
   _Both address unbounded response sizes and pagination issues on API 
endpoints_
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Unbounded Response Sizes on Multiple List Endpoints (tooling-trusted-releases)

Reply via email to