yamoyamoto opened a new issue, #8842:
URL: https://github.com/apache/incubator-devlake/issues/8842

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When collecting workflow runs from a large GitHub.com (Enterprise Cloud) 
repository, the `Collect Workflow Runs` subtask fails with HTTP 422 once the 
collector crosses an undocumented pagination boundary, and the whole pipeline 
aborts. A representative error log:
   
   ```
   subtask Collect Workflow Runs ended unexpectedly
   Wraps: (2) Error waiting for async Collector execution
     | combined messages: 
     | {
     |   Retry exceeded 3 times calling repos/{owner}/{repo}/actions/runs.
     |   The last error was: Http DoAsync error calling
     |   [method:GET path:repos/{owner}/{repo}/actions/runs
     |    query:map[page:[1340] per_page:[30]]].
     |   Response: {"message":"In order to keep the API fast for everyone,
     |              pagination is limited for this resource.",
     |              "documentation_url":"https://docs.github.com/v3/#pagination",
     |              "status":"422"} (422)
     | }
   ```
   
   `page=1340 × per_page=30 = 40,200` items, which crosses a hard `per_page × page ≤ 40,000 items` cap that github.com enforces on unfiltered `/actions/runs` pagination. The cap is easy to probe directly with `gh api`:
`per_page=100&page=400` returns HTTP 200 with `total_count: 40000` and `Link 
rel="last" page=400`, while `per_page=100&page=401` (and anything beyond) 
returns HTTP 422 with the same `"pagination is limited for this resource"` 
message. `total_count` is clamped at 40,000 even though the repository has 
significantly more runs. Neither this 40k boundary nor the 422 response is 
described in the [official docs for this 
endpoint](https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository),
 which only mention a separate cap for *filtered* queries ("up to 1,000 results 
for each search when using `actor`, `branch`, `check_suite_id`, `created`, 
`event`, `head_sha`, or `status`").
   
   On the DevLake side, the root cause is in 
[`backend/plugins/github/tasks/cicd_run_collector.go` 
L77-84](https://github.com/apache/incubator-devlake/blob/a21382accbcccbf7ae7d9e2d3558a07cc4301afc/backend/plugins/github/tasks/cicd_run_collector.go#L77-L84):
   
   ```go
   Query: func(reqData *helper.RequestData, createdAfter *time.Time) 
(url.Values, errors.Error) {
       query := url.Values{}
       query.Set("page", fmt.Sprintf("%v", reqData.Pager.Page))
       query.Set("per_page", fmt.Sprintf("%v", reqData.Pager.Size))
       return query, nil
   },
   ```
   
   `createdAfter` is received but never forwarded to the server — time 
filtering happens purely on the client side. With `Concurrency=10` and 
`PageSize=30`, concurrent workers blindly advance until some of them cross the 
40k boundary and hit 422. Once 3 retries are exhausted for any one page the 
whole subtask fails, and no partial data is salvaged (`Extract Workflow Runs` / 
`Convert Workflow Runs` in the same task do not run either).
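   The page at which an unfiltered collector first trips the cap follows directly from `per_page × page ≤ 40,000`; a tiny helper (hypothetical, for illustration only) makes the arithmetic explicit:

   ```go
   package main

   import "fmt"

   // firstFailingPage returns the first page number whose page*perPage
   // product exceeds the (undocumented) item cap, i.e. the first request
   // github.com answers with HTTP 422 on unfiltered /actions/runs.
   func firstFailingPage(perPage, itemCap int) int {
   	return itemCap/perPage + 1
   }

   func main() {
   	fmt.Println(firstFailingPage(30, 40000))  // default PageSize=30 → 1334
   	fmt.Println(firstFailingPage(100, 40000)) // per_page=100 → 401
   }
   ```

   This matches both observations above: with `per_page=100` page 400 (exactly 40,000 items) still returns 200 while page 401 fails, and with the default `PageSize=30` the boundary sits at page 1,334.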
   
   GraphQL is not a viable fallback: the [`Repository` object in GitHub's 
GraphQL 
schema](https://docs.github.com/en/graphql/reference/objects#repository) has no 
`workflowRuns`, `workflows`, or `actions` field, and [`WorkflowRun` is only 
reachable as a singular field on 
`CheckSuite`](https://docs.github.com/en/graphql/reference/objects#checksuite) 
(not a connection), so there is no way to list a repository's workflow runs via 
GraphQL. DevLake's `github_graphql` plugin already acknowledges this by 
importing the REST collector 
([`plugins/github_graphql/impl/impl.go:97`](https://github.com/apache/incubator-devlake/blob/a21382accbcccbf7ae7d9e2d3558a07cc4301afc/backend/plugins/github_graphql/impl/impl.go#L97)).
   
   ### What do you expect to happen
   
   `Collect Workflow Runs` should successfully collect the full set of workflow 
runs within the blueprint's `timeAfter` window regardless of total volume. 
Large repositories on github.com with more than 40,000 runs in-window should 
not cause the pipeline to abort.
   
   ### How to reproduce
   
   Configure a GitHub connection pointing at a github.com repository that has 
more than 40,000 workflow runs, set the blueprint's `timeAfter` to a date far 
enough back that the range contains >40,000 runs, and trigger "Collect Data". 
The pipeline fails on `Collect Workflow Runs` once the collector reaches `page 
× per_page > 40,000` (with default `PageSize=30` this happens near page 1,334). 
Any sufficiently busy CI repository reaches the boundary eventually; no 
specific GitHub feature beyond volume is required.
   
   ### Anything else
   
   Existing workarounds are insufficient in isolation. Narrowing `timeAfter` 
works once but the same initial bootstrap fails again later as the repository 
accumulates runs, and historical data is forfeited. `skipOnFail=true` lets 
unrelated plugins (Jira, DORA, etc.) keep running, but `_tool_github_runs` 
still never gets populated for the affected repo. Raising `per_page` to 100 
reduces the number of requests but does not raise the 40,000-item ceiling.
   
   The only way to read past the 40k boundary is to use a filter parameter. 
Adding a `created` filter switches the endpoint into the *filtered* mode (up to 
1,000 results per search). Fixed-size windows (e.g. monthly) are not enough 
because a single month on a busy repo can exceed 1,000 runs, so the fix needs 
to bisect windows adaptively. Roughly:
   
   ```go
   // Pseudocode: bisect the created window until each half fits under
   // the 1,000-result filtered cap.
   func collectRunsAdaptive(from, to time.Time) {
       // GET .../actions/runs?created=<from>..<to>&per_page=100
       items, reachedCap := fetchWindow(from, to)
       // Guard on window size so the recursion terminates even if a
       // minimal window somehow still exceeds the cap.
       if reachedCap && to.Sub(from) > time.Second {
           mid := from.Add(to.Sub(from) / 2)
           collectRunsAdaptive(from, mid)
           collectRunsAdaptive(mid, to)
       } else {
           persist(items)
       }
   }
   ```
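   
   A self-contained sketch of that bisection against an in-memory stand-in for the API (all names here are hypothetical; `fetchWindow` simulates the 1,000-result filtered cap with half-open `[from, to)` windows so that splitting at `mid` never duplicates or drops a run):

   ```go
   package main

   import (
   	"fmt"
   	"time"
   )

   const filteredCap = 1000 // filtered /actions/runs returns at most 1,000 results

   // allRuns is an in-memory stand-in for a repository's run timestamps.
   var allRuns []time.Time

   // fetchWindow simulates GET .../actions/runs?created=<from>..<to>:
   // it returns at most filteredCap runs and whether the cap was hit.
   func fetchWindow(from, to time.Time) ([]time.Time, bool) {
   	var hits []time.Time
   	for _, t := range allRuns {
   		if !t.Before(from) && t.Before(to) {
   			hits = append(hits, t)
   		}
   	}
   	if len(hits) > filteredCap {
   		return hits[:filteredCap], true
   	}
   	return hits, false
   }

   func collectRunsAdaptive(from, to time.Time, persist func([]time.Time)) {
   	items, reachedCap := fetchWindow(from, to)
   	if reachedCap && to.Sub(from) > time.Second {
   		mid := from.Add(to.Sub(from) / 2)
   		collectRunsAdaptive(from, mid, persist)
   		collectRunsAdaptive(mid, to, persist)
   		return
   	}
   	persist(items)
   }

   func main() {
   	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
   	for i := 0; i < 5000; i++ { // 5,000 runs over ~35 days, far above the cap
   		allRuns = append(allRuns, start.Add(time.Duration(i)*10*time.Minute))
   	}
   	var got []time.Time
   	collectRunsAdaptive(start, start.Add(40*24*time.Hour), func(items []time.Time) {
   		got = append(got, items...)
   	})
   	fmt.Println(len(got)) // all 5,000 runs collected despite the per-window cap
   }
   ```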
   
   The query syntax is `created:YYYY-MM-DD..YYYY-MM-DD` (ISO 8601, supports 
`>=`, `<=`, `..`, with optional `THH:MM:SSZ` for sub-day granularity — see 
[search 
syntax](https://docs.github.com/en/search-github/getting-started-with-searching-on-github/understanding-the-search-syntax)).
 `createdAfter` is already passed to the `Query` hook so no interface change is 
needed, and for incremental runs it simply becomes the lower bound of the outer 
window.
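   
   Forwarding that bound is a small change to the `Query` hook; a sketch of what the query construction could look like (the helper name, the `windowEnd` parameter, and the RFC 3339 range formatting are illustrative, not the final patch):

   ```go
   package main

   import (
   	"fmt"
   	"net/url"
   	"strconv"
   	"time"
   )

   // buildRunsQuery mirrors the existing Query hook but also forwards the
   // time bounds as a created range filter, switching /actions/runs into
   // filtered mode. windowEnd is the upper bound of the current bisection
   // window (hypothetical; not an existing DevLake field).
   func buildRunsQuery(page, perPage int, createdAfter, windowEnd time.Time) url.Values {
   	query := url.Values{}
   	query.Set("page", strconv.Itoa(page))
   	query.Set("per_page", strconv.Itoa(perPage))
   	// e.g. created=2024-01-01T00:00:00Z..2024-02-01T00:00:00Z
   	query.Set("created", fmt.Sprintf("%s..%s",
   		createdAfter.UTC().Format(time.RFC3339),
   		windowEnd.UTC().Format(time.RFC3339)))
   	return query
   }

   func main() {
   	from := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
   	to := time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC)
   	q := buildRunsQuery(1, 100, from, to)
   	fmt.Println(q.Get("created")) // 2024-01-01T00:00:00Z..2024-02-01T00:00:00Z
   }
   ```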
   
   The problem reproduces deterministically on every bootstrap of a large 
github.com repository. For reference, none of 
[#8028](https://github.com/apache/incubator-devlake/issues/8028), 
[#8614](https://github.com/apache/incubator-devlake/issues/8614), 
[#3642](https://github.com/apache/incubator-devlake/issues/3642), 
[#3688](https://github.com/apache/incubator-devlake/issues/3688), or 
[#3199](https://github.com/apache/incubator-devlake/issues/3199) address the 
github.com 40k item cap on unfiltered `/actions/runs` pagination, although they 
touch adjacent areas (large-repo GraphQL timeouts, PageSize tunables, 
time/workflow filters, payload size).
   
   ### Version
   
   v1.0.3-beta10
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   