yamoyamoto opened a new issue, #8842: URL: https://github.com/apache/incubator-devlake/issues/8842
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

When collecting workflow runs from a large GitHub.com (Enterprise Cloud) repository, the `Collect Workflow Runs` subtask fails with HTTP 422 once the collector crosses an undocumented pagination boundary, and the whole pipeline aborts. A representative error log:

```
subtask Collect Workflow Runs ended unexpectedly
Wraps: (2) Error waiting for async Collector execution
| combined messages:
| {
|   Retry exceeded 3 times calling repos/{owner}/{repo}/actions/runs.
|   The last error was: Http DoAsync error calling
|   [method:GET path:repos/{owner}/{repo}/actions/runs
|   query:map[page:[1340] per_page:[30]]].
|   Response: {"message":"In order to keep the API fast for everyone,
|     pagination is limited for this resource.",
|     "documentation_url":"https://docs.github.com/v3/#pagination",
|     "status":"422"} (422)
| }
```

`page=1340 × per_page=30 = 40,200` items, which crosses a hard cap of roughly `per_page × page ≤ 40,000` items that github.com enforces on unfiltered `/actions/runs` pagination.

The cap is easy to probe directly with `gh api`: `per_page=100&page=400` returns HTTP 200 with `total_count: 40000` and `Link rel="last" page=400`, while `per_page=100&page=401` (and anything beyond) returns HTTP 422 with the same `"pagination is limited for this resource"` message. `total_count` is clamped at 40,000 even though the repository has significantly more runs.

Neither this 40k boundary nor the 422 response is described in the [official docs for this endpoint](https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository), which only mention a separate cap for *filtered* queries ("up to 1,000 results for each search when using `actor`, `branch`, `check_suite_id`, `created`, `event`, `head_sha`, or `status`").
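The probe results above pin down the boundary arithmetic. As a minimal sketch (the `lastSafePage` helper and the 40,000 constant below are illustrations of the observed behavior, not a documented GitHub limit), the last retrievable page for a given `per_page` is just the integer quotient:

```go
package main

import "fmt"

// itemCap is the ceiling observed on unfiltered /actions/runs
// pagination on github.com (per_page × page ≤ 40,000).
const itemCap = 40000

// lastSafePage returns the highest page number that stays under the
// observed cap for a given per_page; any later page is expected to
// return HTTP 422.
func lastSafePage(perPage int) int {
	return itemCap / perPage
}

func main() {
	fmt.Println(lastSafePage(100)) // 400 — matches the gh api probe
	fmt.Println(lastSafePage(30))  // 1333 — page 1334 crosses the cap
}
```

This also explains why the failing request in the log (`page=1340`, `per_page=30`) is past the boundary while `per_page=100&page=400` is exactly at it.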
On the DevLake side, the root cause is in [`backend/plugins/github/tasks/cicd_run_collector.go` L77-84](https://github.com/apache/incubator-devlake/blob/a21382accbcccbf7ae7d9e2d3558a07cc4301afc/backend/plugins/github/tasks/cicd_run_collector.go#L77-L84):

```go
Query: func(reqData *helper.RequestData, createdAfter *time.Time) (url.Values, errors.Error) {
	query := url.Values{}
	query.Set("page", fmt.Sprintf("%v", reqData.Pager.Page))
	query.Set("per_page", fmt.Sprintf("%v", reqData.Pager.Size))
	return query, nil
},
```

`createdAfter` is received but never forwarded to the server — time filtering happens purely on the client side. With `Concurrency=10` and `PageSize=30`, concurrent workers blindly advance until some of them cross the 40k boundary and hit 422. Once 3 retries are exhausted for any one page, the whole subtask fails and no partial data is salvaged (`Extract Workflow Runs` / `Convert Workflow Runs` in the same task do not run either).

GraphQL is not a viable fallback: the [`Repository` object in GitHub's GraphQL schema](https://docs.github.com/en/graphql/reference/objects#repository) has no `workflowRuns`, `workflows`, or `actions` field, and [`WorkflowRun` is only reachable as a singular field on `CheckSuite`](https://docs.github.com/en/graphql/reference/objects#checksuite) (not a connection), so there is no way to list a repository's workflow runs via GraphQL. DevLake's `github_graphql` plugin already acknowledges this by importing the REST collector ([`plugins/github_graphql/impl/impl.go:97`](https://github.com/apache/incubator-devlake/blob/a21382accbcccbf7ae7d9e2d3558a07cc4301afc/backend/plugins/github_graphql/impl/impl.go#L97)).

### What do you expect to happen

`Collect Workflow Runs` should successfully collect the full set of workflow runs within the blueprint's `timeAfter` window regardless of total volume. Large repositories on github.com with more than 40,000 runs in-window should not cause the pipeline to abort.
### How to reproduce

Configure a GitHub connection pointing at a github.com repository that has more than 40,000 workflow runs, set the blueprint's `timeAfter` to a date far enough back that the range contains >40,000 runs, and trigger "Collect Data". The pipeline fails on `Collect Workflow Runs` once the collector reaches `page × per_page > 40,000` (with the default `PageSize=30` this happens near page 1,334). Any sufficiently busy CI repository reaches the boundary eventually; no specific GitHub feature beyond volume is required.

### Anything else

Existing workarounds are insufficient in isolation:

- Narrowing `timeAfter` works once, but the same initial bootstrap fails again later as the repository accumulates runs, and historical data is forfeited.
- `skipOnFail=true` lets unrelated plugins (Jira, DORA, etc.) keep running, but `_tool_github_runs` still never gets populated for the affected repo.
- Raising `per_page` to 100 reduces the number of requests but does not raise the 40,000-item ceiling.

The only way to read past the 40k boundary is to use a filter parameter. Adding a `created` filter switches the endpoint into the *filtered* mode (up to 1,000 results per search). Fixed-size windows (e.g. monthly) are not enough because a single month on a busy repo can exceed 1,000 runs, so the fix needs to bisect windows adaptively. Roughly:

```go
// Pseudocode: split the requested window until each piece fits under
// the 1,000-result cap of a filtered query.
func collectRunsAdaptive(from, to time.Time) {
	// GET .../actions/runs?created=<from>..<to>&per_page=100
	items, reachedCap := fetchWindow(from, to)
	if reachedCap {
		mid := from.Add(to.Sub(from) / 2)
		collectRunsAdaptive(from, mid)
		collectRunsAdaptive(mid, to)
	} else {
		persist(items)
	}
}
```

The query syntax is `created:YYYY-MM-DD..YYYY-MM-DD` (ISO 8601, supports `>=`, `<=`, `..`, with optional `THH:MM:SSZ` for sub-day granularity — see [search syntax](https://docs.github.com/en/search-github/getting-started-with-searching-on-github/understanding-the-search-syntax)).
`createdAfter` is already passed to the `Query` hook, so no interface change is needed; for incremental runs it simply becomes the lower bound of the outer window. The problem reproduces deterministically on every bootstrap of a large github.com repository.

For reference, none of [#8028](https://github.com/apache/incubator-devlake/issues/8028), [#8614](https://github.com/apache/incubator-devlake/issues/8614), [#3642](https://github.com/apache/incubator-devlake/issues/3642), [#3688](https://github.com/apache/incubator-devlake/issues/3688), or [#3199](https://github.com/apache/incubator-devlake/issues/3199) addresses the github.com 40k item cap on unfiltered `/actions/runs` pagination, although they touch adjacent areas (large-repo GraphQL timeouts, PageSize tunables, time/workflow filters, payload size).

### Version

v1.0.3-beta10

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
