zahorniak opened a new issue, #8907: URL: https://github.com/apache/devlake/issues/8907
### Search before asking - [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues. ### What happened ## What happened The CircleCI plugin's `collectJobs` subtask fails on every incremental run whenever **any** CircleCI job is in a non-terminal status (`running` / `not_running` / `queued` / `on_hold`). The whole CircleCI collection aborts, so no CI/CD data (deployments, builds) is refreshed and DORA metrics go stale. The task fails with repeated HTTP 500s against a malformed URL that has an **empty workflow id**: ``` /v2/workflow//job ^^ empty path segment ``` Root cause: `CollectJobs` runs two API collectors that share the URL template `"/v2/workflow/{{ .Input.Id }}/job"`. The first ("new records") collector populates `.Input.Id` correctly; the second ("unfinished details") collector's iterator query selects `DISTINCT workflow_id` (not `id`), so the templated field `.Input.Id` is empty and DevLake requests `GET /v2/workflow//job`, which CircleCI answers with `500 Internal Server Error`. ## Environment | | | | --- | --- | | DevLake version | **v1.0.3-beta12** (image `apache/devlake:v1.0.3-beta12`) | | Plugin | `circleci` | | Database | MySQL 8.x | | CircleCI | CircleCI Cloud (`circleci.com` API v2) | | Trigger | Nightly blueprint (incremental, `skipCollectors=false`) | ## Error / logs Exact `error_name` / `message` recorded on the failed task (`_devlake_tasks`, `plugin='circleci'`, `status='TASK_FAILED'`): ``` subtask collectJobs ended unexpectedly caused by: Retry exceeded 3 times calling /v2/workflow//job. The last error was: Http DoAsync error calling [method:GET path:/v2/workflow//job query:map[]]. Response: {"message":"Internal Server Error"} (500) ``` The message repeats the same `/v2/workflow//job` 500 once per affected workflow id (8+ combined messages in our case) before the subtask aborts. ## Root cause analysis `backend/plugins/circleci/tasks/job_collector.go` defines the subtask: ```go var CollectJobsMeta = plugin.SubTaskMeta{ Name: "collectJobs", EntryPoint: CollectJobs, EnabledByDefault: true, Description: "collect circleci jobs", DomainTypes: []string{plugin.DOMAIN_TYPE_CICD}, } ``` `CollectJobs` creates **two** API collectors, both using the same URL template `"/v2/workflow/{{ .Input.Id }}/job"`. **Collector 1 — "new records" (works).** Its iterator selects `id` from the workflow table, so the templated field `.Input.Id` is populated: ```go clauses := []dal.Clause{ dal.Select("id, pipeline_id"), dal.From(&models.CircleciWorkflow{}), dal.Where("connection_id = ? and project_slug = ?", data.Options.ConnectionId, data.Options.ProjectSlug), } // ... UrlTemplate: "/v2/workflow/{{ .Input.Id }}/job", ``` **Collector 2 — "unfinished details" (broken).** Its iterator selects `DISTINCT workflow_id` (the column is `workflow_id`, **not** `id`), yet it reuses the **same** URL template referencing `.Input.Id`: ```go clauses := []dal.Clause{ dal.Select("DISTINCT workflow_id"), dal.From(&models.CircleciJob{}), dal.Where("connection_id = ? AND project_slug = ? AND status IN ('running', 'not_running', 'queued', 'on_hold')", data.Options.ConnectionId, data.Options.ProjectSlug), } // ... UrlTemplate: "/v2/workflow/{{ .Input.Id }}/job", ``` Because the row scanned by collector 2 has no `Id` field set (the query projected `workflow_id`, which maps to the struct field `WorkflowId`, not `Id`), `{{ .Input.Id }}` renders empty and the request path collapses to `/v2/workflow//job`. CircleCI returns `500` for the malformed path, the collector retries 3×, and the subtask fails — taking the entire CircleCI collection down with it. ## Evidence that this is a code bug, not missing data The affected jobs **all have a valid, non-empty `workflow_id` in the DB** — the empty id is introduced by the collector, not by the data. Status distribution of `_tool_circleci_jobs` at the time of failure: ``` status jobs ------- ----- success 74622 blocked 51314 <- NOT in the collector's status filter; not involved canceled 17742 on_hold 12368 <- queried by collector 2 failed 3031 not_run 2801 <- note: 'not_run' (data) != 'not_running' (query); not matched running 2 <- queried by collector 2 ``` Rows in collector 2's query set (`status IN ('running','not_running','queued','on_hold')`), and how many have a valid `workflow_id`: ``` status jobs with_valid_workflow_id ------- ---- ---------------------- on_hold 12368 12368 running 2 2 TOTAL 12370 12370 <- 100% have a valid workflow_id ``` A concrete example of an affected job (valid `workflow_id`, yet the collector requests `/v2/workflow//job`): ``` id : b8719700-9035-4a96-8053-58596b802e79 workflow_id : 0021f5f7-0161-4845-a8ef-eca400e9de7c <- present and valid project_slug : gh/elopage/elopage_web_client status : on_hold started_at : 2026-04-29 07:45:28 ``` So the correct request would be `GET /v2/workflow/0021f5f7-0161-4845-a8ef-eca400e9de7c/job`, but DevLake issues `GET /v2/workflow//job`. ## Impact - **The entire CircleCI collection aborts** on the failing subtask. CI/CD domain data (deployments/builds) is not refreshed → DORA *Deployment Frequency*, *Lead Time for Changes*, and *Change Failure Rate* go stale. - **It recurs indefinitely.** `on_hold` approval jobs accumulate naturally, and each *successful* collection re-collects them at their real status, so the failure returns on the next run (see "Durability of the workaround" below). ## Proposed fix Alias the projected column in collector 2 so the templated field is populated, mirroring collector 1: ```go // backend/plugins/circleci/tasks/job_collector.go — "unfinished details" collector dal.Select("DISTINCT workflow_id AS id"), ``` Alternatively, give the iterator input struct an explicit field for the workflow id and reference it in the template (e.g. `"/v2/workflow/{{ .Input.WorkflowId }}/job"`), so the two collectors don't silently depend on column aliasing. Either way the fix is one line in `CollectJobs`. ## Workaround (operator-side, non-durable) Terminalize the jobs in collector 2's query set so the "unfinished details" collector gets **zero** input rows and never builds the malformed URL: ```sql UPDATE _tool_circleci_jobs SET status = 'canceled' WHERE status IN ('running', 'not_running', 'queued', 'on_hold'); ``` After this, `collectJobs` completes successfully (verified — see below). ### Durability of the workaround The workaround lasts **exactly one collection cycle**. We verified the mechanism on our instance: 1. Before fix: `12,370` jobs in the query set → `collectJobs` fails. 2. After the `UPDATE` (query set = `0`) we triggered an incremental collection → the `circleci` task reached **`TASK_COMPLETED`** (no `/v2/workflow//job` error). 3. **However, that same collection's "new records" phase re-collected the workflows and left `9,013` jobs back in `on_hold`** (our `canceled` count fell by ~8,800 as the upsert overwrote the workaround, plus freshly-collected `on_hold` jobs). The collection succeeded only because collector 2 had already iterated the empty set before re-population happened. 4. Consequently the **next** collection's "unfinished details" phase queries those `9,013` `on_hold` jobs again and fails — so the operator must re-run the `UPDATE` before every collection. This treadmill is why an upstream code fix is needed. ### What do you expect to happen CircleCI sync will work without issues. ### How to reproduce 1. Configure a CircleCI connection + scope and run a first (full) collection — `_tool_circleci_jobs` gets populated. 2. Ensure at least one collected job is in a non-terminal status — `on_hold` is the most common (any approval-gated workflow that is waiting for / never received manual approval), but `running`/`queued`/`not_running` also trigger it. 3. Run the blueprint again (incremental). The `collectJobs` subtask enters its "unfinished details" phase, iterates the non-terminal jobs, and calls `GET /v2/workflow//job` (empty id) → `500` → `subtask collectJobs ended unexpectedly`. In practice this reproduces on **every** subsequent collection once approval-gated (`on_hold`) jobs exist, which for most CircleCI setups is continuously. ### Anything else _No response_ ### Version v1.0.3-beta12 ### Are you willing to submit PR? - [ ] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
