zahorniak opened a new issue, #8907:
URL: https://github.com/apache/devlake/issues/8907

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   ## What happened
   
   The CircleCI plugin's `collectJobs` subtask fails on every incremental run 
whenever **any** CircleCI job is in a non-terminal status (`running` / 
`not_running` / `queued` / `on_hold`). The whole CircleCI collection aborts, so 
no CI/CD data (deployments, builds) is refreshed and DORA metrics go stale.
   
   The task fails with repeated HTTP 500s against a malformed URL that has an 
**empty workflow id**:
   
   ```
   /v2/workflow//job
            ^^ empty path segment
   ```
   
   Root cause: `CollectJobs` runs two API collectors that share the URL 
template `"/v2/workflow/{{ .Input.Id }}/job"`. The first ("new records") 
collector populates `.Input.Id` correctly; the second ("unfinished details") 
collector's iterator query selects `DISTINCT workflow_id` (not `id`), so the 
templated field `.Input.Id` is empty and DevLake requests `GET 
/v2/workflow//job`, which CircleCI answers with `500 Internal Server Error`.
   
   ## Environment
   
   | | |
   | --- | --- |
   | DevLake version | **v1.0.3-beta12** (image `apache/devlake:v1.0.3-beta12`) 
|
   | Plugin | `circleci` |
   | Database | MySQL 8.x |
   | CircleCI | CircleCI Cloud (`circleci.com` API v2) |
   | Trigger | Nightly blueprint (incremental, `skipCollectors=false`) |
   
   ## Error / logs
   
   Exact `error_name` / `message` recorded on the failed task 
(`_devlake_tasks`, `plugin='circleci'`, `status='TASK_FAILED'`):
   
   ```
   subtask collectJobs ended unexpectedly
   caused by: Retry exceeded 3 times calling /v2/workflow//job.
   The last error was: Http DoAsync error calling [method:GET 
path:/v2/workflow//job query:map[]].
   Response: {"message":"Internal Server Error"} (500)
   ```
   
   The message repeats the same `/v2/workflow//job` 500 once per affected 
workflow id (8+ combined messages in our case) before the subtask aborts.
   
   ## Root cause analysis
   
   `backend/plugins/circleci/tasks/job_collector.go` defines the subtask:
   
   ```go
   var CollectJobsMeta = plugin.SubTaskMeta{
        Name:             "collectJobs",
        EntryPoint:       CollectJobs,
        EnabledByDefault: true,
        Description:      "collect circleci jobs",
        DomainTypes:      []string{plugin.DOMAIN_TYPE_CICD},
   }
   ```
   
   `CollectJobs` creates **two** API collectors, both using the same URL 
template `"/v2/workflow/{{ .Input.Id }}/job"`.
   
   **Collector 1 — "new records" (works).** Its iterator selects `id` from the 
workflow table, so the templated field `.Input.Id` is populated:
   
   ```go
   clauses := []dal.Clause{
        dal.Select("id, pipeline_id"),
        dal.From(&models.CircleciWorkflow{}),
        dal.Where("connection_id = ? and project_slug = ?",
                data.Options.ConnectionId, data.Options.ProjectSlug),
   }
   // ...
   UrlTemplate: "/v2/workflow/{{ .Input.Id }}/job",
   ```
   
   **Collector 2 — "unfinished details" (broken).** Its iterator selects 
`DISTINCT workflow_id` (the column is `workflow_id`, **not** `id`), yet it 
reuses the **same** URL template referencing `.Input.Id`:
   
   ```go
   clauses := []dal.Clause{
        dal.Select("DISTINCT workflow_id"),
        dal.From(&models.CircleciJob{}),
        dal.Where("connection_id = ? AND project_slug = ? AND status IN 
('running', 'not_running', 'queued', 'on_hold')",
                data.Options.ConnectionId, data.Options.ProjectSlug),
   }
   // ...
   UrlTemplate: "/v2/workflow/{{ .Input.Id }}/job",
   ```
   
   Because the row scanned by collector 2 has no `Id` field set (the query 
projected `workflow_id`, which maps to the struct field `WorkflowId`, not 
`Id`), `{{ .Input.Id }}` renders empty and the request path collapses to 
`/v2/workflow//job`. CircleCI returns `500` for the malformed path, the 
collector retries 3×, and the subtask fails — taking the entire CircleCI 
collection down with it.
   
   ## Evidence that this is a code bug, not missing data
   
   The affected jobs **all have a valid, non-empty `workflow_id` in the DB** — 
the empty id is introduced by the collector, not by the data.
   
   Status distribution of `_tool_circleci_jobs` at the time of failure:
   
   ```
   status     jobs
   -------    -----
   success    74622
   blocked    51314    <- NOT in the collector's status filter; not involved
   canceled   17742
   on_hold    12368    <- queried by collector 2
   failed      3031
   not_run     2801    <- note: 'not_run' (data) != 'not_running' (query); not 
matched
   running        2    <- queried by collector 2
   ```
   
   Rows in collector 2's query set (`status IN 
('running','not_running','queued','on_hold')`), and how many have a valid 
`workflow_id`:
   
   ```
   status     jobs    with_valid_workflow_id
   -------    ----    ----------------------
   on_hold    12368   12368
   running        2       2
   TOTAL      12370   12370   <- 100% have a valid workflow_id
   ```
   
   A concrete example of an affected job (valid `workflow_id`, yet the 
collector requests `/v2/workflow//job`):
   
   ```
   id           : b8719700-9035-4a96-8053-58596b802e79
   workflow_id  : 0021f5f7-0161-4845-a8ef-eca400e9de7c   <- present and valid
   project_slug : gh/elopage/elopage_web_client
   status       : on_hold
   started_at   : 2026-04-29 07:45:28
   ```
   
   So the correct request would be `GET 
/v2/workflow/0021f5f7-0161-4845-a8ef-eca400e9de7c/job`, but DevLake issues `GET 
/v2/workflow//job`.
   
   ## Impact
   
   - **The entire CircleCI collection aborts** on the failing subtask. CI/CD 
domain data (deployments/builds) is not refreshed → DORA *Deployment 
Frequency*, *Lead Time for Changes*, and *Change Failure Rate* go stale.
   - **It recurs indefinitely.** `on_hold` approval jobs accumulate naturally, 
and each *successful* collection re-collects them at their real status, so the 
failure returns on the next run (see "Durability of the workaround" below).
   
   ## Proposed fix
   
   Alias the projected column in collector 2 so the templated field is 
populated, mirroring collector 1:
   
   ```go
   // backend/plugins/circleci/tasks/job_collector.go — "unfinished details" 
collector
   dal.Select("DISTINCT workflow_id AS id"),
   ```
   
   Alternatively, give the iterator input struct an explicit field for the 
workflow id and reference it in the template (e.g. `"/v2/workflow/{{ 
.Input.WorkflowId }}/job"`), so the two collectors don't silently depend on 
column aliasing. Either way the fix is one line in `CollectJobs`.
   
   ## Workaround (operator-side, non-durable)
   
   Terminalize the jobs in collector 2's query set so the "unfinished details" 
collector gets **zero** input rows and never builds the malformed URL:
   
   ```sql
   UPDATE _tool_circleci_jobs
   SET status = 'canceled'
   WHERE status IN ('running', 'not_running', 'queued', 'on_hold');
   ```
   
   After this, `collectJobs` completes successfully (verified — see below).
   
   ### Durability of the workaround
   
   The workaround lasts **exactly one collection cycle**. We verified the 
mechanism on our instance:
   
   1. Before fix: `12,370` jobs in the query set → `collectJobs` fails.
   2. After the `UPDATE` (query set = `0`) we triggered an incremental 
collection → the `circleci` task reached **`TASK_COMPLETED`** (no 
`/v2/workflow//job` error).
   3. **However, that same collection's "new records" phase re-collected the 
workflows and left `9,013` jobs back in `on_hold`** (our `canceled` count fell 
by ~8,800 as the upsert overwrote the workaround, plus freshly-collected 
`on_hold` jobs). The collection succeeded only because collector 2 had already 
iterated the empty set before re-population happened.
   4. Consequently the **next** collection's "unfinished details" phase queries 
those `9,013` `on_hold` jobs again and fails — so the operator must re-run the 
`UPDATE` before every collection. This treadmill is why an upstream code fix is 
needed.
   
   
   
   ### What do you expect to happen
   
   CircleCI sync will work without issues. 
   
   ### How to reproduce
   
   1. Configure a CircleCI connection + scope and run a first (full) collection 
— `_tool_circleci_jobs` gets populated.
   2. Ensure at least one collected job is in a non-terminal status — `on_hold` 
is the most common (any approval-gated workflow that is waiting for / never 
received manual approval), but `running`/`queued`/`not_running` also trigger it.
   3. Run the blueprint again (incremental). The `collectJobs` subtask enters 
its "unfinished details" phase, iterates the non-terminal jobs, and calls `GET 
/v2/workflow//job` (empty id) → `500` → `subtask collectJobs ended 
unexpectedly`.
   
   In practice this reproduces on **every** subsequent collection once 
approval-gated (`on_hold`) jobs exist, which for most CircleCI setups is 
continuously.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   v1.0.3-beta12
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to