[I] [Bug][customize] ExtractCustomizedFields aborts the whole scan on the first row with an orphaned _raw_data_id [devlake]

via GitHub Sat, 20 Jun 2026 09:08:21 -0700


danielemoraschi opened a new issue, #8945:
URL: https://github.com/apache/devlake/issues/8945


   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   The `customize` plugin's `ExtractCustomizedFields` subtask silently stops 
extracting custom fields the moment it encounters a single domain-layer row 
whose raw record no longer exists. Every remaining row in the cursor is left 
unprocessed, yet the subtask finishes as `TASK_COMPLETED`, so there is no error 
or warning.
   
   **Root cause**: 
`backend/plugins/customize/tasks/customized_fields_extractor.go`
   
   The extractor cursor uses a LEFT JOIN to the raw table:
   
   ```go
   dal.Join(fmt.Sprintf(" LEFT JOIN %s ON %s._raw_data_id = %s.id", rawTable, table, rawTable)),
   
   // When a domain row's _raw_data_id is orphaned (the raw record was cleaned
   // up / replaced by a later collection), the joined data column is NULL.
   // The per-row type switch then falls through to its default branch:
   
   switch blob := row["data"].(type) {
     case []byte:
         ...
     case string:
         ...
     default:
         return nil   // <-- exits the entire function, skipping all remaining rows
   }
   ```
   
   `return nil` returns from `extractCustomizedFields` entirely instead of 
skipping just that one row. Because cursor order is arbitrary, the first 
orphaned row encountered ends extraction for everything after it.
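   
   The control flow can be reproduced in isolation. The following is a minimal 
sketch, not the plugin's code: `extractAbortOnOrphan` and the in-memory `rows` 
slice are hypothetical stand-ins for the real cursor loop, and a `nil` value 
stands in for the `NULL` that the LEFT JOIN yields for an orphaned row.
   
```go
package main

import "fmt"

// extractAbortOnOrphan is a hypothetical stand-in for the extractor loop.
// It counts the rows it manages to process before stopping.
func extractAbortOnOrphan(rows []map[string]interface{}) int {
	processed := 0
	for _, row := range rows {
		switch row["data"].(type) {
		case []byte, string:
			processed++ // the real code would parse the JSON blob here
		default:
			// Mirrors the reported behaviour: this leaves the whole
			// function, not just the current loop iteration.
			return processed
		}
	}
	return processed
}

func main() {
	rows := []map[string]interface{}{
		{"data": `{"a":1}`},
		{"data": nil}, // orphaned row: joined data column is NULL
		{"data": []byte(`{"a":2}`)},
		{"data": `{"a":3}`},
	}
	fmt.Println(extractAbortOnOrphan(rows)) // prints 1
}
```
   
   Only the row before the orphan is processed, and the caller sees no error, 
which matches the "completed but unpopulated" symptom described above.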
   
   **Impact**
   - Custom `x_*` columns silently stop being populated for a 
cursor-order-dependent subset of rows: some boards/projects populate fully 
while others stay entirely `NULL`.
   - The effect persists across syncs because already-written custom columns 
are not cleared, which makes it look like only "new" data is affected.
   - No error surfaces; the subtask reports success.
   
   In our instance, a connection with 30,287 matched rows had 1,382 orphaned 
rows. Extraction stopped early and left ~28,000 rows unpopulated.
   
   ### What do you expect to happen
   
   A domain row with no matching raw record should be skipped, and extraction 
should continue with the remaining rows. One orphaned `_raw_data_id` should 
never abort the entire subtask.
   
   ### How to reproduce
   
   1. Configure a customize transformation rule that maps a JSON path into a 
custom column, e.g. `issues.x_defect_category` <- 
`fields.customfield_XXXXX.value`, applied either via a blueprint `afterPlan` or 
a standalone `customize` pipeline.
   2. Ensure at least one matched domain row has a `_raw_data_id` with no 
corresponding record in the raw table. This happens naturally after 
re-collection orphans old raw records.
   3. Run the customize subtask.
   4. Observe: rows ordered after the first orphaned row are left `NULL` 
despite having valid raw data and matching the filter, yet the subtask still 
completes successfully.
   
   
   Confirming query (MySQL) that counts orphaned rows within the rule's filter:
   ```sql
   SELECT SUM(r.id IS NULL) AS orphaned, COUNT(*) AS total
   FROM issues i
   LEFT JOIN _raw_jira_api_issues r ON i._raw_data_id = r.id
   WHERE i._raw_data_table = '_raw_jira_api_issues'
     AND i._raw_data_params LIKE '{"ConnectionId":1,%';
   ```
   
   ### Anything else
   
   Happens every time once at least one orphaned `_raw_data_id` exists among 
the matched rows.
   Which rows end up unpopulated depends on cursor order, so the symptom looks 
erratic across projects/boards even though the cause is deterministic.
   
   Suggested fix: change the `default` branch in the value switch from `return 
nil` to `continue`, so orphaned rows are skipped instead of aborting the scan. 
Happy to submit this as a PR with a regression test for the orphaned-row case.
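   
   A rough sketch of the intended behaviour, under stated assumptions: 
`extractSkippingOrphans` and the in-memory `rows` slice are hypothetical and 
only model the loop shape, not the real cursor or DAL calls. Returning a 
`skipped` count is an extra idea beyond the one-line change, so that a warning 
could be logged instead of failing silently.
   
```go
package main

import "fmt"

// extractSkippingOrphans is a hypothetical model of the fixed loop: rows whose
// joined "data" value is neither []byte nor string are skipped and counted.
func extractSkippingOrphans(rows []map[string]interface{}) (payloads []string, skipped int) {
	for _, row := range rows {
		var payload string
		switch blob := row["data"].(type) {
		case []byte:
			payload = string(blob)
		case string:
			payload = blob
		default:
			skipped++
			continue // applies to the enclosing for loop: next row
		}
		payloads = append(payloads, payload)
	}
	return payloads, skipped
}

func main() {
	rows := []map[string]interface{}{
		{"data": `{"a":1}`},
		{"data": nil}, // orphaned row: joined data column is NULL
		{"data": []byte(`{"a":2}`)},
		{"data": `{"a":3}`},
	}
	payloads, skipped := extractSkippingOrphans(rows)
	fmt.Println(len(payloads), skipped) // prints 3 1
}
```
   
   A regression test for the orphaned-row case could assert the same thing 
against the real subtask: rows placed after an orphan still get their custom 
columns populated.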
   
   I searched existing issues; the closest are #8173 (create-field, unrelated) 
and #7571 (wildcard rawDataParams, unrelated), but neither covers this 
extraction abort.
   
   v1.0.3-beta8 (also confirmed unchanged on main at v1.0.3-beta13)
   
   ### Version
   
   v1.0.3-beta8
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
