danielemoraschi opened a new issue, #8945: URL: https://github.com/apache/devlake/issues/8945
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

The `customize` plugin's `ExtractCustomizedFields` subtask silently stops extracting custom fields the moment it encounters a single domain-layer row whose raw record no longer exists. Every remaining row in the cursor is left unprocessed, yet the subtask finishes as `TASK_COMPLETED`, so there is no error or warning.

**Root cause**: `backend/plugins/customize/tasks/customized_fields_extractor.go`

The extractor cursor uses a LEFT JOIN to the raw table:

```go
dal.Join(fmt.Sprintf(" LEFT JOIN %s ON %s._raw_data_id = %s.id", rawTable, table, rawTable)),

// When a domain row's _raw_data_id is orphaned (the raw record was cleaned up /
// replaced by a later collection), the joined data column is NULL.
// The per-row type switch then falls through to its default branch:
switch blob := row["data"].(type) {
case []byte:
	...
case string:
	...
default:
	return nil // <-- exits the entire function, skipping all remaining rows
}
```

`return nil` returns from `extractCustomizedFields` entirely instead of skipping just that one row. Because cursor order is arbitrary, the first orphaned row encountered ends extraction for everything after it.

**Impact**

- Custom `x_*` columns silently stop being populated for a cursor-order-dependent subset of rows: some boards/projects populate fully while others stay entirely `NULL`.
- The effect persists across syncs because already-written custom columns are not cleared, which makes it look like only "new" data is affected.
- No error surfaces; the subtask reports success.

In our instance, a connection with 30,287 matched rows had 1,382 orphaned rows. Extraction stopped early and left ~28,000 rows unpopulated.

### What do you expect to happen

A domain row with no matching raw record should be skipped, and extraction should continue with the remaining rows.
One orphaned `_raw_data_id` should never abort the entire subtask.

### How to reproduce

1. Configure a customize transformation rule that maps a JSON path into a custom column, e.g. `issues.x_defect_category` <- `fields.customfield_XXXXX.value`, applied either via a blueprint `afterPlan` or a standalone `customize` pipeline.
2. Ensure at least one matched domain row has a `_raw_data_id` with no corresponding record in the raw table. This happens naturally after re-collection orphans old raw records.
3. Run the customize subtask.
4. Observe: rows ordered after the first orphaned row are left `NULL` despite having valid raw data and matching the filter, and the subtask still completes successfully.

Confirming query (MySQL), which counts orphaned rows within the rule's filter:

```sql
SELECT SUM(r.id IS NULL) AS orphaned, COUNT(*) AS total
FROM issues i
LEFT JOIN _raw_jira_api_issues r ON i._raw_data_id = r.id
WHERE i._raw_data_table = '_raw_jira_api_issues'
  AND i._raw_data_params LIKE '{"ConnectionId":1,%';
```

### Anything else

This happens every time once at least one orphaned `_raw_data_id` exists among the matched rows. Which rows end up unpopulated depends on cursor order, so the symptom looks erratic across projects/boards even though the cause is deterministic.

Suggested fix: change the `default` branch in the value switch from `return nil` to `continue`, so orphaned rows are skipped instead of aborting the scan. Happy to submit this as a PR with a regression test for the orphaned-row case.

I searched existing issues; the closest are #8173 (create-field, unrelated) and #7571 (wildcard rawDataParams, unrelated), but neither covers this extraction abort.

Observed on v1.0.3-beta8 (also confirmed unchanged on main at v1.0.3-beta13).

### Version

v1.0.3-beta8

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!
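
For reference, here is a minimal, self-contained sketch of the suggested `return nil` to `continue` change. It is not the actual DevLake code: the `row` type, the `extractField` helper, and the flat JSON lookup are all hypothetical stand-ins for the real cursor loop, chosen only to show that `continue` inside the `default` branch moves on to the next row instead of ending the scan.

```go
package main

import "encoding/json"

// row is a hypothetical stand-in for one joined cursor row (column name -> value).
// A nil "data" value models an orphaned _raw_data_id, i.e. the LEFT JOIN found
// no matching raw record.
type row map[string]interface{}

// extractField is a hypothetical helper that decodes each row's raw JSON blob
// and collects the value stored under `field`, keyed by the row's "id".
func extractField(rows []row, field string) (map[string]interface{}, error) {
	out := make(map[string]interface{})
	for _, r := range rows {
		var raw []byte
		switch blob := r["data"].(type) {
		case []byte:
			raw = blob
		case string:
			raw = []byte(blob)
		default:
			// Orphaned row: skip it and keep scanning. Returning here instead
			// would silently drop every row that follows.
			continue
		}
		var doc map[string]interface{}
		if err := json.Unmarshal(raw, &doc); err != nil {
			return nil, err
		}
		id, _ := r["id"].(string)
		if v, ok := doc[field]; ok {
			out[id] = v
		}
	}
	return out, nil
}
```

A regression test along these lines could feed in a valid row, an orphaned row, and another valid row, then assert that both valid rows are extracted. In Go, `continue` inside a `switch` applies to the enclosing `for` loop, which is why the one-word change is sufficient in this sketch.
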

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
