zclllyybb commented on issue #63629: URL: https://github.com/apache/doris/issues/63629#issuecomment-4534590780
Breakwater-GitHub-Analysis-Slot: slot_24015f562221

Initial triage for maintainers:

- Current issue metadata: open issue, version `4.0.5`, no labels/comments at the time of this check. The title is still generic (`test github issues`), but the body contains a real BE SIGSEGV stack.
- The reported BE commit `59de8c4c524` matches the local `4.0.5` / `4.0.5-rc01` tag.
- The visible crash frame is `ScannerContext::get_free_block()` calling `_free_blocks.try_dequeue()` (`be/src/vec/exec/scan/scanner_context.cpp:197-209`). I would not treat `moodycamel::ConcurrentQueue` as the root cause from this stack alone; it is likely the victim frame after earlier memory/schema corruption.
- In `4.0.5`, `ScannerScheduler::_scanner_scan()` calls `scanner->prepare()` / `scanner->open()` before the read loop calls `ctx->get_free_block()` (`be/src/vec/exec/scan/scanner_scheduler.cpp:192-263`). So the suspicious state can be created before this top stack is reached.
- The high-risk code path is `OlapScanner::prepare()` in `4.0.5`: it can reuse a shared `TabletSchema` from `SchemaCache` (`be/src/vec/exec/scan/olap_scanner.cpp:193-200`), and later `_init_tablet_reader_params()` may mutate that schema via `merge_dropped_columns()` when delete predicates are present (`be/src/vec/exec/scan/olap_scanner.cpp:395-398`; `be/src/olap/tablet_schema.cpp:1361-1377`). With parallel scan enabled by default in FE session variables, multiple scanners can prepare concurrently and share/mutate the cached schema.
- This matches the later branch-4.0 fix lineage: #61853 avoids caching schemas that would be mutated by delete predicates, and #62427 removes `SchemaCache` from `OlapScanner::prepare()` so every scanner builds its own `TabletSchema` to avoid concurrent modification. I checked locally that neither the #61853 commit nor the #62427 commit is included in `4.0.5`.

Preliminary judgment:

This looks highly likely to be the known 4.0.5 `OlapScanner` / `SchemaCache` concurrent schema mutation crash, not a queue implementation bug. The most direct validation is to retry on a branch-4.0 build that includes #62427, or cherry-pick #62427 onto the affected 4.0.5 build. As a short-term mitigation/diagnostic step, retry the same workload with `set enable_parallel_scan=false`; if the crash disappears, it further supports the concurrent scanner prepare path.

Missing information still needed for a hard reproduction:

- The exact SQL text for query id `64e864ce12f5486b-89edc47b0e6e87e9`, including whether it is `INSERT INTO ... SELECT ...`.
- DDL for the source/target tables, especially key model, delete predicates, indexes, schema-change history, and any dropped columns.
- FE audit/profile for the query and BE logs around this query id before the SIGSEGV.
- Session variables and BE/FE configs, especially `enable_parallel_scan`, `parallel_scan_max_scanners_count`, and whether local mode changes scanner scheduling.
- Confirmation whether the same workload still crashes on a build containing #62427.

Suggested next action:

Ask the reporter to retest with a build including #62427 or temporarily disable parallel scan. If it no longer reproduces, this issue can be linked to that fix path. If it still reproduces, please attach the SQL/DDL/log/profile above so we can separate it from the already-fixed `SchemaCache` concurrency issue.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
