iversonleo587-cpu opened a new issue, #64826: URL: https://github.com/apache/doris/issues/64826

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

- doris-4.0.5-rc01-59de8c4c524
- Git commit: 59de8c4c524
- JDK: OpenJDK 17.0.18+8 (/opt/jdk-17.0.18+8/bin/java)
- OS: Linux x86_64

### What's Wrong?

Our BE node (BackendId: 1780885838226, Host: 10.66.7.1) crashes repeatedly with SIGSEGV on version 4.0.5-rc01. This is NOT caused by the Linux OOM killer (no matching records in dmesg).

Two distinct crash patterns are observed in be.out:

**Pattern 1: Query execution (most frequent)**

SIGSEGV at address @0x38 in a bthread worker thread:

bthread::TaskGroup::sched_to → run_main_task → worker_thread

**Pattern 2: Background cache management (2026-06-25 00:20)**

SIGSEGV at address @0x8 during Page Cache capacity adjustment:

Daemon::cache_adjust_capacity_thread → CacheManager::for_each_cache_refresh_capacity → LRUCache::set_capacity → SegmentFooterPB::~SegmentFooterPB

**Most recent production incident (2026-06-25):**

- 08:11:34 BE started (PID 419300)
- ~13:40:22 BE crashed (Aborted at unix time 1782366022)
- 13:41:10 Client ETL job failed: (1105, 'errCode = 2, detailMessage = backend 1780885838226 is down')
- 13:41:55 BE restarted automatically

**Crash frequency on this single BE (same version):**

| Date       | Query id                          | Stack                           |
|------------|-----------------------------------|---------------------------------|
| 2026-06-22 | 6c0e5ba4c3ab45f2-a3734229ec2d08ba | bthread SIGSEGV @0x38           |
| 2026-06-23 | 0-0                               | bthread SIGSEGV @0x38           |
| 2026-06-24 | 7133b64a2c24c1d-942bd6304aa27668  | bthread SIGSEGV @0x38           |
| 2026-06-24 | 594b7141026d19bf-7a05c24fda60e79f | bthread SIGSEGV @0x38           |
| 2026-06-25 | 0-0                               | Page Cache/SegmentFooterPB @0x8 |
| 2026-06-25 | 0-0                               | bthread SIGSEGV @0x38           |

**BE environment:**

- CPU: 32 cores, Memory: ~125 GB
- TabletNum: 17860, DataUsedCapacity: ~585 GB, Disk UsedPct: ~35%
- Workload mode: mix

**Stack trace, most recent crash (2026-06-25 ~13:40, before restart at 13:41:55):**

### What You Expected?

BE should remain stable under heavy query/load pressure. If resource limits are exceeded, Doris should fail the query gracefully (e.g., memory limit exceeded, query cancelled), not crash the entire BE process with SIGSEGV. After a failure, clients should receive a query-level error, not "backend is down" due to a process crash.

### How to Reproduce?

We have not isolated a minimal reproducible case yet. Crashes appear under sustained production load on a single BE node running 4.0.5-rc01.

**Observed trigger pattern:**

1. BE runs for several hours under mixed query/load workload
2. A heavy INSERT INTO ... SELECT ETL job starts
3. BE crashes with SIGSEGV in bthread::TaskGroup::sched_to
4. FE reports "backend <id> is down" to clients
5. BE is restarted (~1 minute later)

**Recent failing workload (high level, no full SQL):**

- Job: ETL via DolphinScheduler (pre environment)
- SQL pattern:
  - TRUNCATE TABLE dws.dws_finance_sku_order_time_expense_stats_h
  - Two large INSERT INTO ... SELECT statements (is_combination=0 and is_combination=1)
  - Source: large DWD fact tables (dwd_finance_order_expense_*)
  - Operations: GROUP BY on 20+ columns, UNION ALL, multiple LEFT JOINs (dim tables + ods table)
- Task runtime before failure: ~3 minutes (13:37:58 → 13:41:10 CST)

**Cluster info:**

- Single affected BE: 10.66.7.1 (BackendId 1780885838226)
- 17860 tablets on this BE (relatively heavy)
- No disk pressure (UsedPct ~35%)
- No Linux OOM kill records in dmesg

**What we ruled out:**

- Linux OOM killer: dmesg shows no OOM/kill records around crash time
- Disk full: UsedPct ~35%
- Manual shutdown: isShutdown=false in SHOW BACKENDS
- One-off incident: same BE crashed 6+ times in 3 days with identical stack patterns

**To help reproduce:**

We can provide full be.out crash logs and SQL privately if needed. The crash also occurs without a specific query id (Query id: 0-0), suggesting it may be triggered by general load rather than one specific SQL shape.

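For anyone lining up be.out crash headers with the client-side timeline above, here is a minimal Python sketch that pulls the abort timestamps out of a log excerpt and converts them to CST (UTC+8). The regex assumes the glog-style wording `Aborted at <seconds> (unix time)`, and the `sample` line is a hypothetical excerpt; only the timestamp 1782366022 comes from this report.

```python
import re
from datetime import datetime, timedelta, timezone

# CST here means China Standard Time (UTC+8), as used in the report's timeline.
CST = timezone(timedelta(hours=8))

# Assumed glog-style crash header; adjust the pattern if be.out words it differently.
ABORT_RE = re.compile(r"Aborted at (\d+) \(unix time\)")


def crash_times(be_out_text):
    """Return the CST datetime of every crash header found in the text."""
    return [
        datetime.fromtimestamp(int(m.group(1)), tz=CST)
        for m in ABORT_RE.finditer(be_out_text)
    ]


# Hypothetical excerpt: only the timestamp is taken from the report.
sample = '*** Aborted at 1782366022 (unix time) try "date -d @1782366022" ***'

for ts in crash_times(sample):
    print(ts.isoformat())  # prints 2026-06-25T13:40:22+08:00
```

The converted value agrees with the ~13:40:22 crash time in the incident timeline, about 48 seconds before the client saw "backend 1780885838226 is down". Running the same function over the full be.out would list every crash on this BE for comparison with the frequency table.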
### Anything Else?

**Impact:**

- Production ETL jobs fail with "backend is down"
- BE requires repeated manual/automatic restarts
- Data risk: jobs use TRUNCATE before INSERT, leaving the target table empty/incomplete on failure

**Questions for maintainers:**

1. Are there known fixes for the bthread::TaskGroup::sched_to SIGSEGV (@0x38) in 4.0.x after commit 59de8c4c524?
2. Is the SegmentFooterPB destructor crash during cache_adjust_capacity_thread() a known issue in 4.0.5-rc01?
3. Is 4.0.5-rc01 recommended for production? Which stable version should we upgrade to?
4. Is there any workaround besides a version upgrade (cache/memory config, query-side mitigation)?

**Attachments we can provide:**

- Full be.out crash sections
- be.WARNING / be.INFO around 13:40-13:42 CST on 2026-06-25
- SHOW BACKENDS output
- Full SQL text (privately if needed)

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
