iversonleo587-cpu opened a new issue, #64826:
URL: https://github.com/apache/doris/issues/64826

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   doris-4.0.5-rc01-59de8c4c524
   Git commit: 59de8c4c524
   JDK: OpenJDK 17.0.18+8 (/opt/jdk-17.0.18+8/bin/java)
   OS: Linux x86_64
   
   ### What's Wrong?
   
   Our BE node (BackendId: 1780885838226, Host: 10.66.7.1) crashes repeatedly 
with SIGSEGV on version 4.0.5-rc01. This is NOT caused by the Linux OOM killer 
(no matching records in dmesg).
   Two distinct crash patterns are observed in be.out:
   **Pattern 1 – Query execution (most frequent)**
   SIGSEGV at address @0x38 in bthread worker thread:
     bthread::TaskGroup::sched_to → run_main_task → worker_thread
   **Pattern 2 – Background cache management (2026-06-25 00:20)**
   SIGSEGV at address @0x8 during Page Cache capacity adjustment:
     Daemon::cache_adjust_capacity_thread
     → CacheManager::for_each_cache_refresh_capacity
     → LRUCache::set_capacity
     → SegmentFooterPB::~SegmentFooterPB
   **Most recent production incident (2026-06-25):**
   - 08:11:34  BE started (PID 419300)
   - ~13:40:22  BE crashed (Aborted at unix time 1782366022)
   - 13:41:10  Client ETL job failed: (1105, 'errCode = 2, detailMessage = 
backend 1780885838226 is down')
   - 13:41:55  BE restarted automatically
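
As a sanity check on the timeline above, the following minimal sketch converts the `Aborted at unix time` value to local time and derives the intervals between the listed events. It assumes the wall-clock times in this report are CST in the sense of UTC+8; the variable names are illustrative only and are not taken from any Doris tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in this report means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

# Values copied from the incident timeline.
crash = datetime.fromtimestamp(1782366022, tz=CST)
be_start = datetime(2026, 6, 25, 8, 11, 34, tzinfo=CST)
client_error = datetime(2026, 6, 25, 13, 41, 10, tzinfo=CST)
be_restart = datetime(2026, 6, 25, 13, 41, 55, tzinfo=CST)

uptime_before_crash = crash - be_start      # how long the BE ran before crashing
crash_to_client_error = client_error - crash  # delay until the ETL job saw the failure
downtime = be_restart - crash               # gap between crash and restart

print(crash.isoformat())        # 2026-06-25T13:40:22+08:00
print(uptime_before_crash)      # 5:28:48
print(crash_to_client_error)    # 0:00:48
print(downtime)                 # 0:01:33
```

Under that assumption the epoch value matches the ~13:40:22 crash time, the BE had been up for roughly five and a half hours, and it was down for about a minute and a half.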
   **Crash frequency on this single BE (same version):**
   | Date       | Query id                          | Stack                           |
   |------------|-----------------------------------|---------------------------------|
   | 2026-06-22 | 6c0e5ba4c3ab45f2-a3734229ec2d08ba | bthread SIGSEGV @0x38           |
   | 2026-06-23 | 0-0                               | bthread SIGSEGV @0x38           |
   | 2026-06-24 | 7133b64a2c24c1d-942bd6304aa27668  | bthread SIGSEGV @0x38           |
   | 2026-06-24 | 594b7141026d19bf-7a05c24fda60e79f | bthread SIGSEGV @0x38           |
   | 2026-06-25 | 0-0                               | Page Cache/SegmentFooterPB @0x8 |
   | 2026-06-25 | 0-0                               | bthread SIGSEGV @0x38           |
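
To make the distribution in the table easier to see, here is a small sketch that tallies the listed crashes by stack signature, by date, and by whether a query id was attached. The rows are copied from the table; the tuple layout and variable names are illustrative assumptions, not a be.out parser.

```python
from collections import Counter

# Rows copied from the crash table: (date, query id, stack signature).
crashes = [
    ("2026-06-22", "6c0e5ba4c3ab45f2-a3734229ec2d08ba", "bthread SIGSEGV @0x38"),
    ("2026-06-23", "0-0", "bthread SIGSEGV @0x38"),
    ("2026-06-24", "7133b64a2c24c1d-942bd6304aa27668", "bthread SIGSEGV @0x38"),
    ("2026-06-24", "594b7141026d19bf-7a05c24fda60e79f", "bthread SIGSEGV @0x38"),
    ("2026-06-25", "0-0", "Page Cache/SegmentFooterPB @0x8"),
    ("2026-06-25", "0-0", "bthread SIGSEGV @0x38"),
]

by_stack = Counter(stack for _, _, stack in crashes)
by_date = Counter(date for date, _, _ in crashes)
# "0-0" means the crash was not attributed to a specific query.
no_query_id = sum(1 for _, query_id, _ in crashes if query_id == "0-0")

print(dict(by_stack))  # {'bthread SIGSEGV @0x38': 5, 'Page Cache/SegmentFooterPB @0x8': 1}
print(dict(by_date))   # {'2026-06-22': 1, '2026-06-23': 1, '2026-06-24': 2, '2026-06-25': 2}
print(no_query_id)     # 3
```

Five of the six listed crashes share the bthread signature, and half of them carry no query id.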
   **BE environment:**
   - CPU: 32 cores, Memory: ~125 GB
   - TabletNum: 17860, DataUsedCapacity: ~585 GB, Disk UsedPct: ~35%
   - Workload mode: mix
   **Stack trace – most recent crash (2026-06-25 ~13:40, before restart at 
13:41:55):**
   
   ### What You Expected?
   
   The BE should remain stable under heavy query/load pressure. If resource 
limits are exceeded, Doris should fail the query gracefully (e.g., memory limit 
exceeded, query cancelled), not crash the entire BE process with SIGSEGV.
   After a failure, clients should receive a query-level error, not "backend is 
down" caused by a process crash.
   
   ### How to Reproduce?
   
   We have not isolated a minimal reproducible case yet. Crashes appear under 
sustained production load on a single BE node running 4.0.5-rc01.
   **Observed trigger pattern:**
   1. BE runs for several hours under mixed query/load workload
   2. A heavy INSERT INTO ... SELECT ETL job starts
   3. BE crashes with SIGSEGV in bthread::TaskGroup::sched_to
   4. FE reports "backend <id> is down" to clients
   5. BE is restarted (~1 minute later)
   **Recent failing workload (high level, no full SQL):**
   - Job: ETL via DolphinScheduler (pre environment)
   - SQL pattern:
     - TRUNCATE TABLE dws.dws_finance_sku_order_time_expense_stats_h
     - Two large INSERT INTO ... SELECT statements (is_combination=0 and 
is_combination=1)
     - Source: large DWD fact tables (dwd_finance_order_expense_*)
     - Operations: GROUP BY on 20+ columns, UNION ALL, multiple LEFT JOINs (dim 
tables + ods table)
   - Task runtime before failure: ~3 minutes (13:37:58 → 13:41:10 CST)
   **Cluster info:**
   - Single affected BE: 10.66.7.1 (BackendId 1780885838226)
   - 17860 tablets on this BE (relatively heavy)
   - No disk pressure (UsedPct ~35%)
   - No Linux OOM kill records in dmesg
   **What we ruled out:**
   - Linux OOM killer: dmesg shows no OOM/kill records around crash time
   - Disk full: UsedPct ~35%
   - Manual shutdown: isShutdown=false in SHOW BACKENDS
   - One-off incident: the same BE crashed 6+ times between 2026-06-22 and 
2026-06-25, with the same two stack patterns recurring
   **To help reproduce:**
   We can provide full be.out crash logs and SQL privately if needed. The crash 
also occurs without a specific query id (Query id: 0-0), suggesting it may be 
triggered by general load rather than one specific SQL shape.
   
   
   ### Anything Else?
   
   **Impact:**
   - Production ETL jobs fail with "backend is down"
   - BE requires repeated manual/automatic restarts
   - Data risk: jobs use TRUNCATE before INSERT, leaving target table 
empty/incomplete on failure
   **Questions for maintainers:**
   1. Are there known fixes for bthread::TaskGroup::sched_to SIGSEGV (@0x38) in 
4.0.x after commit 59de8c4c524?
   2. Is the SegmentFooterPB destructor crash during 
cache_adjust_capacity_thread() a known issue in 4.0.5-rc01?
   3. Is 4.0.5-rc01 recommended for production? Which stable version should we 
upgrade to?
   4. Any workaround besides version upgrade (cache/memory config, query-side 
mitigation)?
   **Attachments we can provide:**
   - Full be.out crash sections
   - be.WARNING / be.INFO around 13:40-13:42 CST on 2026-06-25
   - SHOW BACKENDS output
   - Full SQL text (privately if needed)
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

