zclllyybb commented on issue #64826:
URL: https://github.com/apache/doris/issues/64826#issuecomment-4796379393

   This content is generated by AI for reference only.
   
   Initial triage from the issue text and the branch-4.0 code at `59de8c4c524`:
   
   1. This should be treated as a real BE crash bug, not as normal query 
memory-limit behavior. The absence of Linux OOM-killer records only rules out a 
kernel kill; it does not explain repeated SIGSEGV inside the BE process.
   
   2. The `bthread::TaskGroup::sched_to` frame alone is not enough to identify 
the Doris operator that corrupted state. Doris 4.0.x bundles brpc 1.4.0, where 
`sched_to(pg, tid)` immediately resolves and dereferences `TaskMeta`. A SIGSEGV 
at a very small address such as `0x38` in that frame usually means invalid 
scheduler/task metadata or prior memory corruption. The useful frames are the 
Doris frames below the bthread scheduler, which are not included in the issue 
body.
   
   3. For the page-cache stack, the reported path is plausible as an 
eviction/destruction site: `cache_adjust_capacity_thread()` calls 
`CacheManager::for_each_cache_refresh_capacity()`, which calls 
`LRUCachePolicy::adjust_capacity_weighted()`, which calls 
`LRUCache::set_capacity()` and may evict cached entries. Segment footers are 
cached in the storage page cache as a 
`MemoryTrackedPageWithPagePtr<std::shared_ptr<SegmentFooterPB>>`, so 
`SegmentFooterPB::~SegmentFooterPB` can legitimately appear while evicting 
index-page-cache entries. However, the normal lifetime path appears intended to 
be safe: `PageCacheHandle` holds a cache handle while copying out the 
`shared_ptr`, and `Segment` stores only a `weak_ptr`. So a crash in the 
destructor is more suggestive of memory corruption or a cache-entry lifetime 
bug than of ordinary page-cache capacity adjustment by itself.
   
   4. I checked the local 4.0 tags: `4.0.5` and `4.0.5-rc01` point to the same 
commit `59de8c4c524`, while `4.0.6` points to `4.0.6-rc02`. The `4.0.6` tag on 
the same release line contains several crash fixes after `59de8c4c524`, 
including:
      - `#63257`: fixes a branch-4.0 BE SIGSEGV when shared-hash-table hash 
joins publish runtime filters during close. This is potentially relevant 
because the workload includes multiple joins and heavy `INSERT INTO ... SELECT`.
      - `#62427`: removes `SchemaCache` to fix a concurrent crash in 
`OlapScanner::prepare`.
      - `#61444`: fixes a file-cache background LRU SIGSEGV when clearing cache.
   
   None of those is proven to be the same root cause as the `SegmentFooterPB` 
destructor crash from the shortened stack, but they are sufficient evidence 
that staying on `4.0.5/4.0.5-rc01` is risky for a production workload that is 
already crashing repeatedly.
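
   To illustrate the lifetime reasoning in point 3, here is a minimal C++ 
sketch. It is not Doris code: `Footer`, `ToyCache`, and 
`eviction_with_live_reader` are hypothetical names standing in for 
`SegmentFooterPB`, the page cache, and a reader, and the toy omits the real 
cache handles and LRU policy. It only shows that when the cache owns a 
`shared_ptr` and other holders keep a `weak_ptr`, eviction alone does not run 
the destructor while a reader still holds a copy.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Counts destructor runs of the hypothetical Footer type below.
int g_footers_destroyed = 0;

// Hypothetical stand-in for SegmentFooterPB.
struct Footer {
    ~Footer() { ++g_footers_destroyed; }
};

// Hypothetical toy cache: each entry owns a shared_ptr to the cached value.
class ToyCache {
public:
    void insert(const std::string& key, std::shared_ptr<Footer> value) {
        entries_[key] = std::move(value);
    }
    // Copies the shared_ptr out, so the caller shares ownership.
    std::shared_ptr<Footer> lookup(const std::string& key) const {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            return nullptr;
        }
        return it->second;
    }
    // Models shrinking the capacity to zero: every entry is evicted.
    void evict_all() { entries_.clear(); }

private:
    std::unordered_map<std::string, std::shared_ptr<Footer>> entries_;
};

struct EvictionResult {
    int destroyed_during_eviction;  // destructor runs while a reader held a copy
    int destroyed_after_release;    // destructor runs after the reader let go
    bool weak_ref_expired;          // whether the weak_ptr sees the object as gone
};

EvictionResult eviction_with_live_reader() {
    g_footers_destroyed = 0;
    ToyCache cache;
    cache.insert("seg0", std::make_shared<Footer>());

    // Long-lived holder keeps only a weak_ptr, so it does not extend lifetime.
    std::weak_ptr<Footer> segment_ref = cache.lookup("seg0");

    EvictionResult result{};
    {
        // A reader promotes the weak_ptr and now shares ownership.
        std::shared_ptr<Footer> reader = segment_ref.lock();
        assert(reader != nullptr);
        cache.evict_all();  // the cache drops its reference
        result.destroyed_during_eviction = g_footers_destroyed;
    }  // the reader releases the last reference here
    result.destroyed_after_release = g_footers_destroyed;
    result.weak_ref_expired = segment_ref.expired();
    return result;
}
```

   Calling `eviction_with_live_reader()` from a small `main` should report no 
destruction during eviction and exactly one after the reader releases its copy. 
If the real code follows this pattern on every path, a crash inside the 
destructor points at an already-corrupted object rather than at eviction itself.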
   
   Suggested next steps:
   
   1. Upgrade the affected BE from `4.0.5/4.0.5-rc01` to the latest available 
4.0.x release, or at minimum the local `4.0.6` tag on the same release line, 
before spending much time tuning around the crash. This is the strongest 
immediate mitigation because there are known branch-4.0 crash fixes after the 
reported SHA.
   
   2. As a temporary query-side mitigation for the ETL job, try disabling 
shared hash table for broadcast joins at the session/job level:
      ```sql
      SET enable_share_hash_table_for_broadcast_join = false;
      ```
      This targets the `#63257` class of crash. It may reduce performance, and 
it only helps if the full stack or profile confirms the failing plan uses 
broadcast hash joins/shared runtime filters.
   
   3. Avoid `TRUNCATE` followed by a large `INSERT INTO ... SELECT` as the 
production recovery pattern while this BE is unstable. Where possible, load 
into a staging table and then publish it with an atomic rename/swap, so a BE 
crash does not leave the target table empty.
   
   4. If the page-cache crash recurs before upgrade and it always occurs in 
`cache_adjust_capacity_thread`, `disable_memory_gc=true` can be considered only 
as emergency containment with close memory monitoring. It skips that background 
adjustment path but can worsen memory pressure, so it should not be the primary 
fix.
   
   Information still needed to root-cause this precisely:
   
   1. Full `be.out` fatal sections for all crash patterns, including the signal 
header and complete symbolized stack, not only the top frames.
   2. `be.INFO` and `be.WARNING` for at least 5-10 minutes around each crash, 
especially `[MemoryGC]` cache-capacity logs, fragment cancellation logs, 
runtime-filter/hash-join logs, and any file/page-cache warnings.
   3. The full SQL or at least `EXPLAIN`/profile for the failing ETL 
statements, with session variables used by DolphinScheduler.
   4. BE config/session values for `disable_memory_gc`, 
`enable_memory_orphan_check`, `storage_page_cache_limit`, 
`index_page_cache_percentage`, `enable_share_hash_table_for_broadcast_join`, 
query/load memory limits, and whether page/file cache was enabled.
   5. A core dump or `addr2line` output produced with the exact `doris_be` 
binary/build id from `59de8c4c524`.
   6. Whether the repeated `bthread::TaskGroup::sched_to @0x38` crashes include 
Doris frames such as `RuntimeFilterWrapper`, `HashJoinBuildSink`, 
`OlapScanner`, exchange/data-stream code, or cache code below the bthread 
scheduler.
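
   On item 6: a fault address such as `0x38` is what a member read through a 
null pointer looks like, because the faulting address is the null base plus the 
member's offset. The C++ sketch below uses a hypothetical `FakeTaskMeta` layout 
chosen so that one field sits at offset `0x38`. It is not the real brpc 
`TaskMeta` layout, and the 4096-byte threshold is an assumed heuristic (one 
typical page), not a rule taken from Doris or brpc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout, NOT brpc's real TaskMeta: seven 8-byte fields
// (bytes 0x00..0x37) precede `target`, which therefore starts at 0x38.
struct FakeTaskMeta {
    std::uint64_t earlier_fields[7];
    void* target;
};

static_assert(offsetof(FakeTaskMeta, target) == 0x38,
              "layout chosen so that `target` sits at offset 0x38");

// Address that a read of `base->target` would touch for a given base pointer
// value. No memory is dereferenced; this is plain address arithmetic.
std::uintptr_t field_address(std::uintptr_t base) {
    return base + offsetof(FakeTaskMeta, target);
}

// Assumed heuristic: a fault address below one typical page (4096 bytes) is
// consistent with "null pointer plus member offset".
bool looks_like_null_plus_offset(std::uintptr_t fault_address) {
    return fault_address < 4096;
}
```

   With a null base, `field_address(0)` evaluates to `0x38`. If the real frame 
behaves this way, the pointer was already null or invalid before `sched_to` 
ran, which is why the Doris frames below the scheduler matter more than the 
top frame.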
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

