zclllyybb commented on issue #64826:
URL: https://github.com/apache/doris/issues/64826#issuecomment-4796379393
This content is generated by AI for reference only.
Initial triage from the issue text and the branch-4.0 code at `59de8c4c524`:
1. This should be treated as a real BE crash bug, not as normal query
memory-limit behavior. The absence of Linux OOM-killer records only rules out a
kernel kill; it does not explain repeated SIGSEGV inside the BE process.
2. The `bthread::TaskGroup::sched_to` frame alone is not enough to identify
the Doris operator that corrupted state. Doris 4.0.x bundles brpc 1.4.0, where
`sched_to(pg, tid)` immediately resolves and dereferences `TaskMeta`. A SIGSEGV
at a very small address such as `0x38` in that frame usually means invalid
scheduler/task metadata or prior memory corruption. The useful frames are the
Doris frames below the bthread scheduler, which are not included in the issue
body.
3. For the page-cache stack, the reported path is plausible as an
eviction/destruction site: `cache_adjust_capacity_thread()` calls
`CacheManager::for_each_cache_refresh_capacity()`, which calls
`LRUCachePolicy::adjust_capacity_weighted()`, which calls
`LRUCache::set_capacity()` and may evict cached entries. Segment footers are
cached in the storage page cache as a
`MemoryTrackedPageWithPagePtr<std::shared_ptr<SegmentFooterPB>>`, so
`SegmentFooterPB::~SegmentFooterPB` can legitimately appear while evicting
index-page-cache entries. However, the normal lifetime path appears intended to
be safe: `PageCacheHandle` holds a cache handle while copying out the
`shared_ptr`, and `Segment` stores only a `weak_ptr`. So a crash in the
destructor is more suggestive of memory corruption or a cache-entry lifetime
bug than of ordinary page-cache capacity adjustment by itself.
4. I checked the local 4.0 tags: `4.0.5` and `4.0.5-rc01` point to the same
commit `59de8c4c524`, while `4.0.6` points to `4.0.6-rc02`. The `4.0.6` tag, on
the same release line, contains several crash fixes made after `59de8c4c524`,
including:
- `#63257`: fixes a branch-4.0 BE SIGSEGV when shared-hash-table hash
joins publish runtime filters during close. This is potentially relevant
because the workload includes multiple joins and heavy `INSERT INTO ... SELECT`.
- `#62427`: removes `SchemaCache` to fix a concurrent crash in
`OlapScanner::prepare`.
- `#61444`: fixes a file-cache background LRU SIGSEGV when clearing cache.
None of these is proven to share a root cause with the `SegmentFooterPB`
destructor crash in the shortened stack, but together they are sufficient
evidence that staying on `4.0.5/4.0.5-rc01` is risky for a production workload
that is already crashing repeatedly.
Suggested next steps:
1. Upgrade the affected BE from `4.0.5/4.0.5-rc01` to the latest available
4.0.x release (at minimum the local `4.0.6` tag on the same release line) before
spending much time tuning around the crash. This is the strongest immediate
mitigation because branch-4.0 has known crash fixes after the reported SHA.
2. As a temporary query-side mitigation for the ETL job, try disabling
shared hash table for broadcast joins at the session/job level:
```sql
SET enable_share_hash_table_for_broadcast_join = false;
```
This targets the `#63257` class of crash. It may reduce performance, and
it only helps if the full stack or profile confirms the failing plan uses
broadcast hash joins/shared runtime filters.
3. Avoid `TRUNCATE` followed by a large `INSERT INTO ... SELECT` as the
production recovery pattern while this BE is unstable. Where possible, load into
a staging table and then publish it with an atomic rename/swap, so that a BE
crash does not leave the target table empty.
4. If the page-cache crash recurs before upgrade and it always occurs in
`cache_adjust_capacity_thread`, `disable_memory_gc=true` can be considered only
as emergency containment with close memory monitoring. It skips that background
adjustment path but can worsen memory pressure, so it should not be the primary
fix.
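The staging-and-swap pattern from step 3 could look roughly like the following. This is a minimal sketch, not a tested procedure: `target_tbl`, `target_tbl_staging`, and `source_tbl` are hypothetical names, the `SELECT` stands in for the real ETL query, and the `ALTER TABLE ... REPLACE WITH TABLE` behavior should be confirmed against the documentation for the deployed Doris version.

```sql
-- Hypothetical names throughout; substitute the real ETL tables and query.

-- 1. Create an empty staging table with the same schema as the target.
CREATE TABLE target_tbl_staging LIKE target_tbl;

-- 2. Run the heavy load against the staging table only. If the BE crashes
--    here, target_tbl still holds the previous data.
INSERT INTO target_tbl_staging
SELECT * FROM source_tbl;

-- 3. Sanity-check the staged data before publishing it.
SELECT COUNT(*) FROM target_tbl_staging;

-- 4. Swap the two tables. With 'swap' = 'true' the names are
--    exchanged, so target_tbl_staging afterwards holds the old data.
ALTER TABLE target_tbl REPLACE WITH TABLE target_tbl_staging
PROPERTIES('swap' = 'true');

-- 5. Optional: drop the old data once the new table has been verified.
DROP TABLE target_tbl_staging;
```

Keeping the old data under the staging name until step 5 leaves a rollback path if the newly loaded table turns out to be incomplete.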
Information still needed to root-cause this precisely:
1. Full `be.out` fatal sections for all crash patterns, including the signal
header and complete symbolized stack, not only the top frames.
2. `be.INFO` and `be.WARNING` for at least 5-10 minutes around each crash,
especially `[MemoryGC]` cache-capacity logs, fragment cancellation logs,
runtime-filter/hash-join logs, and any file/page-cache warnings.
3. The full SQL or at least `EXPLAIN`/profile for the failing ETL
statements, with session variables used by DolphinScheduler.
4. BE config/session values for `disable_memory_gc`,
`enable_memory_orphan_check`, `storage_page_cache_limit`,
`index_page_cache_percentage`, `enable_share_hash_table_for_broadcast_join`,
query/load memory limits, and whether page/file cache was enabled.
5. A core dump or `addr2line` output produced with the exact `doris_be`
binary/build id from `59de8c4c524`.
6. Whether the repeated `bthread::TaskGroup::sched_to @0x38` crashes include
Doris frames such as `RuntimeFilterWrapper`, `HashJoinBuildSink`,
`OlapScanner`, exchange/data-stream code, or cache code below the bthread
scheduler.
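For the session values requested in item 4, one way to capture them is to run the statements below in the same kind of session DolphinScheduler uses for the ETL job. This is only a sketch: `exec_mem_limit` is assumed here to be the relevant query memory limit variable, and the BE-side settings such as `disable_memory_gc` and `storage_page_cache_limit` are BE configuration rather than session variables, so they need to be collected separately from the BE.

```sql
-- Session variable named in this comment.
SHOW VARIABLES LIKE 'enable_share_hash_table_for_broadcast_join';

-- Assumed name of the per-query memory limit; adjust if the job uses a
-- different limit.
SHOW VARIABLES LIKE 'exec_mem_limit';
```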
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]