Re: [I] [Bug] BE Disk IO Saturation with High Self-Heal Rate Causing S3 Cache Persistence Failure [doris]

via GitHub Sun, 21 Jun 2026 19:14:28 -0700


zclllyybb commented on issue #64657:
URL: https://github.com/apache/doris/issues/64657#issuecomment-4764132427


   Breakwater-GitHub-Analysis-Slot: slot_ca3e2afdc689
   This content is generated by AI for reference only.
   
   Initial code-backed triage:
   
   The symptom points to the local BE file-cache layer rather than to tablet 
distribution itself. In 4.1.0-rc03, 
`cached_remote_reader_self_heal_on_not_found` is incremented only when 
`CachedRemoteFileReader` finds a cache block marked `DOWNLOADED` but the local 
cache-file read then returns `NOT_FOUND`. Doris then falls back to the remote 
reader for correctness and schedules 
`_cache->remove_if_cached_async(_cache_hash)`.
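
   To make the control flow above concrete, here is a toy Python model of the 
self-heal path. It is only a sketch of the behaviour described in this comment, 
not Doris code: `ToyCache`, `read_block`, and `remote_read` are hypothetical 
names, and the real implementation lives in the C++ `CachedRemoteFileReader`.

```python
from enum import Enum


class BlockState(Enum):
    EMPTY = "EMPTY"
    DOWNLOADED = "DOWNLOADED"


class ToyCache:
    """Hypothetical stand-in for the BE file cache: metadata plus local files."""

    def __init__(self):
        self.meta = {}    # (cache_hash, offset) -> BlockState
        self.files = {}   # (cache_hash, offset) -> bytes actually on disk
        self.self_heal_on_not_found = 0   # models the bvar counter
        self.pending_removals = []        # models remove_if_cached_async

    def read_local(self, key):
        # Raises FileNotFoundError when metadata exists but the file is gone.
        if key not in self.files:
            raise FileNotFoundError(key)
        return self.files[key]


def read_block(cache, cache_hash, offset, remote_read):
    """Serve one block, self-healing when metadata and local file disagree."""
    key = (cache_hash, offset)
    if cache.meta.get(key) is BlockState.DOWNLOADED:
        try:
            return cache.read_local(key)
        except FileNotFoundError:
            # Stale metadata: count it, schedule cleanup, serve from remote.
            cache.self_heal_on_not_found += 1
            cache.pending_removals.append(cache_hash)
            return remote_read(offset)
    # Block not marked DOWNLOADED: ordinary cache miss.
    return remote_read(offset)
```

   In this model the counter only moves when metadata says `DOWNLOADED` and the 
local file is missing, which is why a steadily rising value suggests the 
mismatch keeps being recreated rather than cleaned up once.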
   
   This is the same class of stale file-cache metadata/local-file mismatch that 
public PRs #60977 and #61205 addressed. 4.1.0-rc03 already contains that 
self-heal logic, so a continuously rising counter means the affected BE is 
still repeatedly reaching stale `DOWNLOADED` entries, or cache blocks are being 
written and then immediately removed/not retained.
   
   `BytesWriteIntoCache` should not be treated as proof that the cache was 
successfully persisted. In this version, the read path increments the profile's 
write-into-cache bytes for the block after the remote-read write-back loop even 
when `append()` or `finalize()` failed and logged `Write data to file cache 
failed`. Therefore the reported 49-70 MB per query can still coexist with zero 
reusable cache if the cache path has write/rename/delete errors, inode/space 
pressure, or aggressive eviction.
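
   The counting behaviour described above can be illustrated with a small 
Python sketch. This is a simplified model under the assumption that the byte 
counter is bumped after each block regardless of write outcome; `write_back`, 
`append`, `finalize`, and the `profile` dict are hypothetical stand-ins, not 
Doris APIs.

```python
def write_back(blocks, append, finalize, profile):
    """Toy write-back loop; returns the number of blocks actually persisted."""
    persisted = 0
    for data in blocks:
        try:
            append(data)
            finalize()
            persisted += 1
        except OSError:
            # Corresponds to the "Write data to file cache failed" log line.
            pass
        # Counted whether or not the write succeeded, as described above.
        profile["BytesWriteIntoCache"] += len(data)
    return persisted
```

   With an `append` that always fails, the profile still reports every byte 
while nothing reusable is left on disk, so the counter needs to be read 
together with the write-failure logs.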
   
   Most suspicious directions for this single BE:
   
   1. Cache path disk or inode pressure. By default, Doris enters disk 
resource limit mode at 90% usage and starts evicting in advance at 88%, which 
matches the reported high IO/utilization range. In that state Doris can loop 
through remote read -> local cache write -> eviction/removal -> next-query miss.
   2. File-cache metadata and cache files are inconsistent on that BE. The v3 
file cache loads block metadata from the local RocksDB meta store, and missing 
local block files then surface as `DOWNLOADED + NOT_FOUND`.
   3. Cache-file deletion/removal is delayed or failing. Please check the 
recycle queue, async remove logs, and RocksDB meta-store write/delete failures.
   4. If `enable_read_cache_file_directly=true` on this BE, also check the 
direct-read path. That path reads cached blocks through `_cache_file_readers`; 
a local read failure there only falls through to the indirect path, and the 
direct-read branch itself does not perform the same self-heal.
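
   For direction 1, a quick way to see where a cache mount sits relative to the 
88% and 90% defaults mentioned above is sketched below. It assumes a Unix host 
and uses `os.statvfs`; how Doris itself computes the percentage, and whether the 
comparison is inclusive, are assumptions here, so treat the result as a rough 
indicator only. `usage_percent` and `classify` are hypothetical helper names.

```python
import os


def usage_percent(path):
    """Return (space_used_pct, inode_used_pct) for the filesystem holding path."""
    st = os.statvfs(path)
    space = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks if st.f_blocks else 0.0
    inodes = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0
    return space, inodes


def classify(pct, evict_in_advance_pct=88.0, resource_limit_pct=90.0):
    """Map a usage percentage onto the default thresholds quoted above."""
    # Assumption: thresholds are treated as inclusive lower bounds.
    if pct >= resource_limit_pct:
        return "resource_limit"
    if pct >= evict_in_advance_pct:
        return "evict_in_advance"
    return "ok"
```

   Running `classify` on both values returned by `usage_percent` for the 
directory configured in `file_cache_path` covers the space and inode cases 
together.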
   
   Useful evidence to attach from the affected BE and one normal BE:
   
   - BE log snippets around the spike for: `Cache block file is missing, will 
self-heal by clearing cache hash`, `Read data failed from file cache downloaded 
by others`, `Write data to file cache failed`, `open file failed with both v3 
and v2 format`, `mode run in resource limit`, `need evict cache in advance`, 
`Failed to write to rocksdb`, and `Failed to delete to rocksdb`.
   - Bvar metrics for the affected cache path: 
`cached_remote_reader_self_heal_on_not_found`, `cached_remote_reader_s3_read`, 
`cached_remote_reader_peer_read`, 
`cached_remote_reader_failed_get_peer_addr_counter`, file-cache hit 
ratio/no-warmup hit ratio, cache size/capacity, queue sizes, per-reason evict 
bytes, `file_cache_total_evict_size`, `file_cache_disk_limit_mode`, 
`file_cache_need_evict_cache_in_advance`, `file_cache_recycle_keys_length`, 
`file_cache_meta_store_write_queue_size`, 
`file_cache_meta_rocksdb_write_failed_num`, and 
`file_cache_meta_rocksdb_delete_failed_num`.
   - The BE config values for `file_cache_path`, cache capacity, 
`file_cache_each_block_size`, `enable_read_cache_file_directly`, 
`enable_evict_file_cache_in_advance`, `file_cache_remove_block_qps_limit`, and 
`file_cache_background_gc_interval_ms`.
   - `df -h` and `df -i` for the cache mount on the bad BE and a normal BE, 
plus whether any external cleanup, pod reschedule, disk replacement, or BE 
restart happened before the counter spike.
   - The same query profile from the affected BE and a normal BE, including 
local/remote/peer bytes and timers, `BytesWriteIntoCache`, and write-cache time.
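
   To collect the log evidence above in a comparable form, a small counter over 
the listed messages can help. The sketch below only does substring matching on 
the strings quoted in this comment; `count_patterns` is a hypothetical helper, 
and the log file name in the usage note is an assumption.

```python
from collections import Counter

# Log messages quoted in the evidence list above.
PATTERNS = [
    "Cache block file is missing, will self-heal by clearing cache hash",
    "Read data failed from file cache downloaded by others",
    "Write data to file cache failed",
    "open file failed with both v3 and v2 format",
    "mode run in resource limit",
    "need evict cache in advance",
    "Failed to write to rocksdb",
    "Failed to delete to rocksdb",
]


def count_patterns(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern (plain substring match)."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts
```

   For example, `count_patterns(open("be.INFO"))` on the affected BE and on a 
normal BE gives two tables that can be pasted side by side into the issue.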
   
   Short-term mitigation, after preserving the above evidence: if this is 
isolated to one BE and cold-cache refill is acceptable, clearing the file cache 
on the affected BE with the existing `/api/file_cache?op=clear&sync=true` path 
should remove stale cache metadata/files and force a clean rebuild. If the 
counter immediately grows again after that, the remaining root cause is likely 
ongoing cache write/finalize failure, disk/inode pressure, or external deletion 
of cache files rather than old stale metadata.
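
   A minimal sketch of calling the clear path mentioned above from Python 
follows. The endpoint path and query string come from this comment; the host, 
the port (8040 in the usage note is an assumed BE webserver port), the use of a 
plain GET, and the helper names are assumptions to verify against your 
deployment before running.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import urlopen


def clear_cache_url(host, port):
    """Build the file-cache clear URL for one BE."""
    query = urlencode({"op": "clear", "sync": "true"})
    return urlunsplit(("http", f"{host}:{port}", "/api/file_cache", query, ""))


def clear_cache(host, port, timeout=60):
    """Issue the request and return (HTTP status, response body)."""
    # Assumption: the endpoint accepts an unauthenticated GET.
    with urlopen(clear_cache_url(host, port), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

   For instance, `clear_cache("be-host", 8040)` would target a single BE; 
recording `cached_remote_reader_self_heal_on_not_found` before and after the 
call shows whether the counter resumes growing.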
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

