zclllyybb commented on issue #64189:
URL: https://github.com/apache/doris/issues/64189#issuecomment-4643474670

   I checked the live issue state and current upstream `master` code for the 
reported `BlockFileCacheTest` failures.
   
   Initial assessment:
   
   - This looks like a BE unit-test determinism problem. The current report 
does not show a production cache correctness bug.
   - `BlockFileCache::initialize_unlocked()` starts background monitor, GC, 
evict-in-advance, LRU update, dump, and replay threads for each cache instance. 
The `test_file_cache()` and `test_file_cache_memory_storage()` helpers assert 
exact internal block state/counts while those maintenance threads are live.
   - The `ttl_modify` failure is consistent with the code path: 
`run_background_evict_in_advance()` can call `try_evict_in_advance()`, which 
removes releasable downloaded blocks through the LRU reservation path and can 
change the capacity condition that the later `SKIP_CACHE` assertion depends on.
   - The `io_error` block-count failure also matches the cleanup rule in 
`FileBlocksHolder::~FileBlocksHolder()`: an `EMPTY` or deleting block is 
removed only when `use_count() == 2`; any extra live shared pointer during the 
assertion can leave the block in cache metadata.
   - The async-open timeout is also plausible. Filesystem-backed cache storage 
starts a background loading thread and sets `_async_open_done` only after the 
load finishes, while the reported test still uses a short polling loop before 
asserting readiness.
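
To make the `use_count() == 2` rule from the `io_error` point concrete, here is a minimal, self-contained sketch. `Block`, `ToyCache`, `release_from_holder`, and `blocks_left_after_release` are hypothetical stand-ins invented for illustration, not the Doris types; the sketch only assumes the behavior described above, where the cache metadata and the holder each own one `std::shared_ptr` to the block.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a cache block; not the Doris FileBlock type.
struct Block {
    enum class State { EMPTY, DOWNLOADED };
    State state = State::EMPTY;
};

// Hypothetical cache metadata that owns one shared_ptr per block.
struct ToyCache {
    std::map<std::string, std::shared_ptr<Block>> blocks;

    // Mimics the rule described above: an EMPTY block is dropped only when
    // its sole owners are the cache map and the holder being released.
    bool release_from_holder(const std::string& key, std::shared_ptr<Block>& held) {
        bool removed = false;
        if (held->state == Block::State::EMPTY && held.use_count() == 2) {
            blocks.erase(key);
            removed = true;
        }
        held.reset();  // the holder gives up its reference either way
        return removed;
    }
};

// Returns how many blocks stay in metadata after the holder releases.
std::size_t blocks_left_after_release(bool extra_reference_alive) {
    ToyCache cache;
    cache.blocks["k"] = std::make_shared<Block>();
    std::shared_ptr<Block> held = cache.blocks["k"];  // use_count() == 2
    std::shared_ptr<Block> extra;
    if (extra_reference_alive) {
        extra = held;  // use_count() == 3, so the rule does not fire
    }
    cache.release_from_holder("k", held);
    return cache.blocks.size();
}
```

In this toy model, `blocks_left_after_release(false)` leaves no block behind, while `blocks_left_after_release(true)` leaves one. That is the shape of the reported block-count mismatch: a third owner that happens to be alive at the moment of release is enough to change the count the test asserts on.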
   
   Current public state:
   
   - PR #64190 is already open for this issue. Its overall direction matches 
the code path: isolate these tests from background maintenance timing and 
centralize/increase the async-open wait.
   - Two details are worth checking during PR review. First, the async wait 
helper should fail or return an assertion result on timeout, so tests do not 
continue silently after the longer wait. Second, temporary global config 
changes should be restored with an RAII/scope guard, so an early `ASSERT_*` 
inside a helper cannot leak modified intervals to later tests.
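
The two review points can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in PR #64190: `g_monitor_interval_ms`, `ScopedOverride`, and `wait_until_true` are hypothetical names, and the real config variables and readiness flag in Doris may have different types.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a mutable global config value.
static std::int64_t g_monitor_interval_ms = 1000;

// RAII guard: overrides a value for the current scope and restores the
// previous value on scope exit, including early returns from the scope.
template <typename T>
class ScopedOverride {
public:
    ScopedOverride(T& target, T value) : _target(target), _saved(target) {
        _target = value;
    }
    ~ScopedOverride() { _target = _saved; }
    ScopedOverride(const ScopedOverride&) = delete;
    ScopedOverride& operator=(const ScopedOverride&) = delete;

private:
    T& _target;
    T _saved;
};

// Polls `flag` until it becomes true or `timeout` elapses. The return value
// tells the caller whether the flag was observed true, so a timeout cannot
// pass silently.
bool wait_until_true(const std::atomic<bool>& flag,
                     std::chrono::milliseconds timeout,
                     std::chrono::milliseconds poll = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!flag.load()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return flag.load();  // one last check at the deadline
        }
        std::this_thread::sleep_for(poll);
    }
    return true;
}
```

In a gtest case, the result would typically be wrapped as `ASSERT_TRUE(wait_until_true(ready_flag, timeout))` so a timeout fails the test at the wait itself. Because a fatal `ASSERT_*` returns from the enclosing function, a `ScopedOverride` declared at the top of a helper still restores the original interval when such an assertion fires.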
   
   Missing validation:
   
   - Please attach or link a flaky CI job / stress-run output if available, and 
verify PR #64190 with repeated `BlockFileCacheTest` runs under CPU load.
   - At minimum, rerun the directly affected cases (`ttl_modify`, `io_error`, 
and `evict_privilege_order_for_ttl`) repeatedly in the same build mode where 
the flake was observed.
   - If any flake remains after #64190, the useful follow-up evidence would be 
the full gtest log around the failed case, the exact BE test command, 
sanitizer/build type, host load, and the file-cache test temp directory state.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

