zclllyybb commented on issue #64189: URL: https://github.com/apache/doris/issues/64189#issuecomment-4643474670
Breakwater-GitHub-Analysis-Slot: slot_20b5b0e4212d

I checked the live issue state and current upstream `master` code for the reported `BlockFileCacheTest` failures.

Initial assessment:
- This looks like a BE unit-test determinism problem, not evidence of a production cache correctness bug from the current report.
- `BlockFileCache::initialize_unlocked()` starts background monitor, GC, evict-in-advance, LRU update, dump, and replay threads for each cache instance. The `test_file_cache()` and `test_file_cache_memory_storage()` helpers assert exact internal block state/counts while those maintenance threads are live.
- The `ttl_modify` failure is consistent with the code path: `run_background_evict_in_advance()` can call `try_evict_in_advance()`, which removes releasable downloaded blocks through the LRU reservation path and can change the capacity condition that the later `SKIP_CACHE` assertion depends on.
- The `io_error` block-count failure also matches the cleanup rule in `FileBlocksHolder::~FileBlocksHolder()`: an `EMPTY` or deleting block is removed only when `use_count() == 2`; any extra live shared pointer during the assertion can leave the block in cache metadata.
- The async-open timeout is also plausible. Filesystem-backed cache storage starts a background loading thread and sets `_async_open_done` only after the load finishes, while the reported test still uses a short polling loop before asserting readiness.

Current public state:
- PR #64190 is already open for this issue. Its overall direction matches the code path: isolate these tests from background maintenance timing and centralize/increase the async-open wait.
- Two details are worth checking during PR review: the async wait helper should fail or return an assertion result on timeout so tests do not continue silently after the longer wait, and temporary global config changes should be restored with an RAII/scope guard so an early `ASSERT_*` inside a helper cannot leak modified intervals to later tests.

Missing validation:
- Please attach or link a flaky CI job / stress-run output if available, and verify PR #64190 with repeated `BlockFileCacheTest` runs under CPU load.
- At minimum, rerun the directly affected cases (`ttl_modify`, `io_error`, and `evict_privilege_order_for_ttl`) repeatedly in the same build mode where the flake was observed.
- If any flake remains after #64190, the useful follow-up evidence would be the full gtest log around the failed case, the exact BE test command, sanitizer/build type, host load, and the file-cache test temp directory state.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
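
For reference, the two PR-review details (a wait helper that reports timeout, and scoped restoration of temporary config changes) could look roughly like the sketch below. This is not code from PR #64190 or from Doris: `ScopedValue`, `wait_until_done`, and `g_monitor_interval_ms` are hypothetical names chosen for illustration, and the real config variables and helper signatures may differ.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <utility>

// Hypothetical stand-in for a mutable global config value; the real Doris
// config names are not taken from the issue.
static int64_t g_monitor_interval_ms = 5000;

// Sets `target` to a temporary value and restores the previous value when the
// scope exits. The destructor also runs on the early return that a fatal
// gtest ASSERT_* performs inside a helper function.
template <typename T>
class ScopedValue {
public:
    ScopedValue(T& target, T temporary) : _target(target), _saved(target) {
        _target = std::move(temporary);
    }
    ~ScopedValue() { _target = std::move(_saved); }

    ScopedValue(const ScopedValue&) = delete;
    ScopedValue& operator=(const ScopedValue&) = delete;

private:
    T& _target;
    T _saved;
};

// Polls `done` until it becomes true or `timeout` elapses.
// Returns false on timeout so the caller can fail the test explicitly
// instead of continuing as if the cache were ready.
inline bool wait_until_done(const std::atomic<bool>& done,
                            std::chrono::milliseconds timeout,
                            std::chrono::milliseconds poll = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(poll);
    }
    return true;
}
```

In a gtest case, the helper's result would typically be wrapped as `ASSERT_TRUE(wait_until_done(flag, timeout))`, and a `ScopedValue<int64_t> guard(g_monitor_interval_ms, 10);` declared at the top of the test body would keep the shortened interval from leaking into later tests.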
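
The `use_count() == 2` cleanup rule mentioned for the `io_error` case can be illustrated with a small standalone model. This is only a sketch of the reference-counting effect, not the `FileBlocksHolder` implementation: `Block`, `CacheIndex`, and `release_from_holder` are hypothetical, and it assumes the two expected references are the cache metadata entry and the holder itself.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a cache block that is still in the EMPTY state.
struct Block {
    bool empty = true;
};

using BlockPtr = std::shared_ptr<Block>;

// Hypothetical cache metadata: holds one shared reference per block.
struct CacheIndex {
    std::map<std::string, BlockPtr> blocks;
};

// Models the rule: when a holder releases its reference, an EMPTY block is
// erased from the index only if exactly two references exist (assumed here
// to be the index entry and the holder). Returns true if the entry was erased.
bool release_from_holder(CacheIndex& index, const std::string& key, BlockPtr& holder_ref) {
    bool removed = false;
    if (holder_ref && holder_ref->empty && holder_ref.use_count() == 2) {
        index.blocks.erase(key);
        removed = true;
    }
    holder_ref.reset();
    return removed;
}
```

With only the index and the holder referencing a block, the release erases the entry. If a third `shared_ptr` is still alive at that moment, `use_count()` is 3, the entry stays in the index, and an exact block-count assertion made afterwards would see one more block than expected.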
