heguanhui opened a new issue, #64189: URL: https://github.com/apache/doris/issues/64189
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

master

### What's Wrong?

`BlockFileCacheTest` has three types of flaky failures:

---

**Type 1: Background thread interference (`ttl_modify` failure)**

The `test_file_cache()` helper creates a `BlockFileCache` that starts background threads. These threads asynchronously modify cache state between test assertions, causing `file_block->state()` to return EMPTY instead of SKIP_CACHE:

```
[ RUN ] BlockFileCacheTest.ttl_modify
be/test/io/cache/block_file_cache_test.cpp:447: Failure
Expected: file_block->state() == io::FileBlock::State::SKIP_CACHE
Actual: EMPTY == SKIP_CACHE
```

Root cause: the background `evict_in_advance` thread evicts releasable DOWNLOADED blocks. This frees space, so `try_reserve()` unexpectedly succeeds and the state stays EMPTY instead of transitioning to SKIP_CACHE.

---

**Type 2: Background thread interference (`io_error` failure)**

Same root cause as Type 1, but it manifests as an incorrect block count: EMPTY blocks are not removed because background thread references push `use_count` above 2:

```
[ RUN ] BlockFileCacheTest.io_error
be/test/io/cache/block_file_cache_test.cpp:530: Failure
Expected: mgr.get_file_blocks_num(key) == 9
Actual: 10 == 9
```

Root cause: a `FileBlocksHolder` destructor removes an EMPTY block only when `use_count()==2`. If the background thread or another holder still holds a reference (`use_count()>2`), the block is not removed and stays in the queue.

---

**Type 3: Insufficient async open timeout (`evict_privilege_order_for_ttl` failure)**

`initialize()` starts a background disk I/O loading thread that sets `_async_open_done=true` only on completion. Tests wait only 100ms (100 iterations × 1ms), which is insufficient under high CPU load:

```
[ RUN ] BlockFileCacheTest.evict_privilege_order_for_ttl
be/test/io/cache/block_file_cache_test.cpp:6980: Failure
Expected: cache.get_or_set(key1, offset, 100000, context1) to succeed (async open not completed, cache not ready)
```

Root cause: the 100ms total timeout is too short when the system is under load. The background loading thread needs more time to complete disk I/O and set `_async_open_done=true`.

---

### What You Expected?

Tests should be deterministic and not affected by background threads or timing issues.

### How to Reproduce?

Run `BlockFileCacheTest` repeatedly under CPU load. The flaky failures appear intermittently.

### Anything Else?

All three are test defects, not business code defects. The fix:

1. For Types 1 and 2: save and restore the config values, and set all background thread intervals to 10000000ms during `test_file_cache` and `test_file_cache_memory_storage`, so that background threads cannot interfere.
2. For Type 3: extract a `wait_for_async_open()` helper with 1000 iterations × 10ms (10s total timeout) and replace all 91 inline wait loops with it. Compared to the original 1ms interval, the 10ms sleep interval also avoids adding CPU pressure under load.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!
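The save-and-restore step proposed for Types 1 and 2 could be done with a small RAII guard. This is only a sketch, not the code from the actual patch: the `demo_config` namespace, the two interval variables, and `ScopedConfigOverride` are hypothetical names, since the issue does not list the real config entries.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-ins for mutable global config values. The real Doris
// config names are not given in this issue.
namespace demo_config {
int64_t background_monitor_interval_ms = 5000;
int64_t background_evict_interval_ms = 1000;
} // namespace demo_config

// RAII guard: overrides a config value for the current scope and restores the
// previous value on destruction, even if the test body returns early.
template <typename T>
class ScopedConfigOverride {
public:
    ScopedConfigOverride(T& target, T temporary)
            : _target(target), _saved(std::exchange(target, std::move(temporary))) {}
    ~ScopedConfigOverride() { _target = std::move(_saved); }

    ScopedConfigOverride(const ScopedConfigOverride&) = delete;
    ScopedConfigOverride& operator=(const ScopedConfigOverride&) = delete;

private:
    T& _target;
    T _saved;
};

// Interval from the issue text: large enough that a background thread never
// wakes up during a single test.
constexpr int64_t kEffectivelyNeverMs = 10000000;

// Returns the interval a background thread would observe while the guard is
// alive. The guard restores the original value when this function returns.
int64_t interval_seen_inside_test() {
    ScopedConfigOverride<int64_t> monitor_guard(demo_config::background_monitor_interval_ms,
                                                kEffectivelyNeverMs);
    ScopedConfigOverride<int64_t> evict_guard(demo_config::background_evict_interval_ms,
                                              kEffectivelyNeverMs);
    return demo_config::background_evict_interval_ms;
}
```

Putting the restore in a destructor means a failed assertion that exits the helper early still leaves the config as it was, so the override cannot leak into later tests.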
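The `wait_for_async_open()` helper proposed for Type 3 might look roughly like the bounded polling loop below. The signature is an assumption: the issue only gives the name and the 1000 × 10ms budget, so this sketch takes a generic predicate rather than a `BlockFileCache` reference, and the `demo_*` functions are purely illustrative.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Polls `is_done` until it returns true or the iteration budget is used up.
// Defaults follow the issue: 1000 iterations x 10 ms = 10 s upper bound.
// Taking a predicate (instead of a cache object) is an assumption of this sketch.
bool wait_for_async_open(const std::function<bool()>& is_done, int max_iterations = 1000,
                         std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    for (int i = 0; i < max_iterations; ++i) {
        if (is_done()) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    }
    return is_done(); // one last check after the final sleep
}

// Simulated loader that reports completion on the fifth poll.
bool demo_wait_succeeds() {
    int polls = 0;
    return wait_for_async_open([&polls] { return ++polls >= 5; }, 1000,
                               std::chrono::milliseconds(1));
}

// A loader that never finishes: the helper gives up after its budget.
bool demo_wait_times_out() {
    return wait_for_async_open([] { return false; }, 3, std::chrono::milliseconds(1));
}
```

In a test, the return value can be asserted directly (for example with `ASSERT_TRUE`), so a timeout fails at the wait itself rather than at a later `get_or_set` call.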
