heguanhui opened a new pull request, #64190:
URL: https://github.com/apache/doris/pull/64190

   ### What problem does this PR solve?
   
   Issue Number: close #64189
   
   Problem Summary:
   
   `BlockFileCacheTest` has two types of flaky failures. Both are test defects, 
not defects in production code.
   
   **Type 1: Background thread interference** (affects `ttl_modify`, 
`io_error`, and all tests calling `test_file_cache`)
   
   The `test_file_cache()` helper creates a `BlockFileCache` that starts 
background threads (evict_in_advance at 1000ms, gc at 100ms, block_lru_update 
at 5000ms, monitor at 5000ms). These threads asynchronously modify cache state 
between test assertions, causing:
   
   - `ttl_modify` failure: `file_block->state()` returns EMPTY instead of 
SKIP_CACHE. The background evict_in_advance thread evicts releasable DOWNLOADED 
blocks, freeing space so that `try_reserve()` unexpectedly succeeds, keeping 
the state as EMPTY instead of transitioning to SKIP_CACHE.
     ```
     be/test/io/cache/block_file_cache_test.cpp:447: Failure
     Expected: file_block->state() == io::FileBlock::State::SKIP_CACHE
     Actual:    EMPTY == SKIP_CACHE
     ```
   
   - `io_error` failure: `mgr.get_file_blocks_num()` returns 10 instead of 9. 
A `FileBlocksHolder` destructor removes an EMPTY block only when 
`use_count()==2`. If the background thread or another holder still holds a 
reference (`use_count()>2`), the removal is skipped and the block stays in the 
queue.
     ```
     be/test/io/cache/block_file_cache_test.cpp:530: Failure
     Expected: mgr.get_file_blocks_num(key) == 9
     Actual:    10 == 9
     ```
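
   The `use_count()` condition behind the `io_error` failure can be illustrated 
with a toy model. This is only a sketch: `ToyCache`, `ToyHolder`, `Block`, and 
`blocks_left_after_holder` are hypothetical names invented for this example, 
and the real `FileBlocksHolder` destructor may differ in detail. It assumes the 
two expected owners are the cache's own map and the holder being destroyed.

```cpp
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical toy types; they only model the ownership-count check.
enum class State { EMPTY, DOWNLOADED };
struct Block {
    State state = State::EMPTY;
};
using BlockPtr = std::shared_ptr<Block>;

struct ToyCache {
    std::map<int, BlockPtr> blocks; // offset -> block; the map owns one reference
    std::size_t size() const { return blocks.size(); }
};

class ToyHolder {
public:
    ToyHolder(ToyCache& cache, int offset)
            : _cache(cache), _offset(offset), _block(cache.blocks.at(offset)) {}

    ~ToyHolder() {
        // Remove the EMPTY block only if the owners are exactly the cache map
        // and this holder. Any third reference makes the removal a no-op.
        if (_block->state == State::EMPTY && _block.use_count() == 2) {
            _cache.blocks.erase(_offset);
        }
    }

private:
    ToyCache& _cache;
    int _offset;
    BlockPtr _block;
};

// Returns how many blocks remain after one holder is created and destroyed.
std::size_t blocks_left_after_holder(bool extra_reference) {
    ToyCache cache;
    cache.blocks[0] = std::make_shared<Block>();
    BlockPtr extra; // stands in for a background thread or a second holder
    if (extra_reference) {
        extra = cache.blocks.at(0);
    }
    {
        ToyHolder holder(cache, 0);
    } // holder destructor runs here
    return cache.size();
}
```

   In this model, a single extra reference that happens to be alive when the 
holder is destroyed is enough to leave one additional block behind, which 
matches the off-by-one count in the failure above.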
   
   Fix (Commit 1): Save the config values, set all background thread 
intervals to 10000000ms for the duration of `test_file_cache` and 
`test_file_cache_memory_storage`, and restore the saved values afterwards.
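
   One way to save and restore values like this is a small scope guard. The 
sketch below is an illustration, not the code in this PR: `ScopedOverride`, the 
`demo_config` namespace, its variable names, and `kEffectivelyNever` are all 
hypothetical stand-ins. Only the interval values come from the description 
above.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the real config variables (names are assumed).
namespace demo_config {
std::int64_t evict_in_advance_interval_ms = 1000;
std::int64_t gc_interval_ms = 100;
std::int64_t block_lru_update_interval_ms = 5000;
std::int64_t monitor_interval_ms = 5000;
} // namespace demo_config

// Saves the current value, applies an override, and restores the saved value
// when the guard goes out of scope, including on early return.
template <typename T>
class ScopedOverride {
public:
    ScopedOverride(T& target, T value) : _target(target), _saved(target) {
        _target = value;
    }
    ~ScopedOverride() { _target = _saved; }

    ScopedOverride(const ScopedOverride&) = delete;
    ScopedOverride& operator=(const ScopedOverride&) = delete;

private:
    T& _target;
    T _saved;
};

// Long enough that a background thread never wakes up during one test.
constexpr std::int64_t kEffectivelyNever = 10000000;

// Reads the gc interval while the override is active.
std::int64_t gc_interval_inside_scope() {
    ScopedOverride<std::int64_t> guard(demo_config::gc_interval_ms,
                                       kEffectivelyNever);
    return demo_config::gc_interval_ms;
}
```

   A guard of this kind restores the original value even when an assertion 
returns early from the helper, so later tests see the default intervals.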
   
   **Type 2: Insufficient async open timeout** (affects 
`evict_privilege_order_for_ttl` and all 91 tests using `get_async_open_success`)
   
   `initialize()` starts a background disk I/O loading thread that sets 
`_async_open_done=true` only on completion. Tests wait only 100ms (100 
iterations × 1ms), which is insufficient under high CPU load:
   ```
   be/test/io/cache/block_file_cache_test.cpp:6980: Failure
   (async open not completed, cache not ready for get_or_set operations)
   ```
   
   Fix (Commit 2): Extract a `wait_for_async_open()` helper that polls for up 
to 1000 iterations × 10ms (10s total timeout), and replace all 91 inline wait 
loops with it. Compared to the original 1ms interval, the 10ms sleep interval 
also avoids adding CPU pressure under load.
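
   A minimal sketch of such a polling helper follows. The signature, the 
default arguments, and the `FakeCache` type are assumptions made for this 
example. Only the name `wait_for_async_open`, the `get_async_open_success` 
accessor, and the 1000 × 10ms budget come from the description above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the cache; only the readiness flag is modelled.
struct FakeCache {
    std::atomic<bool> async_open_done{false};
    bool get_async_open_success() const { return async_open_done.load(); }
};

// Polls the cache until async open has finished or the budget of
// max_iterations * interval is used up. Returns true if the cache is ready.
template <typename Cache>
bool wait_for_async_open(
        const Cache& cache, int max_iterations = 1000,
        std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    for (int i = 0; i < max_iterations; ++i) {
        if (cache.get_async_open_success()) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    }
    // One last check after the final sleep.
    return cache.get_async_open_success();
}
```

   Returning a `bool` lets each test assert on the result, so a timeout shows 
up as a clear failure at the wait rather than as a later error in a 
`get_or_set` call.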
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test: Unit Test (BlockFileCacheTest)
   - Behavior changed: No
   - Does this need documentation: No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

