xylaaaaa opened a new pull request, #63376:
URL: https://github.com/apache/doris/pull/63376

   ### What problem does this PR solve?
   
   Issue Number: N/A
   
   Related PR: N/A
   
   Problem Summary:
   External Parquet/ORC footer metadata was only cached in BE memory. After 
memory cache eviction or BE restart, repeated reads from remote storage such as 
S3/HDFS had to fetch footer metadata remotely again. This PR adds a disk-backed 
L2 cache for external file metadata by reusing the file cache META queue, and 
wires it into Parquet/ORC readers with separate profile counters for memory and 
disk hits.
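
The lookup order described above can be sketched as a small model. This is a hypothetical illustration, not the Doris BE implementation: the class name `FooterCacheSketch`, its counter names, and the `fetch_remote` callback are all assumptions, and plain dictionaries stand in for the in-memory cache and the file cache META queue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FooterCacheSketch:
    """Hypothetical two-tier footer cache: memory (L1) in front of disk (L2)."""

    # Stand-in for a remote footer read from S3/HDFS (assumed callback).
    fetch_remote: Callable[[str], bytes]
    # Stand-in for the BE in-memory metadata cache.
    memory: Dict[str, bytes] = field(default_factory=dict)
    # Stand-in for the disk-backed META queue; survives simulate_restart().
    disk: Dict[str, bytes] = field(default_factory=dict)
    # Illustrative counters, loosely modelled on the split profile counters.
    counters: Dict[str, int] = field(default_factory=lambda: {
        "hit_memory": 0,
        "hit_disk": 0,
        "miss_disk": 0,
        "write_disk": 0,
        "remote_reads": 0,
    })

    def get_footer(self, path: str) -> bytes:
        # 1. Memory hit: nothing else is touched.
        if path in self.memory:
            self.counters["hit_memory"] += 1
            return self.memory[path]
        # 2. Disk hit: reuse the persisted footer, no remote read.
        if path in self.disk:
            self.counters["hit_disk"] += 1
            footer = self.disk[path]
        else:
            # 3. Disk miss: read remotely, then persist for later lookups.
            self.counters["miss_disk"] += 1
            footer = self.fetch_remote(path)
            self.counters["remote_reads"] += 1
            self.disk[path] = footer
            self.counters["write_disk"] += 1
        # Promote into memory so the next same-process read is a memory hit.
        self.memory[path] = footer
        return footer

    def simulate_restart(self) -> None:
        # A restart (or memory eviction) drops only the in-memory tier.
        self.memory.clear()
```

In this model, a cold read followed by a hot read, a `simulate_restart()`, and a third read produces one remote read, one disk write, one memory hit, and one disk hit, which mirrors the cold / hot / post-restart sequence this PR targets.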
   
   Main changes:
   - Add `enable_external_file_meta_disk_cache` and 
`external_file_meta_disk_cache_percent` BE configs.
   - Add `FileMetaDiskCache` serialization and validation for Parquet/ORC 
footer payloads.
   - Add a META file-cache queue and eviction behavior that prefers evicting 
data cache before metadata cache.
   - Add Parquet/ORC reader profile counters split into memory/disk hit and 
disk miss/write counters.
   - Allow metadata disk cache initialization without enabling data file cache.
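
The eviction preference in the list above can be illustrated with a toy model. It is a sketch under stated assumptions rather than the actual `BlockFileCache` logic: the class `QueueSketch`, the two queue names, the shared byte budget, and the oldest-first victim order are all hypothetical. It only demonstrates the two properties named by the unit tests: a metadata insert may evict data blocks first, and a data insert never evicts metadata blocks.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple


class QueueSketch:
    """Hypothetical cache with a data queue and a META queue sharing one budget."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Insertion order doubles as age: the first entry is the oldest.
        self.queues: Dict[str, "OrderedDict[str, int]"] = {
            "data": OrderedDict(),
            "meta": OrderedDict(),
        }

    def used(self) -> int:
        return sum(size for q in self.queues.values() for size in q.values())

    def insert(self, kind: str, key: str, size: int) -> bool:
        """Insert a new key; returns False (and evicts nothing) if it cannot fit."""
        # A meta insert may reclaim from data first, then from meta itself.
        # A data insert may reclaim only from data.
        victim_order = ("data", "meta") if kind == "meta" else ("data",)
        need = self.used() + size - self.capacity
        victims: List[Tuple[str, str]] = []
        for name in victim_order:
            for victim_key, victim_size in self.queues[name].items():
                if need <= 0:
                    break
                victims.append((name, victim_key))
                need -= victim_size
        if need > 0:
            # Not enough reclaimable space: reject without evicting anything.
            return False
        for name, victim_key in victims:
            del self.queues[name][victim_key]
        self.queues[kind][key] = size
        return True
```

Choosing victims before deleting anything keeps a rejected insert from leaving the cache partially evicted; whether the real implementation behaves this way is not stated in this PR description.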
   
   ### Release note
   
   Add optional disk-backed external file metadata cache for Parquet/ORC 
readers.
   
   ### Check List (For Author)
   
   - Test: Unit Test / Manual test
       - `PATH=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin:$PATH CLANG_FORMAT_BINARY=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin/clang-format build-support/check-format.sh` passed.
       - `git diff --check origin/master...HEAD` passed.
       - Added BE UT coverage for file meta disk cache read/write, META queue 
eviction priority, read-only cache lookup, cache path meta percent parsing, and 
metadata cache initialization when `enable_file_cache=false`.
       - Attempted `./run-be-ut.sh --run --filter='FileMetaCacheTest.*:FileMetaDiskCacheTest.*:BlockFileCacheTest.file_cache_path_storage_parse:BlockFileCacheTest.file_meta_disk_cache_initializes_without_data_file_cache:BlockFileCacheTest.meta_queue_can_evict_data_cache_first:BlockFileCacheTest.normal_queue_does_not_evict_meta_cache:BlockFileCacheTest.read_if_cached_returns_downloaded_meta_block' -j 8`; the run failed in the local worktree before any test executed, because `thirdparty/installed/lib64/libaws-cpp-sdk-kinesis.a`, which current master's BE UT link target requires, is missing.
       - Manual Parquet local TVF smoke test with `enable_file_cache=false` and `enable_external_file_meta_disk_cache=true`: the cold query wrote the disk cache, the hot query in the same process hit the memory cache, and the hot query after a BE restart hit the disk cache (`FileFooterHitDiskCache=1`, `FileFooterReadCalls=0`).
       - Manual ORC local TVF smoke test with the same config: the cold query wrote the disk cache, the hot query in the same process hit the memory cache, and the hot query after a BE restart hit the disk cache (`FileFooterHitDiskCache=1`, `FileFooterReadCalls=0`).
   - Behavior changed: Yes. When `enable_external_file_meta_disk_cache` is 
true, external Parquet/ORC footer metadata can persist in local file cache and 
be reused after memory eviction or BE restart; this can be enabled 
independently from data file cache.
   - Does this need documentation: Yes. Config documentation should describe 
`enable_external_file_meta_disk_cache` and 
`external_file_meta_disk_cache_percent`.
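
For reference, the configuration used in the manual smoke tests above would look roughly like the fragment below, assuming the usual `key = value` layout of `be.conf`. The value of `external_file_meta_disk_cache_percent` is not stated in this description, so it is left unset here.

```properties
# Data file cache disabled; only the metadata disk cache is turned on.
enable_file_cache = false
enable_external_file_meta_disk_cache = true
# external_file_meta_disk_cache_percent: value not given in this PR description
```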
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

