xylaaaaa opened a new pull request, #63376:
URL: https://github.com/apache/doris/pull/63376
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary:
External Parquet/ORC footer metadata was cached only in BE memory. After a
memory cache eviction or a BE restart, repeated reads of files on remote storage
such as S3/HDFS had to fetch the footer metadata remotely again. This PR adds a
disk-backed L2 cache for external file metadata by reusing the file cache META
queue, and wires it into the Parquet/ORC readers with separate profile counters
for memory hits and disk hits.
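
For reviewers, the intended lookup order (memory, then disk, then remote) can be sketched as below. This is a minimal illustration, not the code in this PR: `TwoLevelFooterCache`, `FooterCacheStats`, and the counter fields are hypothetical names, both tiers are plain in-memory maps, and promoting a disk hit back into memory is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical counters, loosely mirroring the memory/disk hit and
// disk miss/write split described above. Names are illustrative only.
struct FooterCacheStats {
    int64_t hit_memory = 0;
    int64_t hit_disk = 0;
    int64_t miss_disk = 0;
    int64_t write_disk = 0;
    int64_t remote_reads = 0;
};

// Illustrative two-level footer cache. Both tiers are plain maps here;
// in the PR the L2 tier is backed by the file cache META queue.
class TwoLevelFooterCache {
public:
    using RemoteReader = std::function<std::string(const std::string&)>;

    explicit TwoLevelFooterCache(RemoteReader reader) : _reader(std::move(reader)) {}

    std::string get(const std::string& key) {
        // L1: memory cache.
        if (auto it = _memory.find(key); it != _memory.end()) {
            ++stats.hit_memory;
            return it->second;
        }
        // L2: disk cache. Promotion into L1 is an assumption of this sketch.
        if (auto it = _disk.find(key); it != _disk.end()) {
            ++stats.hit_disk;
            _memory[key] = it->second;
            return it->second;
        }
        // Miss in both tiers: read remotely, then populate L2 and L1.
        ++stats.miss_disk;
        std::string footer = _reader(key);
        ++stats.remote_reads;
        _disk[key] = footer;
        ++stats.write_disk;
        _memory[key] = footer;
        return footer;
    }

    // Simulates a memory eviction or a BE restart: L1 is lost, L2 survives.
    void drop_memory() { _memory.clear(); }

    FooterCacheStats stats;

private:
    RemoteReader _reader;
    std::unordered_map<std::string, std::string> _memory;
    std::unordered_map<std::string, std::string> _disk;
};
```

In this sketch a cold `get` performs one remote read and one disk write, a second `get` hits memory, and a `get` after `drop_memory()` hits disk without another remote read, which matches the sequence reported in the manual smoke tests.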
Main changes:
- Add `enable_external_file_meta_disk_cache` and
`external_file_meta_disk_cache_percent` BE configs.
- Add `FileMetaDiskCache` serialization and validation for Parquet/ORC
footer payloads.
- Add a META file-cache queue and eviction behavior that prefers evicting
data cache before metadata cache.
- Add Parquet/ORC reader profile counters split into memory/disk hit and
disk miss/write counters.
- Allow metadata disk cache initialization without enabling data file cache.
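
To illustrate what "serialization and validation" of a cached footer payload can involve, here is a hedged sketch of a framed blob with a magic number, a format tag, a length, and a checksum. The layout, the `kMagic` value, the function names, and the use of FNV-1a are all assumptions made for illustration; the actual `FileMetaDiskCache` on-disk format is defined by the PR diff, not by this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Hypothetical format tag; values are arbitrary.
enum class FooterFormat : uint8_t { PARQUET = 1, ORC = 2 };

constexpr uint32_t kMagic = 0x4D455441;      // arbitrary marker for this sketch
constexpr size_t kHeaderSize = 4 + 1 + 4 + 4; // magic, format, length, checksum

// 32-bit FNV-1a, used here only as a simple stand-in checksum.
inline uint32_t fnv1a(const std::string& data) {
    uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Fields are written in host byte order, which is enough for a local cache sketch.
inline void append_u32(std::string& out, uint32_t v) {
    char buf[sizeof(v)];
    std::memcpy(buf, &v, sizeof(v));
    out.append(buf, sizeof(v));
}

inline uint32_t read_u32(const std::string& in, size_t offset) {
    uint32_t v = 0;
    std::memcpy(&v, in.data() + offset, sizeof(v));
    return v;
}

inline std::string serialize_footer(FooterFormat format, const std::string& footer) {
    std::string out;
    out.reserve(kHeaderSize + footer.size());
    append_u32(out, kMagic);
    out.push_back(static_cast<char>(format));
    append_u32(out, static_cast<uint32_t>(footer.size()));
    append_u32(out, fnv1a(footer));
    out.append(footer);
    return out;
}

// Returns the footer bytes, or std::nullopt if any validation step fails,
// in which case a caller would fall back to reading the footer remotely.
inline std::optional<std::string> deserialize_footer(FooterFormat expected,
                                                     const std::string& blob) {
    if (blob.size() < kHeaderSize) return std::nullopt;
    if (read_u32(blob, 0) != kMagic) return std::nullopt;
    if (static_cast<uint8_t>(blob[4]) != static_cast<uint8_t>(expected)) return std::nullopt;
    if (blob.size() - kHeaderSize != read_u32(blob, 5)) return std::nullopt;
    std::string footer = blob.substr(kHeaderSize);
    if (fnv1a(footer) != read_u32(blob, 9)) return std::nullopt;
    return footer;
}
```

The point of the validation path is that a truncated, corrupted, or mismatched blob is treated as a cache miss rather than handed to the reader.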
### Release note
Add optional disk-backed external file metadata cache for Parquet/ORC
readers.
### Check List (For Author)
- Test: Unit Test / Manual test
- `PATH=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin:$PATH
CLANG_FORMAT_BINARY=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin/clang-format
build-support/check-format.sh` passed.
- `git diff --check origin/master...HEAD` passed.
- Added BE UT coverage for file meta disk cache read/write, META queue
eviction priority, read-only cache lookup, cache path meta percent parsing, and
metadata cache initialization when `enable_file_cache=false`.
- Attempted `./run-be-ut.sh --run
--filter='FileMetaCacheTest.*:FileMetaDiskCacheTest.*:BlockFileCacheTest.file_cache_path_storage_parse:BlockFileCacheTest.file_meta_disk_cache_initializes_without_data_file_cache:BlockFileCacheTest.meta_queue_can_evict_data_cache_first:BlockFileCacheTest.normal_queue_does_not_evict_meta_cache:BlockFileCacheTest.read_if_cached_returns_downloaded_meta_block'
-j 8`; the local worktree failed before any test ran because
`thirdparty/installed/lib64/libaws-cpp-sdk-kinesis.a`, which current master's
BE UT link target requires, is missing.
- Manual Parquet local TVF smoke test with `enable_file_cache=false` and
`enable_external_file_meta_disk_cache=true`: the cold query wrote the disk cache,
a same-process hot query hit the memory cache, and a hot query after a BE restart
hit the disk cache (`FileFooterHitDiskCache=1`, `FileFooterReadCalls=0`).
- Manual ORC local TVF smoke test with the same config: the cold query wrote
the disk cache, a same-process hot query hit the memory cache, and a hot query
after a BE restart hit the disk cache (`FileFooterHitDiskCache=1`,
`FileFooterReadCalls=0`).
- Behavior changed: Yes. When `enable_external_file_meta_disk_cache` is
true, external Parquet/ORC footer metadata can persist in local file cache and
be reused after memory eviction or BE restart; this can be enabled
independently from data file cache.
- Does this need documentation: Yes. Config documentation should describe
`enable_external_file_meta_disk_cache` and
`external_file_meta_disk_cache_percent`.
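
As a reading aid for the eviction behavior covered by the `meta_queue_can_evict_data_cache_first` and `normal_queue_does_not_evict_meta_cache` tests, the priority rule can be sketched as follows. This is a simplified model under stated assumptions: `TinyFileCache` is a hypothetical class with only two FIFO queues and a single shared byte capacity, whereas the real file cache has more queue types and per-queue limits.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

enum class QueueType { NORMAL, META };

struct Block {
    std::string key;
    size_t size;
};

// Hypothetical two-queue cache sharing one byte capacity.
class TinyFileCache {
public:
    explicit TinyFileCache(size_t capacity) : _capacity(capacity) {}

    // Returns true if the block was admitted, false if space could not be freed.
    bool insert(QueueType type, const std::string& key, size_t size) {
        // A data insert may only reclaim data blocks; a metadata insert may
        // reclaim data blocks and, after those, older metadata blocks.
        size_t free_bytes = _capacity - used();
        size_t reclaimable = _normal.bytes + (type == QueueType::META ? _meta.bytes : 0);
        if (size > free_bytes + reclaimable) return false;

        // Evict oldest data blocks first; metadata is touched only when no
        // data block is left (reachable only for META inserts, by the check above).
        while (_capacity - used() < size) {
            Lru& victim = !_normal.blocks.empty() ? _normal : _meta;
            victim.bytes -= victim.blocks.front().size;
            victim.blocks.pop_front();
        }

        Lru& target = (type == QueueType::META) ? _meta : _normal;
        target.blocks.push_back({key, size});
        target.bytes += size;
        return true;
    }

    size_t block_count(QueueType type) const {
        return type == QueueType::META ? _meta.blocks.size() : _normal.blocks.size();
    }

    size_t used() const { return _normal.bytes + _meta.bytes; }

private:
    struct Lru {
        std::deque<Block> blocks;
        size_t bytes = 0;
    };

    size_t _capacity;
    Lru _normal;
    Lru _meta;
};
```

With this rule, small footer blocks are not pushed out by large data blocks, while metadata inserts can still make room for themselves when the cache is full of data.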