xylaaaaa opened a new pull request, #63376:
URL: https://github.com/apache/doris/pull/63376

   ### What problem does this PR solve?
   
   Issue Number: N/A
   
   Related PR: N/A
   
   Problem Summary:
   External Parquet/ORC footer metadata was only cached in BE memory. After 
memory cache eviction or BE restart, repeated reads from remote storage such as 
S3/HDFS had to fetch footer metadata remotely again.
   
   This PR extends `FileMetaCache` from a memory-only cache to a layered memory 
+ disk cache:
   - Parquet/ORC readers first look up the existing in-memory `ObjLRUCache`.
   - On memory miss, `FileMetaCache` looks up the serialized footer payload 
from the existing `BlockFileCache`.
   - On a miss in both layers, the reader fetches footer metadata remotely, 
then `FileMetaCache` writes it to disk and memory.
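
The lookup order above can be sketched as follows. This is a minimal illustration, not the code in this PR: `LayeredMetaCacheSketch`, `get_footer`, and the `Source` enum are hypothetical names, and plain maps stand in for `ObjLRUCache` and `BlockFileCache`. Promoting a disk hit back into memory is an assumption that the description does not state.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the layered cache. The real layers are
// ObjLRUCache (memory) and BlockFileCache (disk); maps only show the order.
struct LayeredMetaCacheSketch {
    enum class Source { kMemory, kDisk, kRemote };

    std::unordered_map<std::string, std::string> memory; // in-memory layer
    std::unordered_map<std::string, std::string> disk;   // serialized footer payloads

    // Returns the footer bytes for `key` and reports which layer served them.
    std::string get_footer(const std::string& key,
                           const std::function<std::string()>& fetch_remote,
                           Source* source) {
        assert(source != nullptr);

        // 1. Memory layer.
        if (auto it = memory.find(key); it != memory.end()) {
            *source = Source::kMemory;
            return it->second;
        }

        // 2. Disk layer. Promotion into memory is an assumption of this sketch.
        if (auto it = disk.find(key); it != disk.end()) {
            *source = Source::kDisk;
            std::string payload = it->second;
            memory[key] = payload;
            return payload;
        }

        // 3. Miss in both layers: fetch remotely, then fill disk and memory.
        *source = Source::kRemote;
        std::string payload = fetch_remote();
        disk[key] = payload;
        memory[key] = payload;
        return payload;
    }
};
```

In this sketch, clearing `memory` models memory eviction or a BE restart: the next call for the same key is served from `disk` and `fetch_remote` is not invoked again.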
   
   Main changes:
   - Add `enable_external_file_meta_disk_cache` and 
`external_file_meta_disk_cache_max_entry_bytes` BE configs.
   - Add unified `FileMetaCache` lookup/insert APIs for memory + persisted 
metadata cache.
   - Persist serialized Parquet/ORC footer payloads through the existing 
`BlockFileCache` INDEX queue.
   - Add a small disk payload header containing magic, version, format, file 
size, mtime, payload size, and checksum.
   - Remove corrupted or stale disk entries on validation failure so the cache 
can self-heal.
   - Add Parquet/ORC reader profile counters split into memory hit, disk hit, 
disk miss, disk write, and disk read/write time.
   - Allow metadata disk cache initialization without enabling data file cache.
   
   This implementation does not add a new META queue, does not add new 
`BlockFileCache` public APIs, and does not keep a separate `FileMetaDiskCache` 
adapter layer.
   
   ### Release note
   
   Add optional disk-backed external file metadata cache for Parquet/ORC 
readers.
   
   ### Check List (For Author)
   
   - Test: Code style check
       - `PATH=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin:$PATH 
build-support/clang-format.sh` passed.
       - `PATH=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin:$PATH 
build-support/check-format.sh` passed.
   - Behavior changed: Yes. When `enable_external_file_meta_disk_cache` is 
true, external Parquet/ORC footer metadata can persist in local file cache and 
be reused after memory eviction or BE restart. The disk layer reuses the 
existing file cache INDEX queue and its capacity/eviction policy.
   - Does this need documentation: Yes. Config documentation should describe 
`enable_external_file_meta_disk_cache`, 
`external_file_meta_disk_cache_max_entry_bytes`, and that persisted file 
metadata reuses the existing INDEX file cache queue.
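
As a starting point for that documentation, the sketch below shows one plausible reading of how the two configs interact. It is an assumption based on the config names alone: the defaults shown, the struct `MetaDiskCacheConfigSketch`, and the helper `should_persist_to_disk` are hypothetical and are not taken from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the two BE configs; the default values are assumptions.
struct MetaDiskCacheConfigSketch {
    bool enable_external_file_meta_disk_cache = false;
    int64_t external_file_meta_disk_cache_max_entry_bytes = 1 << 20;
};

// Assumed semantics: a footer payload is written to the disk layer only when
// the feature is enabled and the payload does not exceed the per-entry limit.
inline bool should_persist_to_disk(const MetaDiskCacheConfigSketch& cfg,
                                   std::size_t payload_bytes) {
    if (!cfg.enable_external_file_meta_disk_cache) return false;
    if (payload_bytes == 0) return false;
    if (cfg.external_file_meta_disk_cache_max_entry_bytes <= 0) return false;
    return payload_bytes <=
           static_cast<std::uint64_t>(cfg.external_file_meta_disk_cache_max_entry_bytes);
}
```

Under this reading, oversized footers would still be cached in memory but would skip the INDEX queue, which limits how much of that queue's capacity a single file's metadata can consume.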
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

