Quanlong Huang created IMPALA-13178:
---------------------------------------
Summary: Flush the metadata cache to remote storage instead of
just invalidating them in full GCs
Key: IMPALA-13178
URL: https://issues.apache.org/jira/browse/IMPALA-13178
Project: IMPALA
Issue Type: Improvement
Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang
When invalidate_tables_on_memory_pressure is enabled, catalogd will invalidate
10% (configured by invalidate_tables_fraction_on_memory_pressure) of the tables
if the old gen usage of JVM still exceeds 60% (configured by
invalidate_tables_gc_old_gen_full_threshold) after a full GC.
Later if the table is used again, catalogd will try to load its metadata. The
loading process could also lead to OOM (see IMPALA-13117).
On the other hand, the metadata might have no changes so it's a waste to evict
and reload them again. Fetching all the partitions from HMS and file listing on
the storage are expensive. It'd be better to flush out the metadata cache of a
table instead of just invalidating it. If there are no more invalidates (either
implicit ones from HMS event processing or explicit ones from user commands) on
the table, we can reuse the flushed metadata.
They can be flushed to the remote storage (e.g. HDFS/Ozone/S3) so catalogd has
unlimited space to use. We can consider just flushing out the
encodedFileDescriptors (the file metadata) and incremental stats which are
usually the majority of the metadata cache. Or use a well-defined format (e.g.
Iceberg manifest files) so we can incrementally flush the metadata even with
catalog changes (DDL/DMLs).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)