Quanlong Huang created IMPALA-13178:
---------------------------------------

             Summary: Flush the metadata cache to remote storage instead of 
just invalidating it in full GCs
                 Key: IMPALA-13178
                 URL: https://issues.apache.org/jira/browse/IMPALA-13178
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


When invalidate_tables_on_memory_pressure is enabled, catalogd invalidates 10% 
of the tables (configured by invalidate_tables_fraction_on_memory_pressure) if 
the JVM old-gen usage still exceeds 60% (configured by 
invalidate_tables_gc_old_gen_full_threshold) after a full GC.
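
As a rough illustration of the check described above (not the actual catalogd 
code), the decision could be sketched as follows. The class and method names 
are hypothetical, and rounding the table count to the nearest integer is an 
assumption; only the 60% threshold and 10% fraction come from the flags above.

```java
// Hypothetical sketch of the memory-pressure check; not Impala's actual code.
final class MemoryPressureInvalidationSketch {
  /**
   * Returns how many loaded tables to invalidate after a full GC.
   * oldGenFullThreshold models invalidate_tables_gc_old_gen_full_threshold and
   * invalidateFraction models invalidate_tables_fraction_on_memory_pressure.
   */
  static int tablesToInvalidate(long oldGenUsedBytes, long oldGenMaxBytes,
      double oldGenFullThreshold, double invalidateFraction, int loadedTables) {
    if (oldGenMaxBytes <= 0 || loadedTables <= 0) return 0;
    double usage = (double) oldGenUsedBytes / oldGenMaxBytes;
    // Nothing to do if the full GC already brought usage under the threshold.
    if (usage <= oldGenFullThreshold) return 0;
    // Assumption of this sketch: round to the nearest whole table count.
    return (int) Math.round(loadedTables * invalidateFraction);
  }

  public static void main(String[] args) {
    // 70% old-gen usage after a full GC, 250 loaded tables.
    System.out.println(tablesToInvalidate(70L, 100L, 0.6, 0.1, 250));
    // 50% usage is under the threshold, so no table is invalidated.
    System.out.println(tablesToInvalidate(50L, 100L, 0.6, 0.1, 250));
  }
}
```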

If an invalidated table is used again later, catalogd has to reload its 
metadata. The loading process itself can also lead to OOM (see IMPALA-13117).

Moreover, the metadata may not have changed at all, in which case evicting and 
reloading it is wasted work: fetching all the partitions from HMS and listing 
files on the storage are both expensive. It would be better to flush out the 
metadata cache of a table instead of just invalidating it. If no further 
invalidations (either implicit ones from HMS event processing or explicit ones 
from user commands) happen on the table, we can reuse the flushed metadata.
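
One possible way to decide whether flushed metadata is still reusable is to 
keep a per-table invalidation counter and remember its value at flush time. 
This is only a sketch under that assumption: the class, its methods, and the 
example path are hypothetical and do not correspond to existing catalogd code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: tracks whether a flushed snapshot is still valid.
final class FlushedMetadataRegistrySketch {
  private static final class Snapshot {
    final String location;
    final long epochAtFlush;

    Snapshot(String location, long epochAtFlush) {
      this.location = location;
      this.epochAtFlush = epochAtFlush;
    }
  }

  // Number of invalidations seen per table (implicit or explicit).
  private final Map<String, Long> epochs = new HashMap<>();
  private final Map<String, Snapshot> snapshots = new HashMap<>();

  /** Remembers where the metadata was flushed and the epoch at that time. */
  void recordFlush(String table, String location) {
    snapshots.put(table, new Snapshot(location, epochs.getOrDefault(table, 0L)));
  }

  /** Called for HMS-event-driven and user-issued invalidations alike. */
  void recordInvalidation(String table) {
    epochs.merge(table, 1L, Long::sum);
  }

  /** Returns the snapshot location only if no invalidation happened since. */
  Optional<String> reusableSnapshot(String table) {
    Snapshot s = snapshots.get(table);
    if (s == null) return Optional.empty();
    if (s.epochAtFlush != epochs.getOrDefault(table, 0L)) {
      snapshots.remove(table);  // Stale: fall back to a full reload.
      return Optional.empty();
    }
    return Optional.of(s.location);
  }

  public static void main(String[] args) {
    FlushedMetadataRegistrySketch registry = new FlushedMetadataRegistrySketch();
    registry.recordFlush("db.tbl", "hdfs://example/flushed/db.tbl");
    System.out.println(registry.reusableSnapshot("db.tbl").isPresent());
    registry.recordInvalidation("db.tbl");
    System.out.println(registry.reusableSnapshot("db.tbl").isPresent());
  }
}
```

With this scheme a reload first asks for a reusable snapshot and only falls 
back to HMS and file listing when none is returned.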

The metadata can be flushed to remote storage (e.g. HDFS/Ozone/S3), so catalogd 
has effectively unlimited space to use. We can consider flushing just the 
encodedFileDescriptors (the file metadata) and the incremental stats, which 
usually make up the majority of the metadata cache. Alternatively, we can use a 
well-defined format (e.g. Iceberg manifest files) so that the metadata can be 
flushed incrementally even when the catalog changes (DDL/DMLs).
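
To make the first option more concrete, here is a minimal sketch of a 
length-prefixed layout for the file descriptors. It assumes each encoded file 
descriptor can be treated as an opaque byte array; the class name and the 
layout are hypothetical, and an in-memory buffer stands in for the output 
stream of the remote file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a flush format; not an existing Impala format.
final class FileDescriptorFlushSketch {
  /** Layout: descriptor count, then (length, bytes) for each descriptor. */
  static byte[] serialize(List<byte[]> descriptors) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeInt(descriptors.size());
      for (byte[] descriptor : descriptors) {
        out.writeInt(descriptor.length);
        out.write(descriptor);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Reads back what serialize() wrote, rejecting negative sizes. */
  static List<byte[]> deserialize(byte[] flushed) {
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(flushed))) {
      int count = in.readInt();
      if (count < 0) throw new IOException("negative descriptor count");
      List<byte[]> descriptors = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        int length = in.readInt();
        if (length < 0) throw new IOException("negative descriptor length");
        byte[] descriptor = new byte[length];
        in.readFully(descriptor);  // Fails with EOFException if truncated.
        descriptors.add(descriptor);
      }
      return descriptors;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<byte[]> original = new ArrayList<>();
    original.add(new byte[] {1, 2, 3});
    original.add(new byte[0]);
    List<byte[]> restored = deserialize(serialize(original));
    System.out.println(restored.size());
  }
}
```

A fixed binary layout like this has to be rewritten in full whenever the table 
changes, which is the limitation the incremental, manifest-style alternative 
would avoid.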



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
