Quanlong Huang created IMPALA-13177:
---------------------------------------
Summary: Compress encodedFileDescriptors inside the same partition
Key: IMPALA-13177
URL: https://issues.apache.org/jira/browse/IMPALA-13177
Project: IMPALA
Issue Type: Improvement
Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang
Attachments: Selection_124.png
File names under a table usually share some substrings, e.g. query id, job id,
task id, etc. We can compress them to save memory. This matters most for
tables with many small files, where encodedFileDescriptors dominate the memory
footprint of the metadata cache.
An experiment shows that an HdfsTable with 67708 partitions and 3167561 files
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each
encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723
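As a rough cross-check of the figures above, the arithmetic can be sketched as
follows. It assumes "MB" means MiB and that 160B is the full per-descriptor
cost; the class and method names are hypothetical and not part of Impala.

```java
// Back-of-the-envelope check of the numbers reported in this issue.
// Assumptions: "MB" is read as MiB, and 160 bytes is the whole cost of
// one encodedFileDescriptor. Names here are illustrative only.
final class FdFootprintEstimate {
  static final long NUM_PARTITIONS = 67_708L;
  static final long NUM_FILES = 3_167_561L;
  static final long BYTES_PER_DESCRIPTOR = 160L;
  static final double TABLE_FOOTPRINT_MIB = 605.0;

  // Total size of all descriptors in MiB.
  static double descriptorMiB() {
    return NUM_FILES * BYTES_PER_DESCRIPTOR / (1024.0 * 1024.0);
  }

  // Fraction of the table's footprint taken by descriptors.
  static double descriptorShare() {
    return descriptorMiB() / TABLE_FOOTPRINT_MIB;
  }

  // Average number of files (descriptors) per partition.
  static double filesPerPartition() {
    return (double) NUM_FILES / NUM_PARTITIONS;
  }

  public static void main(String[] args) {
    System.out.printf(
        "descriptors: %.1f MiB (%.1f%% of table), %.1f files per partition%n",
        descriptorMiB(), descriptorShare() * 100.0, filesPerPartition());
  }
}
```

Under these assumptions the descriptors come to about 483 MiB, which is
consistent with the reported 80% share, and each partition holds roughly 47
files on average.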
Files of that table are created by Spark jobs. An example file name:
part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
By compressing the encodedFileDescriptors inside the same partition, we should
be able to save a significant amount of memory in this case. Compressing all of
them across the whole table might save even more, but it would hurt performance
when a coordinator loads specific partitions from catalogd.
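A minimal sketch of the per-partition idea, not a proposed implementation: pack
all descriptors of one partition into a single length-prefixed buffer and
deflate it with java.util.zip. The class name, the method names, and the use of
Deflater are assumptions for illustration. Plain file-name bytes stand in for
the real descriptor encoding, and the sample names vary only the part number of
the example file name above, which is also an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Impala.
final class PartitionFdCompressionSketch {

  // Packs descriptors as [4-byte length][bytes]... and deflates the buffer.
  static byte[] compress(List<byte[]> descriptors) {
    int total = 0;
    for (byte[] d : descriptors) total += 4 + d.length;
    ByteBuffer packed = ByteBuffer.allocate(total);
    for (byte[] d : descriptors) packed.putInt(d.length).put(d);

    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(packed.array());
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  // Inflates the buffer and splits it back into individual descriptors.
  static List<byte[]> decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && inflater.needsInput()) {
          throw new IllegalStateException("truncated compressed data");
        }
        out.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt compressed data", e);
    } finally {
      inflater.end();
    }
    ByteBuffer packed = ByteBuffer.wrap(out.toByteArray());
    List<byte[]> result = new ArrayList<>();
    while (packed.hasRemaining()) {
      byte[] d = new byte[packed.getInt()];
      packed.get(d);
      result.add(d);
    }
    return result;
  }

  // Stand-in descriptors: Spark-style names that differ only in part number.
  static List<byte[]> sampleDescriptors(int count) {
    List<byte[]> result = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      String name = String.format(
          "part-%05d-f7e5265d-5a63-4477-8954-ac6cbaef553b-"
              + "face6153-588c-4b44-a277-2836396bc57a.c000", i);
      result.add(name.getBytes(StandardCharsets.UTF_8));
    }
    return result;
  }

  public static void main(String[] args) {
    List<byte[]> fds = sampleDescriptors(47);
    int rawBytes = 0;
    for (byte[] d : fds) rawBytes += d.length;
    byte[] blob = compress(fds);
    System.out.printf("raw=%d bytes, compressed=%d bytes%n", rawBytes, blob.length);
  }
}
```

The trade-off in this sketch is that reading any single descriptor requires
inflating the whole partition's buffer, which is the same cost that would grow
with table-level compression when only a few partitions are requested.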
--
This message was sent by Atlassian Jira
(v8.20.10#820010)