Quanlong Huang created IMPALA-13177:
---------------------------------------

             Summary: Compress encodedFileDescriptors inside the same partition
                 Key: IMPALA-13177
                 URL: https://issues.apache.org/jira/browse/IMPALA-13177
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang
         Attachments: Selection_124.png

File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. Especially when a table has 
the small files issue, the memory footprint of the metadata cache is dominated 
by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB of memory. 80% of it is used by encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723
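
As a sanity check on the numbers above, here is a small back-of-the-envelope 
sketch. The class and method names are made up for illustration and are not 
part of Impala. It assumes "605MB" means MiB and that every descriptor takes 
exactly 160B. Under those assumptions the byte arrays add up to about 483 MiB, 
i.e. roughly 80% of the table, with about 47 files per partition on average.

```java
// Hypothetical helper for illustration only; not Impala code.
final class FdFootprintEstimate {
  static final long NUM_PARTITIONS = 67708L;
  static final long NUM_FILES = 3167561L;
  static final long BYTES_PER_FD = 160L;    // per-descriptor size from the experiment
  static final double TABLE_SIZE_MIB = 605; // assumption: "MB" above means MiB

  // Total bytes held by all encodedFileDescriptors of the table.
  static long totalFdBytes() {
    return NUM_FILES * BYTES_PER_FD;
  }

  // Fraction of the table's footprint taken by encodedFileDescriptors.
  static double fdShare() {
    double fdMiB = totalFdBytes() / (1024.0 * 1024.0);
    return fdMiB / TABLE_SIZE_MIB;
  }

  // Average number of files (and thus descriptors) per partition.
  static double avgFilesPerPartition() {
    return (double) NUM_FILES / NUM_PARTITIONS;
  }
}
```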

Files of that table are created by Spark jobs. An example file name: 
part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition (see the attached 
Selection_124.png):

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them inside the same table might be even better, but it would hurt performance 
when a coordinator loads specific partitions from catalogd.
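
A minimal sketch of the per-partition idea, not a proposed implementation: 
pack all descriptors of one partition into a single length-prefixed buffer and 
run it through DEFLATE (java.util.zip), so the shared substrings are stored 
once. The class PartitionFdCompressor and its methods are hypothetical. The 
real encodedFileDescriptors are serialized descriptors rather than plain file 
names, and the codec choice is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only; not Impala code.
final class PartitionFdCompressor {
  // Packs all descriptors of one partition into one DEFLATE-compressed blob.
  // Layout before compression: count, then (length, bytes) for each descriptor.
  static byte[] compress(List<byte[]> fds) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out =
        new DataOutputStream(new DeflaterOutputStream(bytes))) {
      out.writeInt(fds.size());
      for (byte[] fd : fds) {
        out.writeInt(fd.length);
        out.write(fd);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Restores the descriptors in their original order.
  static List<byte[]> decompress(byte[] blob) {
    try (DataInputStream in = new DataInputStream(
        new InflaterInputStream(new ByteArrayInputStream(blob)))) {
      int count = in.readInt();
      List<byte[]> fds = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        byte[] fd = new byte[in.readInt()];
        in.readFully(fd);
        fds.add(fd);
      }
      return fds;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The trade-off is that reading any single descriptor requires decompressing 
the whole blob. That is acceptable at partition granularity and is the reason 
a table-wide blob would be costly when only specific partitions are needed.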



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
