[
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-13177:
------------------------------------
Epic Link: IMPALA-13915
> Compress encodedFileDescriptors inside the same partition
> ---------------------------------------------------------
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Labels: catalog-2024
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job
> id, task id, etc. We can compress them to save some memory space. Especially
> in the case of small files issue, the memory footprint of the metadata cache
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each
> encodedFileDescriptor is a byte array that takes 160B. Codes:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. Here are some file names
> inside the same partition:
> {noformat}
> part-00000-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00001-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00002-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00003-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00004-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00005-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00006-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00007-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00008-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00009-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> {noformat}
> By compressing the encodedFileDescriptors inside the same partition, we
> should be able to save a significant memory space in this case. Compressing
> all of them inside the same table might be even better, but it impacts the
> performance when coordinator loading specific partitions from catalogd.
> We can consider only do this for partitions whose number of files exceeds a
> threshold (e.g. 10).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]