[
https://issues.apache.org/jira/browse/IMPALA-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043083#comment-18043083
]
Noémi Pap-Takács edited comment on IMPALA-14583 at 12/5/25 3:07 PM:
--------------------------------------------------------------------
The array that hits the limit stores the file descriptors of the table, so
two factors contribute to its size: the number of files and the size of the
individual file descriptors. The file sizes and the total amount of data
stored in the table are not relevant here.
To reproduce the issue, I would create a test table with lots of files that
each contain a single row, e.g. by running CTAS (multiple times) from one of
our test tables, such as tpch_lineitem, partitioned by a column with high NDV.
(edit: or better, partition the table by multiple columns; the longer
partition info adds to the size of each file descriptor.) A rough sketch is
below.
It is hard to give an exact file count, since the size of the file
descriptors also matters (it depends on many things, e.g. the partition
information and block location info stored in the FD). Around 1-2 million
files could trigger the error.
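Something along these lines (an untested sketch; the table name, column
choices, and file format here are just one option, any wide, high-NDV
partitioning over a large test table should behave similarly):
{code:sql}
-- Hypothetical repro sketch. Partitioning by two high-NDV columns yields one
-- tiny file per partition and lengthens the partition info carried in each
-- file descriptor.
CREATE TABLE fd_limit_repro
PARTITIONED BY (l_orderkey, l_linenumber)
STORED AS PARQUET
AS SELECT l_quantity, l_extendedprice, l_orderkey, l_linenumber
FROM tpch_lineitem;

-- Repeat to multiply the file count: INSERT INTO appends new files to the
-- existing partitions rather than overwriting them.
INSERT INTO fd_limit_repro PARTITION (l_orderkey, l_linenumber)
SELECT l_quantity, l_extendedprice, l_orderkey, l_linenumber
FROM tpch_lineitem;
{code}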
> Limit the number of file descriptors per RPC to avoid JVM OOM in Catalog
> ------------------------------------------------------------------------
>
> Key: IMPALA-14583
> URL: https://issues.apache.org/jira/browse/IMPALA-14583
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog, Frontend
> Reporter: Noémi Pap-Takács
> Assignee: Mihaly Szjatinya
> Priority: Critical
> Labels: Catalog, OOM, impala-iceberg
>
> We often get an OOM error when Impala tries to load a very large Iceberg
> table. This happens because the Catalog loads all the file descriptors and
> sends them to the Coordinator in one RPC, serializing all of them into one
> big byte array. However, the JVM has a limit on array length, so trying to
> send the entire table in one call can exceed this limit if the table
> contains too many files.
> We could limit the number of files per RPC, so that the 2 GB JVM array
> limit is not exceeded.
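A back-of-the-envelope check of that limit, assuming on the order of 1-2 KB
per serialized file descriptor (the per-descriptor size is an assumption; it
grows with partition and block location info): a Java byte[] is capped at
Integer.MAX_VALUE = 2^31 - 1, about 2.1 * 10^9 bytes, so the cap is reached at
roughly (2^31 - 1) / 2048, about 1 million, to (2^31 - 1) / 1024, about
2 million descriptors. This is consistent with the 1-2 million file estimate
in the comment above.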