[
https://issues.apache.org/jira/browse/IMPALA-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044376#comment-18044376
]
Quanlong Huang commented on IMPALA-14583:
-----------------------------------------
I think the most efficient way to create a large legacy table with millions of
files is the following (a sketch of all four steps follows the list):
# Create an external Hive table in parquet format, partitioned by (p int).
# Create the dirs and files in your local file system in parallel. The dirs
are named p=0, p=1, ..., p=2999999. Each dir can hold a copy of an identical
parquet file that has only one row.
# Upload the dirs into HDFS in parallel.
# Run ALTER TABLE <table> RECOVER PARTITIONS in Impala so the partitions are
created in one batch. Disabling event processing can speed up the process.
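Roughly, the steps look like the sketch below, assuming impala-shell and the
hdfs CLI are on the path; the table name big_tbl, the location /tmp/big_tbl,
the template file one_row.parq, and the parallelism values are all
placeholders:
{code:bash}
# Step 1: create the external partitioned table.
impala-shell -q "
  CREATE EXTERNAL TABLE big_tbl (id INT, name STRING)
  PARTITIONED BY (p INT)
  STORED AS PARQUET
  LOCATION '/tmp/big_tbl'"

# Step 2: create the partition dirs locally in parallel, each holding a copy
# of the same one-row parquet file.
seq 0 2999999 | xargs -P 16 -I{} sh -c \
  'mkdir -p big_tbl/p={} && cp one_row.parq big_tbl/p={}/'

# Step 3: upload the partition dirs into HDFS in parallel.
find big_tbl -mindepth 1 -maxdepth 1 -type d | \
  xargs -P 8 -I{} hdfs dfs -put -f {} /tmp/big_tbl/

# Step 4: create all the partitions in one batch. Disabling event processing
# first (e.g. restarting catalogd with --hms_event_polling_interval_s=0) can
# speed this up.
impala-shell -q "ALTER TABLE big_tbl RECOVER PARTITIONS"
{code}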
With this table in place, we can then convert it to Iceberg format to get an
Iceberg table.
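The migration itself is a single statement in recent Impala versions (big_tbl
is the placeholder table from the sketch above):
{code:bash}
# Migrate the legacy parquet table to Iceberg in place.
impala-shell -q "ALTER TABLE big_tbl CONVERT TO ICEBERG"
{code}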
I have a script for steps 2-3:
[https://github.com/stiga-huang/impala-event-processor-benchmark/blob/cb17b1b/create_hive_part_dirs.sh]
The corresponding parquet file has one row and 500 columns:
[https://github.com/stiga-huang/impala-event-processor-benchmark/blob/2fd9cb3/500_cols.parq]
For the current issue, I think the number of columns doesn't matter, so we can
use a simpler table that has just a few columns, and a simpler parquet file;
one way to generate it is sketched below.
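A sketch of one way to produce such a file: write a one-row table once and
pull its single data file back from HDFS. The warehouse path here is an
assumption; DESCRIBE FORMATTED shows the real location:
{code:bash}
# Write a one-row, few-column parquet table once.
impala-shell -q "CREATE TABLE one_row_tbl (id INT, name STRING) STORED AS PARQUET"
impala-shell -q "INSERT INTO one_row_tbl VALUES (1, 'x')"

# Fetch its single data file to reuse as the template for every partition.
# The path below is an assumption; check DESCRIBE FORMATTED one_row_tbl for
# the actual location.
hdfs dfs -get '/user/hive/warehouse/one_row_tbl/*.parq' one_row.parq
{code}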
> Limit the number of file descriptors per RPC to avoid JVM OOM in Catalog
> ------------------------------------------------------------------------
>
> Key: IMPALA-14583
> URL: https://issues.apache.org/jira/browse/IMPALA-14583
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog, Frontend
> Reporter: Noémi Pap-Takács
> Assignee: Mihaly Szjatinya
> Priority: Critical
> Labels: Catalog, OOM, impala-iceberg
>
> We often get OOM errors when Impala tries to load a very large Iceberg table.
> This happens because the Catalog loads all the file descriptors and sends
> them to the Coordinator in one RPC, serializing all the file descriptors into
> one big byte array. However, the JVM has a limit on array length, so trying
> to send the entire table in one call can exceed this limit if there are too
> many files in the table.
> We could limit the number of files per RPC, so that the 2GB JVM array limit
> is not exceeded.