[
https://issues.apache.org/jira/browse/HUDI-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-4919:
--------------------------------------
Description:
When using spark-sql MERGE INTO, the memory requirement shoots up: merging new
incoming data into a single 120MB parquet file can require more than 10GB of memory.
from user:
We are trying to process input data of about 5 GB (Parquet, snappy
compression), which will insert/update a Hudi table across 4 days (day is the
partition).
The data size in the Hudi target table for each partition is around 3.5GB to
10GB. Our process keeps failing with OOM
(java.lang.OutOfMemoryError: GC overhead limit exceeded).
We have tried with 32GB and 64GB of executor memory as well, with 3 cores.
Our process runs fine when we have fewer updates and more inserts.
It is a partitioned dataset, and each partition holds roughly 3.5GB to 10GB of
data. Max parquet file size is the default, so files are at most 120MB.
The input batch is spread across the last 3 to 4 partitions.
Incoming data is 5GB parquet-compressed.
The user tried giving close to 20GB per task (64GB executor w/ 3 cores) and
still hit memory issues and failed.
If the incoming batch has fewer updates, it works; otherwise it fails with
OOM. Tried with both BLOOM and SIMPLE indexes, but neither worked.
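The failing workload follows the standard Spark SQL MERGE INTO pattern against a day-partitioned Hudi table. A minimal sketch of the kind of statement involved (table and column names here are hypothetical, not taken from the user's report):

```sql
-- Hypothetical reproduction of the reported workload: upsert a ~5GB
-- incoming batch into a day-partitioned Hudi target table via MERGE INTO.
MERGE INTO hudi_target t
USING incoming_batch s
ON t.record_key = s.record_key AND t.day = s.day
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Hudi also exposes hoodie.memory.merge.max.size to bound the memory used by the merge handle's spillable map, but the report indicates the overhead occurred regardless of executor sizing.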
was:
When using spark-sql MERGE INTO, the memory requirement shoots up: merging new
incoming data into a single 120MB parquet file can require more than 10GB of memory.
from user:
We are trying to process input data of about 5 GB (Parquet, snappy
compression), which will insert/update a Hudi table across 4 days (day is the
partition).
The data size in the Hudi target table for each partition is around 3.5GB to
10GB. Our process keeps failing with OOM
(java.lang.OutOfMemoryError: GC overhead limit exceeded).
We have tried with 32GB and 64GB of executor memory as well, with 3 cores.
Our process runs fine when we have fewer updates and more inserts.
> Sql MERGE INTO incurs too much memory overhead
> ----------------------------------------------
>
> Key: HUDI-4919
> URL: https://issues.apache.org/jira/browse/HUDI-4919
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: sivabalan narayanan
> Assignee: Alexey Kudinkin
> Priority: Major
> Fix For: 0.13.0
>
>
> When using spark-sql MERGE INTO, the memory requirement shoots up: merging
> new incoming data into a single 120MB parquet file can require more than
> 10GB of memory.
>
> from user:
> We are trying to process input data of about 5 GB (Parquet, snappy
> compression), which will insert/update a Hudi table across 4 days (day is
> the partition).
> The data size in the Hudi target table for each partition is around 3.5GB
> to 10GB. Our process keeps failing with OOM
> (java.lang.OutOfMemoryError: GC overhead limit exceeded).
> We have tried with 32GB and 64GB of executor memory as well, with 3 cores.
> Our process runs fine when we have fewer updates and more inserts.
>
>
>
> It is a partitioned dataset, and each partition holds roughly 3.5GB to 10GB
> of data. Max parquet file size is the default, so files are at most 120MB.
> The input batch is spread across the last 3 to 4 partitions.
> Incoming data is 5GB parquet-compressed.
> The user tried giving close to 20GB per task (64GB executor w/ 3 cores) and
> still hit memory issues and failed.
> If the incoming batch has fewer updates, it works; otherwise it fails with
> OOM. Tried with both BLOOM and SIMPLE indexes, but neither worked.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)