[ 
https://issues.apache.org/jira/browse/HUDI-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4919:
--------------------------------------
    Description: 
When using spark-sql MERGE INTO, the memory requirement shoots up: merging new 
incoming data into a single 120MB parquet file needs more than 10GB of memory. 
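
For context, a minimal sketch of the kind of statement in question, assuming a Hudi 
target table named hudi_target and a staged source view incoming_batch (both 
hypothetical names; the actual schema and keys are not part of this report):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical table/view names and key columns; only the statement shape matters here.
val spark = SparkSession.builder()
  .appName("hudi-merge-into-repro")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  """
    |MERGE INTO hudi_target AS t
    |USING incoming_batch AS s
    |ON t.record_key = s.record_key AND t.day = s.day
    |WHEN MATCHED THEN UPDATE SET *
    |WHEN NOT MATCHED THEN INSERT *
    |""".stripMargin)
{code}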

 

From the user:

We are trying to process about 5 GB of input data (Parquet, snappy compression), 
which inserts/updates a Hudi table across 4 days (day is the partition column).
The data size in the Hudi target table is around 3.5GB to 10GB per partition. Our 
process keeps failing with OOM (java.lang.OutOfMemoryError: GC overhead limit exceeded).
We have tried 32GB and 64GB of executor memory with 3 cores.
The process runs fine when we have fewer updates and more inserts.

 

 

A recap of the issue: 

It is a partitioned dataset, and each partition holds roughly 3.5 to 10GB of data. 
The max parquet file size is the default, so files are at most 120MB.
The input batch is spread across the last 3 to 4 partitions, and the incoming data 
is 5GB parquet compressed.
The user tried giving close to 20GB of memory per task (64GB executor with 3 cores) 
and still hit memory issues and failed.
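
For reference, the executor sizing mentioned above maps to standard Spark settings; a 
sketch with the reported values (how they were actually passed, e.g. spark-submit 
flags vs. SparkConf, is not stated in the report):

{code:scala}
import org.apache.spark.sql.SparkSession

// Executor sizing the user reported trying; spark.executor.memory and
// spark.executor.cores are standard Spark configs, values are from the report.
val sparkWithReportedResources = SparkSession.builder()
  .appName("hudi-merge-into-repro")
  .config("spark.executor.memory", "64g")  // 32g was also tried
  .config("spark.executor.cores", "3")
  .getOrCreate()
{code}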

If the incoming batch has fewer updates, it works; otherwise it fails with OOM. Both 
the BLOOM and SIMPLE index types were tried, but neither helped. 
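
"BLOOM" and "SIMPLE" here refer to Hudi's hoodie.index.type write config, which 
controls how incoming records are tagged as inserts vs. updates. A sketch of switching 
it at the session level via Spark's SET command, assuming that is how it was toggled 
(the report does not say):

{code:scala}
// `spark` is the SparkSession from the first sketch above.
// BLOOM is the default bloom-filter based lookup per data file; SIMPLE does a
// plain join against existing keys. Both were tried without resolving the OOM.
spark.sql("SET hoodie.index.type=BLOOM")
// spark.sql("SET hoodie.index.type=SIMPLE")
{code}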

Similar incremental ingestion works without any issues with spark-ds (DataSource) 
writes. The issue only occurs with MERGE INTO via spark-sql. 
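
For contrast, a sketch of the spark-ds (DataSource) upsert path that reportedly 
handles the same incremental ingestion without issue; the paths, table name, and 
field names (record_key, day, ts) are hypothetical placeholders, while the hoodie.* 
option keys are standard Hudi write configs:

{code:scala}
import org.apache.spark.sql.SaveMode

// `spark` is the SparkSession from the first sketch above.
// Read the incoming parquet batch and upsert it through the DataSource writer.
val incoming = spark.read.parquet("/path/to/incoming_batch")    // hypothetical input path

incoming.write
  .format("hudi")
  .option("hoodie.table.name", "hudi_target")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "record_key")
  .option("hoodie.datasource.write.partitionpath.field", "day")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("/path/to/hudi_target")                                 // hypothetical table base path
{code}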

 

 


> Sql MERGE INTO incurs too much memory overhead
> ----------------------------------------------
>
>                 Key: HUDI-4919
>                 URL: https://issues.apache.org/jira/browse/HUDI-4919
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.13.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
