[
https://issues.apache.org/jira/browse/HUDI-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan closed HUDI-4919.
-------------------------------------
Resolution: Done
> Sql MERGE INTO incurs too much memory overhead
> ----------------------------------------------
>
> Key: HUDI-4919
> URL: https://issues.apache.org/jira/browse/HUDI-4919
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Fix For: 0.13.0
>
>
> When using spark-sql MERGE INTO, the memory requirement shoots up. Merging
> new incoming data into a 120MB parquet file requires more than 10GB of memory.
>
> from user:
> We are trying to process some input data of about 5 GB (Parquet, snappy
> compression), which inserts/updates a Hudi table across 4 day partitions
> (the table is partitioned by day).
> The data size in the Hudi target table for each partition is around 3.5GB to
> 10GB. Our process keeps failing with
> OOM (java.lang.OutOfMemoryError: GC overhead limit exceeded).
> We have tried with 32GB and 64GB of executor memory as well, with 3 cores.
> Our process runs fine when we have fewer updates and more inserts.
>
>
> A brief summary of the issue:
> It's a partitioned dataset, and each partition roughly has 3.5 to 10GB of
> data. The max parquet file size is the default, so files are at most 120MB.
> The input batch is spread across the last 3 to 4 partitions.
> The incoming data is 5GB parquet-compressed.
> The user tried giving close to 20GB per task (64 GB executor w/ 3 cores) and
> still hit memory issues and failed.
> If the incoming batch has fewer updates, it works; else it fails w/ OOM.
> Tried w/ both BLOOM and SIMPLE index types, but neither worked.
> Similar incremental ingestion works w/o any issues w/ spark-ds writes. The
> issue is only w/ MERGE INTO w/ spark-sql.
>
> Specifics about the table schema:
> The table has around 50 columns and there are no nested fields.
> All data types are generic ones like String, Timestamp, Decimal.
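>
> A minimal sketch of the kind of statement involved, assuming hypothetical
> table and column names (the actual tables are not given in the report):
>
>   MERGE INTO hudi_target t
>   USING incoming_batch s
>   ON t.record_key = s.record_key
>   WHEN MATCHED THEN UPDATE SET *
>   WHEN NOT MATCHED THEN INSERT *;
>
> This spark-sql MERGE INTO path is what hits the OOM; the equivalent
> spark-ds upsert of the same data reportedly succeeds.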
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)