codejoyan opened a new issue #3607:
URL: https://github.com/apache/iceberg/issues/3607
Hi Team,
Can you suggest a way to tune a slow-running MERGE query? It is taking
~20 minutes to upsert 1.5 million records.
Sample Merge query:
df.createOrReplaceTempView("source")
df.cache()
MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source)
ON target.col1 = source.col1 AND target.col2 = source.col2 AND target.col3 =
source.col3
WHEN MATCHED AND part_date_col between '2021-01-01' and '2021-01-16' THEN
UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
The source dataset is a temporary view containing 1.5 million records, all
with dates between '2021-01-01' and '2021-01-16'.
The target Iceberg table is partitioned by day and has 60 partitions. The
source dataset will only upsert a few trailing partitions of the target
table, but the Spark UI suggests the query is touching all partitions
instead of only those between '2021-01-01' and '2021-01-16'.
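One variant that might be worth trying is to put the date range on the target's partition column directly in the ON clause, since a literal predicate on the target there is something scan planning can potentially use to skip partitions. The sketch below only builds the statement text. The helper name `build_merge_sql` and its parameters are hypothetical, and the assumption that the ON-clause predicate is what enables pruning should be checked against the Spark UI.

```python
from datetime import date


def build_merge_sql(target, source_view, key_cols, partition_col, start, end):
    """Build a MERGE statement whose ON clause also bounds the target partition column.

    Hypothetical helper for illustration; it only formats a SQL string.
    """
    # Equality join on the key columns, target side against source side.
    join = " AND ".join(f"target.{c} = source.{c}" for c in key_cols)
    # Assumption: a literal range on the target partition column in the ON
    # clause gives the planner a filter it can use to skip other partitions.
    prune = (
        f"target.{partition_col} BETWEEN "
        f"'{start.isoformat()}' AND '{end.isoformat()}'"
    )
    return (
        f"MERGE INTO {target} target\n"
        f"USING (SELECT * FROM {source_view}) source\n"
        f"ON {join} AND {prune}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_sql(
    "iceberg_hive_cat.iceberg_poc_db.iceberg_tab",
    "source",
    ["col1", "col2", "col3"],
    "part_date_col",
    date(2021, 1, 1),
    date(2021, 1, 16),
)
print(sql)
# In a Spark session with the Iceberg SQL extensions enabled: spark.sql(sql)
```

Note that moving the range into the ON clause is only equivalent to the original query if no source row can match a target row outside '2021-01-01' to '2021-01-16'. Otherwise such rows would be treated as not matched and inserted.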
1. Can partition pruning happen in MERGE?
2. Is there a way to tune and improve the performance?
3. Is there a Java or Spark API to perform the MERGE instead of the SQL syntax?
Let me know if you need any further details.
Regards
Joyan