codejoyan opened a new issue #3607:
URL: https://github.com/apache/iceberg/issues/3607


   Hi Team,
   
   Can you suggest a way to tune a slow-running MERGE query? It takes
~20 minutes to upsert 1.5 million records.
   Sample Merge query:
   
   df.createOrReplaceTempView("source")
   df.cache()
   
   MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
   USING (SELECT * FROM source)
   ON target.col1 = source.col1 AND target.col2 = source.col2 AND target.col3 = source.col3
   WHEN MATCHED AND part_date_col BETWEEN '2021-01-01' AND '2021-01-16' THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *
   
   
   The source dataset is a temporary view containing 1.5 million records, 
with data between '2021-01-01' and '2021-01-16'.
   The target Iceberg table is partitioned by day and has 60 partitions. The 
source dataset only upserts into a few trailing partitions of the target 
table, but from the Spark UI it looks like the query is touching all 
partitions instead of only those between '2021-01-01' and '2021-01-16'.
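
   One variant that may be worth trying is to put the date range on the 
target's partition column into the ON clause, so that the target scan can be 
limited to those partitions. The sketch below only builds the SQL text. The 
helper name build_pruned_merge, the `source` alias on the subquery, and the 
assumption that part_date_col is the target's partition column are 
illustrative and not taken from any Iceberg API.

```python
from datetime import date


def build_pruned_merge(target, source_view, keys, part_col, start, end):
    """Build a MERGE statement whose ON clause bounds the target partition column.

    Hypothetical helper for illustration; it only formats a SQL string.
    """
    if not keys:
        raise ValueError("at least one join key is required")
    # Fail early on malformed dates (expects ISO format, e.g. 2021-01-01).
    date.fromisoformat(start)
    date.fromisoformat(end)

    join = " AND ".join(f"target.{k} = source.{k}" for k in keys)
    return (
        f"MERGE INTO {target} target\n"
        f"USING (SELECT * FROM {source_view}) source\n"
        f"ON {join}\n"
        f"   AND target.{part_col} BETWEEN '{start}' AND '{end}'\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_pruned_merge(
    "iceberg_hive_cat.iceberg_poc_db.iceberg_tab",
    "source",
    ["col1", "col2", "col3"],
    "part_date_col",  # assumed to be the target's partition column
    "2021-01-01",
    "2021-01-16",
)
print(sql)
# With an existing SparkSession named `spark` (not created here):
# spark.sql(sql)
```

   Note that this is not strictly equivalent to the query above: a source row 
whose keys match a target row outside the date range would now count as NOT 
MATCHED and be inserted, so it only fits if the source is known to stay 
within that range.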
   
   1. Can partition pruning happen in MERGE?
   2. Is there a way to tune and improve the performance?
   3. Is there a Java or Spark API to perform the MERGE instead of the SQL syntax?
   
   Let me know if you need any further details.
   
   Regards
   Joyan

