akghbti opened a new issue #4093:
URL: https://github.com/apache/iceberg/issues/4093


   Spark Version - 3.2.0
   Iceberg Version - 0.13
   
   There is a table 'table1' partitioned by _c1:
   
   +----------+---+---+
   | _c0|_c1|_c2|
   +----------+---+---+
   |1225526400| 1| a|
   |1228118400| 10| j|
   |1228377600| 11| k|
   |1228809600| 12| l|
   |1228982400| 13| m|
   |1229673600| 14| n|
   |1230019200| 15| o|
   |1230278400| 16| p|
   |1230451200| 17| q|
   |1230624000| 18| r|
   |1230710400| 19| s|
   |1225699200| 2| b|
   |1225785600| 3| c|
   |1226476800| 4| d|
   |1226908800| 5| e|
   |1226995200| 6| f|
   |1227513600| 7| g|
   |1227772800| 8| h|
   |1228032000| 9| i|
   |1230796800| 20| t|
   +----------+---+---+
   
   There are 25 part files in total.
   
   
   Now there is a second table 'table2' (the MERGE source), also partitioned by _c1:
   
   +----------+---+---+
   | _c0|_c1|_c2|
   +----------+---+---+
   |1228377600| 11| k|
   |1228809600| 12| l|
   |1228982400| 13| m|
   +----------+---+---+
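
   For context, here is a minimal sketch of how such tables could be created
   (the DDL below, including the STRING column types, is my reconstruction and
   is not taken from the report):

   sparkSession.sql("CREATE TABLE local.db.table1 (_c0 STRING, _c1 STRING, _c2 STRING) "
       + "USING iceberg PARTITIONED BY (_c1)");
   sparkSession.sql("CREATE TABLE local.db.table2 (_c0 STRING, _c1 STRING, _c2 STRING) "
       + "USING iceberg PARTITIONED BY (_c1)");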
   
   
   Now run the following query in Spark:
   
   sparkSession.sql("MERGE INTO local.db.table1 t USING (SELECT * FROM 
local.db.table2) u ON t._c1=u._c1 "
   + "WHEN MATCHED AND t._c1='13' THEN DELETE");
   
   The summary of the resulting snapshot is:
   
   "summary" : {
   "operation" : "overwrite",
   "spark.app.id" : "local-1644584660016",
   "added-data-files" : "2",
   "deleted-data-files" : "3",
   "added-records" : "2",
   "deleted-records" : "3",
   "added-files-size" : "1836",
   "removed-files-size" : "2754",
   "changed-partition-count" : "3",
   "total-records" : "25",
   "total-files-size" : "22883",
   "total-data-files" : "25",
   "total-delete-files" : "0",
   "total-position-deletes" : "0",
   "total-equality-deletes" : "0"
   }
   It shows that 3 part files were rewritten ("deleted-data-files" : "3").
   Ideally, only 1 input part file should have been rewritten, because the
   merge condition (t._c1 = '13') only affects 1 input part file.
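
   To check which data files the table holds per partition, Iceberg's files
   metadata table can be queried from Spark (illustrative; the column
   selection is my choice):

   // Inspect the table's current data files and their partitions
   sparkSession.sql("SELECT file_path, partition, record_count FROM local.db.table1.files").show(false);

   Comparing this listing before and after the MERGE should confirm which
   partitions' files were replaced.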
   
   If one runs the same operation as a plain DELETE query (shown below), the
   snapshot summary reflects what is expected.
   
   Plain alternate delete query: 
   
   -- sparkSession.sql("Delete from local.db.table1 WHERE _c1 in 
('11','12','13') AND _c1 = '13'")
   
   Here is the resulting snapshot summary:
   
   
   "summary" : {
   "operation" : "delete",
   "spark.app.id" : "local-1644585674579",
   "deleted-data-files" : "1",
   "deleted-records" : "1",
   "removed-files-size" : "918",
   "changed-partition-count" : "1",
   "total-records" : "25",
   "total-files-size" : "22883",
   "total-data-files" : "25",
   "total-delete-files" : "0",
   "total-position-deletes" : "0",
   "total-equality-deletes" : "0"
   }
   
   Comparing the two operations, MERGE with DELETE and plain DELETE, the plain
   DELETE is more efficient: it touches only 1 data file, while the MERGE
   rewrites 3.
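
   Both summaries can also be read back through the snapshots metadata table
   (illustrative query; the selected columns are my choice):

   sparkSession.sql("SELECT committed_at, operation, summary FROM local.db.table1.snapshots").show(false);

   As a possible workaround (my suggestion, not verified on 0.13), pushing the
   partition predicate into the ON clause should let the MERGE plan only the
   files of partition 13:

   // Hypothetical rewrite of the original MERGE; the extra ON predicate narrows file selection
   sparkSession.sql("MERGE INTO local.db.table1 t USING (SELECT * FROM local.db.table2) u "
       + "ON t._c1 = u._c1 AND t._c1 = '13' "
       + "WHEN MATCHED THEN DELETE");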
   
   
   
   
   
   
   

