ldwnt commented on issue #6956:
URL: https://github.com/apache/iceberg/issues/6956#issuecomment-1447723347

   > Ah, I see, using merge-on-read with Flink makes sense.
   > 
   > > And I have a question: with merge-on-read mode, in the worst case, does an executor have to read all delete records (in my case, possibly all the rows written before the whole-table delete)?
   > 
   > There is some logic involved to optimize this, but equality deletes aren't the best choice when it comes to performance, because at some point Flink will write a delete (`id=5`) that has to be applied to all subsequent data files, which is quite costly as you might imagine. Of course, this is limited to the partitions that you're reading; deletes belonging to partitions outside the scope of the query will be pruned.
   > 
   > What would also work is to periodically compact the table using a Spark job (ideally targeting the partitions that aren't being written to anymore), so you get rid of the deletes.
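
   For reference, the compaction suggested above is typically done with Iceberg's `rewrite_data_files` Spark procedure. Below is a minimal PySpark sketch, not the exact job in question: the catalog, database and table names are placeholders, the catalog settings assume a Hive metastore, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
# Minimal compaction sketch (assumptions: an Iceberg catalog registered as
# "my_catalog" backed by a Hive metastore; "db.my_table" is a placeholder).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-compaction")
    # Iceberg's SQL extensions are needed for CALL procedures.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hive")
    .getOrCreate()
)

# Rewrite small data files and apply the accumulated delete files, so the
# rewritten data files no longer carry equality/position deletes.
spark.sql(
    "CALL my_catalog.system.rewrite_data_files(table => 'db.my_table')"
).show()
```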
   
   The Iceberg table is written by a Flink job with an Iceberg sink whose source is a MySQL table. I cannot control how the source table is updated, which is by deleting all rows and inserting the new ones. In this scenario, can I make the connector use position deletes, or by any other means make the subsequent rewriting more efficient?
   
   Yes, the Spark job is run every day. Unfortunately, due to the way the source table is updated, I'm still facing 9 million row deletes in a single run.
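
   If it helps, depending on the Iceberg version the same `rewrite_data_files` procedure also accepts a `where` predicate, so the daily job can be scoped to partitions that are no longer being written to, along the lines of the quoted suggestion. A sketch reusing the session from the earlier example; the partition column `dt` and the cutoff value are hypothetical.

```python
# Hypothetical: compact only partitions older than a cutoff, so the daily run
# skips the partitions still receiving the delete/insert churn.
spark.sql(
    """
    CALL my_catalog.system.rewrite_data_files(
      table => 'db.my_table',
      where => 'dt < "2023-02-01"'
    )
    """
).show()
```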

