liubo1022126 commented on pull request #3130: URL: https://github.com/apache/iceberg/pull/3130#issuecomment-921432832
There is a problem with this PR. When the write job starts without restored state, it reads the watermark from the current snapshot. If the current snapshot was not produced by an append operation (for example, a delete or a rewrite), it carries no watermark, because watermark advancement is only implemented in the streaming write path.

I can think of two solutions:

1. Add watermark propagation to the other operations, such as delete and rewrite. This has a problem: when streaming writes run concurrently with a file rewrite, new data is commonly committed between the start and the end of the rewrite. Suppose snapshot s1 has watermark w1 when the rewrite begins, the streaming writer then commits snapshot s2 with watermark w2, and the rewrite finally completes as snapshot s3 with watermark w3. To compute w3 we would have to not only carry w1 forward but also take w2 into account. That looks complicated and hard to understand.
2. Keep the current implementation in this PR and record the watermark only in append operations. When we need the table's current watermark, we walk back through the snapshot history until we find the watermark recorded by the most recent append.

I prefer solution 2. What do you think? @stevenzwu @rdblue @openinx
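
To make solution 2 concrete, here is a minimal sketch of the backtracking lookup. It is not the code in this PR: `Snap` is a hypothetical stand-in for a table snapshot, the summary key `"watermark"` is an assumed name, and the example assumes a rewrite is committed with the `replace` operation. The history mirrors the s1/s2/s3 scenario above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

public class WatermarkLookup {
  // Assumed summary key; the property name used by the PR may differ.
  public static final String WATERMARK_KEY = "watermark";

  /** Hypothetical stand-in for a table snapshot (not the Iceberg Snapshot interface). */
  public static final class Snap {
    final long id;
    final Long parentId;                // null for the first snapshot
    final String operation;             // "append", "delete", "replace", ...
    final Map<String, String> summary;  // snapshot summary properties

    public Snap(long id, Long parentId, String operation, Map<String, String> summary) {
      this.id = id;
      this.parentId = parentId;
      this.operation = operation;
      this.summary = summary;
    }
  }

  /**
   * Walks from the current snapshot towards its ancestors and returns the
   * watermark of the most recent append that recorded one, if any.
   */
  public static OptionalLong latestWatermark(Map<Long, Snap> snapshots, Long currentId) {
    Long id = currentId;
    while (id != null) {
      Snap snap = snapshots.get(id);
      if (snap == null) {
        break; // ancestor is no longer available (e.g. expired)
      }
      String value = snap.summary.get(WATERMARK_KEY);
      if ("append".equals(snap.operation) && value != null) {
        return OptionalLong.of(Long.parseLong(value));
      }
      id = snap.parentId; // skip deletes, rewrites, and appends without a watermark
    }
    return OptionalLong.empty();
  }

  public static void main(String[] args) {
    Map<Long, Snap> history = new HashMap<>();
    Map<String, String> w1 = new HashMap<>();
    w1.put(WATERMARK_KEY, "100");
    Map<String, String> w2 = new HashMap<>();
    w2.put(WATERMARK_KEY, "200");

    history.put(1L, new Snap(1L, null, "append", w1));               // s1 with w1
    history.put(2L, new Snap(2L, 1L, "append", w2));                 // s2 with w2
    history.put(3L, new Snap(3L, 2L, "replace", new HashMap<>()));   // s3: rewrite, no watermark

    // Starting from s3, the lookup skips the rewrite and finds w2 on s2.
    System.out.println(latestWatermark(history, 3L).getAsLong()); // prints 200
  }
}
```

With this approach the rewrite never has to compute a w3 of its own: a reader starting from s3 simply sees w2, the watermark of the latest streaming commit in its ancestry.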
