openinx commented on pull request #3258: URL: https://github.com/apache/iceberg/pull/3258#issuecomment-941000106
Thanks all for pinging me ! Let's provide more original background for this [validation](https://github.com/apache/iceberg/blob/e20088449daec9ed431754044b520b3ac5fa3eaa/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L286): it was added for validating the existence of data files that were referenced from the pos-delete files introduced in the current TXN which is planning to commit but have not committed yet. After I read the implementation of the `validateDataFilesExist` method, I think I misunderstood the meaning of it before, because I thought it could validate the existence of data files that was planing to commit in the current TXN. While the real meaning is: validate the existence of the data files after going through the history committed txn and ensuring that NO other OVERWRITE/REPLACE TXN has deleted the required data files. About the unit test broken [testValidateDataFileExist](https://github.com/apache/iceberg/blob/8104769f81ba79338fd3c94d5bd9267f22d31ed7/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergFilesCommitter.java#L720) by this PR, I think it is addressing the case that will never happen in the real CDC user case, because its steps are following: * Step.1 : Start a txn.1 to insert a row `<1, 'aaa'>`, which produces a `data-file1`; * Step.2 : Start a txn.2 to replace the whole iceberg table with a newly produced row `<2, 'bbb'>`, it will remove the old data file `data-file-1` and add a new data file `data-file2`; * Step.3 : Start a txn.3 to commit a row delta txn , which includes a pos-delete to delete the old row `<1, 'aaa'>`. The key point is: in the real flink CDC user case, we won't produce any pos-deletes in a new txn to delete a row coming from an older txn, we just produce pos-deletes to delete a row that comes from the same txn as the pos-delete ( as @rdblue said, `Because position deletes will only reference data files being added to the table, there is no possibility that those files are concurrently deleted` ). In summary, the unit test is trying to address a case that will never happen in the real user case, so I would like to remove this unit test in this PR. Thanks all for the patient work ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
