openinx commented on pull request #3258:
URL: https://github.com/apache/iceberg/pull/3258#issuecomment-941000106


   Thanks all for pinging me ! Let's provide more original background for this 
[validation](https://github.com/apache/iceberg/blob/e20088449daec9ed431754044b520b3ac5fa3eaa/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L286):
  it was added for validating the existence of data files that were referenced 
from the pos-delete files introduced in the current TXN which is planning to 
commit but have not committed yet.
   
   After I read the implementation of the `validateDataFilesExist` method,  I 
think I misunderstood the meaning of it before, because I thought it could 
validate the existence of data files that was planing to commit in the current 
TXN. While the real meaning is:  validate the existence of the data files after 
going through the history committed txn and ensuring that NO other 
OVERWRITE/REPLACE TXN has deleted the required data files. 
   
   About the unit test broken 
[testValidateDataFileExist](https://github.com/apache/iceberg/blob/8104769f81ba79338fd3c94d5bd9267f22d31ed7/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergFilesCommitter.java#L720)
 by this PR,  I think it is addressing the case that will never happen in the 
real CDC user case, because its steps are following: 
   
   * Step.1 :  Start a txn.1 to insert a row `<1, 'aaa'>`, which produces a 
`data-file1`; 
   * Step.2 : Start a txn.2 to replace the whole iceberg table with a newly 
produced row `<2, 'bbb'>`,  it will remove the old data file `data-file-1` and 
add a new data file `data-file2`;
   * Step.3 :  Start a txn.3 to commit a row delta txn , which includes a 
pos-delete to delete the old row `<1, 'aaa'>`.
   
   The key point is:  in the real flink CDC user case,  we won't produce any 
pos-deletes in a new txn to delete a row coming from an older txn, we just 
produce pos-deletes to delete a row that comes from the same txn as the 
pos-delete ( as @rdblue said, `Because position deletes will only reference 
data files being added to the table, there is no possibility that those files 
are concurrently deleted` ).  In summary,  the unit test is trying to address a 
case that will never happen in the real user case,  so I would like to remove 
this unit test in this PR. 
   
   Thanks all for the patient work !
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to