openinx edited a comment on pull request #3307: URL: https://github.com/apache/iceberg/pull/3307#issuecomment-946683261
Thanks @nastra for bringing this up. I found a case that would break this flaky unit test, starting from [here](https://github.com/apache/iceberg/blob/7ff9eb4a49c836dc8da0768153d77f19f405d649/flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java#L275): in the unit test, we write the following records into an Apache Iceberg table, shuffled by the partition field `data` (the parallelism is 2):

```
(1, 'aaa'), (1, 'bbb'), (1, 'ccc')
(2, 'aaa'), (2, 'bbb'), (2, 'ccc')
(3, 'aaa'), (3, 'bbb'), (3, 'ccc')
```

Since the streaming job may produce multiple checkpoints while it is running, the records can be written across checkpoints like this:

* checkpoint#1
  * (1, 'aaa')
  * (1, 'bbb')
  * (1, 'ccc')
* checkpoint#2
  * (2, 'aaa')
  * (2, 'bbb')
  * (2, 'ccc')
  * (3, 'aaa')
  * (3, 'bbb')
  * (3, 'ccc')

Each checkpoint then produces a separate data file for each partition. Let's say:

* checkpoint#1
  * produces `data-file-1` for partition `aaa`
  * produces `data-file-2` for partition `bbb`
  * produces `data-file-3` for partition `ccc`
* checkpoint#2
  * produces `data-file-4` for partition `aaa`
  * produces `data-file-5` for partition `bbb`
  * produces `data-file-6` for partition `ccc`

In the [IcebergFilesCommitter](https://github.com/apache/iceberg/blob/0abaa7ce670f74493cdc5aad43ec56c25d4a54d3/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L266) we use the default MergeAppend to merge the manifests. Because MergeAppend rewrites the small manifest from the previous commit together with the new one, we end up with:

* checkpoint#1
  * manifest-1: includes `data-file-1`, `data-file-2`, `data-file-3`
* checkpoint#2
  * manifest-2: includes `data-file-1`, `data-file-2`, `data-file-3`, `data-file-4`, `data-file-5`, `data-file-6`

Finally, in this [line](https://github.com/apache/iceberg/blob/7ff9eb4a49c836dc8da0768153d77f19f405d649/flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java#L275) we encounter `data-file-1` twice in the `result` map, and the assertion fails.

I think we only need to collect the data files that are newly added in each commit to fix this unit test.
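For example, here is a rough sketch of what I mean (an illustration only, not the actual patch; it assumes `Snapshot#addedDataFiles(FileIO)`, which newer Iceberg versions provide — older versions have the deprecated `Snapshot#addedFiles()` instead — and the class/method names below are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.FileIO;

public class AddedFilesPerCommit {

  // Collect only the data files that each snapshot newly added, keyed by file path,
  // instead of scanning every entry of every manifest.
  static Map<String, DataFile> newlyAddedFiles(Table table) {
    Map<String, DataFile> result = new HashMap<>();
    FileIO io = table.io();
    for (Snapshot snapshot : table.snapshots()) {
      // addedDataFiles returns only the files this snapshot added (entry status ADDED),
      // so data files rewritten into a merged manifest by MergeAppend are not repeated.
      for (DataFile file : snapshot.addedDataFiles(io)) {
        result.put(file.path().toString(), file);
      }
    }
    return result;
  }
}
```

Since a data file only shows up with entry status `ADDED` in the snapshot that actually added it, the files carried over into the merged manifest are skipped and the `result` map would no longer see the same file twice.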
