openinx commented on pull request #3307:
URL: https://github.com/apache/iceberg/pull/3307#issuecomment-946683261


   Thanks @nastra for bringing this up.  I found a case that would break the flaky unit test assertion [here](https://github.com/apache/iceberg/blob/7ff9eb4a49c836dc8da0768153d77f19f405d649/flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java#L275):
   
   In the unit test, we write the following records into the Apache Iceberg table, shuffled by the partition field `data` (the write parallelism is 2):
   
   ```
   (1, 'aaa'), (1, 'bbb'), (1, 'ccc')
   (2, 'aaa'), (2, 'bbb'), (2, 'ccc')
   (3, 'aaa'), (3, 'bbb'), (3, 'ccc')
   ```
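
   For context, here is a minimal sketch of the table layout this scenario assumes: a two-column `(id, data)` schema with identity partitioning on `data`, so every distinct `data` value maps to its own partition.  The field ids and nullability below are illustrative, not copied from the test:
   
   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   public class PartitionedLayoutSketch {
     public static void main(String[] args) {
       // Two-column schema matching the (id, data) records above; field ids are illustrative.
       Schema schema = new Schema(
           Types.NestedField.optional(1, "id", Types.IntegerType.get()),
           Types.NestedField.optional(2, "data", Types.StringType.get()));
   
       // Identity partitioning on `data`: 'aaa', 'bbb' and 'ccc' each become their own
       // partition, so each checkpoint writes at most one data file per value here.
       PartitionSpec spec = PartitionSpec.builderFor(schema).identity("data").build();
       System.out.println(spec);
     }
   }
   ```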
   
   Since the streaming job may trigger multiple checkpoints while it is running, it's possible that the records are written across checkpoints like this:
   
   * checkpoint#1
     * (1, 'aaa')
     * (1, 'bbb')
     * (1, 'ccc')
   
   * checkpoint#2
     * (2, 'aaa')
     * (2, 'bbb')
     * (2, 'ccc')
     * (3, 'aaa')
     * (3, 'bbb')
     * (3, 'ccc')
   
   The writers then produce a separate data file for each partition within each checkpoint.  Let's say:
   
   * checkpoint#1 
     * produces `data-file-1` for partition `aaa`
     * produces `data-file-2` for partition `bbb`
     * produces `data-file-3` for partition `ccc`
   
   * checkpoint#2
     * produces `data-file-4` for partition `aaa`
     * produces `data-file-5` for partition `bbb`
     * produces `data-file-6` for partition `ccc`
   
   In the [IcebergFilesCommitter](https://github.com/apache/iceberg/blob/0abaa7ce670f74493cdc5aad43ec56c25d4a54d3/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L266), we use the default MergeAppend to merge the manifests (see the sketch after the list below).  Then we produce:
   
   * checkpoint#1
     * manifest-1: includes `data-file-1`, `data-file-2`, `data-file-3`
   
   * checkpoint#2
     * manifest-2: includes `data-file-1`, `data-file-2`, `data-file-3`, `data-file-4`, `data-file-5`, `data-file-6`
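
   For reference, a minimal sketch of the append style involved, assuming a loaded `table` and hypothetical `DataFile` handles named like the ones above.  `newAppend()` is the merging append; a `newFastAppend()` would instead keep one fresh manifest per commit:
   
   ```java
   import org.apache.iceberg.AppendFiles;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   
   class MergeAppendSketch {
     // Hypothetical helper: commit checkpoint#2's files the way the committer does.
     static void commitCheckpoint2(Table table, DataFile dataFile4, DataFile dataFile5, DataFile dataFile6) {
       // table.newAppend() returns the merging append (MergeAppend) by default.
       AppendFiles append = table.newAppend();
       append.appendFile(dataFile4);
       append.appendFile(dataFile5);
       append.appendFile(dataFile6);
       // Committing may fold manifest-1's entries into the new manifest-2.
       append.commit();
     }
   }
   ```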
   
   Finally, at this [line](https://github.com/apache/iceberg/blob/7ff9eb4a49c836dc8da0768153d77f19f405d649/flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java#L275), we encounter `data-file-1` twice while building the `result` map, and the assertion fails.
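
   To make the double counting concrete, here is a hypothetical sketch (not the exact test code) of a check that walks every snapshot's manifests and accumulates data files into a `result` map keyed by path.  With the merged manifest-2, `data-file-1` is seen under both snapshots.  The `dataManifests(FileIO)` call assumes a newer API; older releases expose a no-arg `dataManifests()`:
   
   ```java
   import java.io.IOException;
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.ManifestFile;
   import org.apache.iceberg.ManifestFiles;
   import org.apache.iceberg.ManifestReader;
   import org.apache.iceberg.Snapshot;
   import org.apache.iceberg.Table;
   
   class DuplicateFileSketch {
     // Hypothetical check: collect every data file referenced by every snapshot's manifests.
     static Map<String, DataFile> collectAllManifestEntries(Table table) throws IOException {
       Map<String, DataFile> result = new HashMap<>();
       for (Snapshot snapshot : table.snapshots()) {
         for (ManifestFile manifest : snapshot.dataManifests(table.io())) {
           try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
             for (DataFile file : reader) {
               // data-file-1 appears in both manifest-1 (snapshot#1) and the merged
               // manifest-2 (snapshot#2), so it is collected twice and a
               // "no duplicates" style assertion on this map fails.
               result.put(file.path().toString(), file);
             }
           }
         }
       }
       return result;
     }
   }
   ```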
   
   I think that, to fix this unit test, we only need to collect the data files that are newly added by each commit.
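
   A minimal sketch of that idea, assuming `Snapshot#addedDataFiles(FileIO)` is available (older releases expose `addedFiles()` instead).  It only looks at the files each commit added, so merged manifests no longer cause double counting:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Snapshot;
   import org.apache.iceberg.Table;
   
   class AddedFilesSketch {
     // Hypothetical helper: collect only the data files newly added by each commit.
     static Map<String, DataFile> collectAddedFiles(Table table) {
       Map<String, DataFile> result = new HashMap<>();
       for (Snapshot snapshot : table.snapshots()) {
         for (DataFile file : snapshot.addedDataFiles(table.io())) {
           // Each data file is added by exactly one snapshot, so it shows up here once.
           result.put(file.path().toString(), file);
         }
       }
       return result;
     }
   }
   ```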
   
   
   
   
   
   
   

