pvary opened a new issue, #5339:
URL: https://github.com/apache/iceberg/issues/5339
While reviewing #4904, I found the following behaviour with a slightly modified
`TestIcebergInputFormats.testFilterExp` test:
```java
[..]
helper = new TestHelper(conf, tables, location.toString(), SCHEMA, SPEC,
    fileFormat, temp);
[..]
helper.createTable();
List<Record> expectedRecords = helper.generateRandomRecords(2, 0L);
expectedRecords.get(0).set(2, "2020-03-20");
expectedRecords.get(1).set(2, "2020-03-20");
DataFile dataFile1 = helper.writeFile(Row.of("2020-03-20", 0),
    expectedRecords);
DataFile dataFile2 = helper.writeFile(Row.of("2020-03-21", 0),
    helper.generateRandomRecords(2, 0L));
// This creates a transaction and adds the data files to it using 'table.newAppend()'
helper.appendToTable(dataFile1, dataFile2);
// Adding the same files again to the same table
helper.appendToTable(dataFile1, dataFile2);
```
The test adds the same data files to the Iceberg table twice.
As a result, the table contains duplicate rows, which is what I would
expect as long as we do not prevent this situation in the first place.
I have not tested it yet, but based on the specification it is not possible to
deduplicate the data using any of the V2 delete formats. Deduplication is only
possible with knowledge of the data and of the data files of the Iceberg table.
Questions for the community:
- Do we consider this expected behaviour?
- Do we want to prevent this situation by checking the uniqueness of the
  file names when adding new data files to a table? If so, what should we do
  when a duplicate is found?
  - Throw an exception?
  - Log a warning message and skip adding the file?
My first instinct would be to prevent adding the same file to the same table
again and throw an exception, but I would like to hear what others think about
this issue.
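
To make the two options concrete, here is a minimal sketch of what such a check could look like. It is only a model of the idea: `DuplicateFileGuard` and its `Policy` enum are hypothetical names and not part of the Iceberg API, and the in-memory set of paths stands in for whatever lookup against the table metadata a real implementation would need.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT an Iceberg class: it only models the uniqueness check.
final class DuplicateFileGuard {
  // The two behaviours discussed above.
  enum Policy { THROW, SKIP }

  // Assumption: paths of files already in the table are tracked in memory.
  private final Set<String> knownPaths = new HashSet<>();
  private final Policy policy;

  DuplicateFileGuard(Policy policy) {
    this.policy = policy;
  }

  // Returns the paths accepted for the append and remembers them.
  List<String> accept(List<String> newPaths) {
    Set<String> seenInBatch = new HashSet<>();
    List<String> accepted = new ArrayList<>();
    for (String path : newPaths) {
      // A path is a duplicate if the table has it or this batch repeats it.
      boolean duplicate = knownPaths.contains(path) || !seenInBatch.add(path);
      if (!duplicate) {
        accepted.add(path);
      } else if (policy == Policy.THROW) {
        // Thrown before any state changes, so a failed append adds nothing.
        throw new IllegalStateException("Data file already in table: " + path);
      } else {
        System.err.println("WARN: skipping duplicate data file " + path);
      }
    }
    knownPaths.addAll(accepted);
    return accepted;
  }
}
```

With `Policy.THROW`, the second `appendToTable(dataFile1, dataFile2)` call in the test above would correspond to a second `accept` call with the same paths, which fails. With `Policy.SKIP`, the same call would return an empty list and only log warnings.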
Thanks,
Peter