krisnaru opened a new issue, #14458: URL: https://github.com/apache/iceberg/issues/14458
### Apache Iceberg version

1.10.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

The snapshotId filtering logic was incorrectly excluding live data files during table copy operations. `entry.snapshotId()` records when a data file was initially added, not which snapshots currently reference it. After manifest compaction or snapshot expiration, a snapshot can reference manifests containing entries with expired snapshotIds, but those files are still live and must be copied.

The check `snapshotIds.contains(entry.snapshotId())` was fundamentally wrong because it filtered out data files whose original snapshot had expired, even though they were still referenced by the snapshot(s) being copied.

This bug likely affects many production tables where manifest compaction has run. Customers may not notice the issue if they don't query the missing data files.

### Willingness to contribute

- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
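To make the failure mode concrete, here is a minimal, self-contained sketch of the two filtering strategies. It does not use the Iceberg API: `ManifestCopySketch`, `Entry`, `filesByAddingSnapshot`, and `filesByLiveness` are hypothetical names introduced only for illustration, and the liveness-based filter is an assumed direction for a fix, not necessarily the change that will be merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT Iceberg classes.
final class ManifestCopySketch {

  /** Simplified manifest entry: the snapshot that first added the file, its path, and liveness. */
  static final class Entry {
    final long snapshotId; // snapshot that originally added the file
    final String path;
    final boolean live;    // false if the entry marks the file as deleted

    Entry(long snapshotId, String path, boolean live) {
      this.snapshotId = snapshotId;
      this.path = path;
      this.live = live;
    }
  }

  // Filter as described in the issue: keeps an entry only if the snapshot
  // that ADDED the file is among the snapshots being copied.
  static List<String> filesByAddingSnapshot(List<Entry> entries, Set<Long> snapshotIds) {
    List<String> result = new ArrayList<>();
    for (Entry e : entries) {
      if (e.live && snapshotIds.contains(e.snapshotId)) {
        result.add(e.path);
      }
    }
    return result;
  }

  // Assumed alternative: the entries already come from a manifest that a copied
  // snapshot references, so every live entry is kept regardless of who added it.
  static List<String> filesByLiveness(List<Entry> entries) {
    List<String> result = new ArrayList<>();
    for (Entry e : entries) {
      if (e.live) {
        result.add(e.path);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Snapshot 1 added a.parquet and has since expired; snapshot 2 added b.parquet.
    // After manifest compaction, the manifest referenced by snapshot 2 holds both entries.
    List<Entry> manifestOfSnapshot2 = List.of(
        new Entry(1L, "a.parquet", true),
        new Entry(2L, "b.parquet", true));
    Set<Long> copiedSnapshots = Set.of(2L);

    System.out.println(filesByAddingSnapshot(manifestOfSnapshot2, copiedSnapshots));
    System.out.println(filesByLiveness(manifestOfSnapshot2));
  }
}
```

Running `main` prints `[b.parquet]` for the snapshot-ID filter and `[a.parquet, b.parquet]` for the liveness filter: `a.parquet` is dropped by the first check even though the copied snapshot still references it.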
