I do recall an issue where duplicate data/delete files were possible, but I'm not sure if that's the underlying cause in your case. The issue was fixed by #10007 <https://github.com/apache/iceberg/pull/10007> and was shipped with Iceberg 1.6.0.
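
If you want to check whether duplicate file entries are involved, here is a minimal sketch, not a confirmed diagnostic. It assumes you have already collected the file paths the table references (for example from the file_path column of the table's `files` metadata table) into a plain list. The helper name `find_duplicate_paths` and the sample paths are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_paths(file_paths: Iterable[str]) -> Dict[str, int]:
    """Return every path that occurs more than once, mapped to its count."""
    counts = Counter(file_paths)
    return {path: n for path, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical sample paths; real ones would come from the table's metadata.
    sample_paths = [
        "s3://bucket/db/tbl/data/00000-a.parquet",
        "s3://bucket/db/tbl/data/00001-b.parquet",
        "s3://bucket/db/tbl/data/00000-a.parquet",
    ]
    for path, n in find_duplicate_paths(sample_paths).items():
        print(f"{path} is referenced {n} times")
```

An empty result would only rule out this particular cause; it says nothing about other ways duplicates could appear at read time.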

On Thu, Nov 7, 2024 at 11:12 PM Lewis, William <wimle...@amazon.com.invalid> wrote:

> On 2024/03/13 22:38:06 Shwetha Dharmarajan wrote:
> > We are using Apache Iceberg with AWS Glue. We are seeing an issue where
> > duplicates are getting inserted into the table, even after making sure
> > there are no duplicates in the data being upserted into the table. We use
> > MERGE sql to upsert data into the table.
> >
> > We also see an issue where duplicates appear in the SELECT sql query,
> > when queried using spark SQL. But when we query the same table using
> > Athena, we don’t see any duplicates in the table.
>
> Did you ever find a solution to this? We’re experiencing what seems to be
> a very similar issue:
>
> - Problem occurs only in some tables, and (as far as we can tell) only in
>   Glue/Spark, not Athena/Trino
> - Iceberg 1.0.0 as found in Glue 4.0; newer Iceberg as used by Athena
> - Table writes are via MERGE INTO sql
> - Not (explicitly) using any branching or tagging features
>
> Additionally, we're using Iceberg format 2, with
> write.merge.mode=merge-on-read (our writes are mostly inserts). One of our
> jobs occasionally sprays the table with a largish number (~20k-50k) of tiny
> parquet files, which eventually get coalesced by iceberg's
> rewrite_data_files() procedure - that's the only thing that we can think of
> that is different about the problem tables.
>
> Because merge-on-read seems to be a less commonly used mode, or at least
> less common in the 1.0.0 era, I wonder if there is a bug in the merging of
> updates during read.
>
> If this reminds anyone of a known issue in older versions of Iceberg, I
> would very much appreciate any pointers to more info (issue tracker,
> commits, vague anecdotes, etc.).