Hi Luis,

You're right about what's happening. Because the Avro appender doesn't track column-level stats, Iceberg can't determine that the file only contains matching data rows and can be deleted. Parquet does keep those stats, so even though the partitioning doesn't guarantee the delete is safe, Iceberg can determine that it is.
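To make the difference concrete, here is a minimal sketch of the kind of strict check involved for an equality filter. It is not Iceberg's actual StrictMetricsEvaluator code: ColumnStats and StrictEqSketch are made-up names, the stats are reduced to a min/max pair for a single long column, and null counts are ignored. The idea is that a file is provably "all rows match id == v" only when both bounds are present and equal to v. Without bounds, the check has to answer "can't prove it".

```java
import java.util.OptionalLong;

// Hypothetical, simplified stand-in for per-file column stats (not an Iceberg class).
final class ColumnStats {
    final OptionalLong min;
    final OptionalLong max;

    private ColumnStats(OptionalLong min, OptionalLong max) {
        this.min = min;
        this.max = max;
    }

    // Stats as a writer that tracks bounds would report them.
    static ColumnStats of(long min, long max) {
        return new ColumnStats(OptionalLong.of(min), OptionalLong.of(max));
    }

    // Stats as a writer that tracks no bounds would report them.
    static ColumnStats missing() {
        return new ColumnStats(OptionalLong.empty(), OptionalLong.empty());
    }
}

final class StrictEqSketch {
    // Returns true only when the stats prove that every row satisfies column == value.
    // Returning false means "cannot prove it", not "no rows match".
    static boolean allRowsMustMatch(ColumnStats stats, long value) {
        if (!stats.min.isPresent() || !stats.max.isPresent()) {
            return false; // no bounds, so nothing can be proven
        }
        return stats.min.getAsLong() == value && stats.max.getAsLong() == value;
    }

    public static void main(String[] args) {
        // A file holding only id = 1, written with bounds tracked.
        System.out.println(allRowsMustMatch(ColumnStats.of(1L, 1L), 1L)); // prints true
        // The same data written without bounds.
        System.out.println(allRowsMustMatch(ColumnStats.missing(), 1L)); // prints false
    }
}
```

In the second case the file may contain matching rows but can't be shown to contain only matching rows, which is the situation the "some, but not all, rows match filter" validation reports.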
The solution is to add column-level stats for Avro files. Is that something you're interested in working on?

rb

On Thu, Mar 12, 2020 at 10:09 AM Luis Otero <lote...@gmail.com> wrote:

> Hi,
>
> AvroFileAppender doesn't report min/max values (
> https://github.com/apache/incubator-iceberg/blob/80cbc60ee55911ee627a7ad3013804394d7b5e9a/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L60
> ).
>
> As a side effect (I think) overwrite operations (if there are data files
> with the same partition) fail with "Cannot delete file where some, but not
> all, rows match filter" because StrictMetricsEvaluator can't confirm all
> rows match.
>
> For instance, if you modify TestLocalScan with:
>
> this.partitionSpec =
>     PartitionSpec.builderFor(SCHEMA).bucket("id",10).build();
>
> this.file1Records = new ArrayList<Record>();
> file1Records.add(record.copy(ImmutableMap.of("id", 60L, "data",
>     UUID.randomUUID().toString())));
> DataFile file1 = writeFile(sharedTable.location(),
>     format.addExtension("file-1"), file1Records);
>
> this.file2Records = new ArrayList<Record>();
> file2Records.add(record.copy(ImmutableMap.of("id", 1L, "data",
>     UUID.randomUUID().toString())));
> DataFile file2 = writeFile(sharedTable.location(),
>     format.addExtension("file-2"), file2Records);
>
> this.file3Records = new ArrayList<Record>();
> file3Records.add(record.copy(ImmutableMap.of("id", 1L, "data",
>     UUID.randomUUID().toString())));
> DataFile file3 = writeFile(sharedTable.location(),
>     format.addExtension("file-3"), file3Records);
>
> sharedTable.newAppend()
>     .appendFile(file1)
>     .commit();
>
> sharedTable.newAppend()
>     .appendFile(file2)
>     .commit();
>
> sharedTable.newOverwrite()
>     .overwriteByRowFilter(equal("id",1L))
>     .addFile(file3)
>     .commit();
>
>
> Fails with 'org.apache.iceberg.exceptions.ValidationException: Cannot
> delete file where some, but not all, rows match filter ref(name="id") == 1:
> file:/AVRO/file-2.avro' for AVRO format but works fine for PARQUET format.
>
> Am I missing something here?
>
> Thanks!!
>

-- 
Ryan Blue
Software Engineer
Netflix