aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317398607
########## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ########## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } + if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List<DataFile> newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { + ValidationException.check( + !inclusive.eval(newFile.partition()) || !metrics.eval(newFile), + "A conflicting file was appended that matches filter '%s': %s", + conflictDetectionFilter, newFile.path()); + } + } + return super.apply(base); } + + private List<DataFile> collectNewFiles(TableMetadata meta) { + List<DataFile> newFiles = new ArrayList<>(); + + Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); + while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { Review comment: This part is only invoked if we call `validateNoConflictingAppends`, which always accepts a valid `readSnapshotId` and an expression for detecting conflicts. If `validateNoConflictingAppends` is not called, there will be no validation and the behavior would be as it was before this change. We will call `validateNoConflictingAppends(null, expr)` only if we start with an empty table, meaning that we have to scan the entire history. We can consider setting `readSnapshotId` and `conflictDetectionFilter` independently, but I am not sure whether it gives us any benefits. Also, it will lead to correctness errors if we won't set `readSnapshotId` in transactions correctly. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org