rdblue commented on a change in pull request #1738:
URL: https://github.com/apache/iceberg/pull/1738#discussion_r520954934



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkTable.java
##########
@@ -155,6 +163,39 @@ public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
     return new SparkWriteBuilder(sparkSession(), icebergTable, info);
   }
 
+  @Override
+  public MergeBuilder newMergeBuilder(LogicalWriteInfo info) {
+    String mode = icebergTable.properties().getOrDefault(WRITE_ROW_LEVEL_MODE, WRITE_ROW_LEVEL_MODE_DEFAULT);
+    ValidationException.check(mode.equals("copy-on-write"), "Unsupported row operations mode: %s", mode);
+    return new SparkMergeBuilder(sparkSession(), icebergTable, info);
+  }
+
+  @Override
+  public boolean canDeleteWhere(Filter[] filters) {
+    // TODO: multiple partition specs
+    Set<Integer> partitionFieldSourceIds = PartitionUtil.sourceFieldIds(table().spec());
+    Schema schema = table().schema();
+
+    for (Filter filter : filters) {
+      // return false if we have a predicate on a non-partition column or if we cannot translate the filter
+      if (isDataFilter(filter, schema, partitionFieldSourceIds) || SparkFilters.convert(filter) == null) {
+        return false;
+      }
+    }
+
+    return true;
+  }
+
+  private boolean isDataFilter(Filter filter, Schema schema, Set<Integer> partitionFieldSourceIds) {
+    // TODO: handle dots correctly via v2references
+    Set<String> filterRefs = Sets.newHashSet(filter.references());
+    return filterRefs.stream().anyMatch(ref -> {
+      Types.NestedField field = schema.findField(ref);
+      ValidationException.check(field != null, "Cannot find field %s in schema", ref);
+      return !partitionFieldSourceIds.contains(field.fieldId());

Review comment:
       I think this only works for identity partition columns. A filter like `ts < 2020-11-10T11:15:03.712498` may reference a partition source field, `ts`, but it isn't safe to use for whole-file deletes because it doesn't align with the partition boundaries. The only time we know that a filter aligns with partition boundaries is when the partition transform is `identity`.
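
   To make that concrete, here is a small standalone sketch (plain Java; the `PartField` record is a hypothetical stand-in for Iceberg's `PartitionField`, not its real API) of collecting only the identity-transformed source ids, which is the set a safe `canDeleteWhere` would need to check filter references against:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class IdentityFilterCheck {
  // Hypothetical stand-in for Iceberg's PartitionField: a source column id
  // plus the name of the transform applied to it.
  public record PartField(int sourceId, String transform) {}

  // Collect only the source ids of identity-transformed partition fields;
  // these are the only columns on which a row filter is guaranteed to align
  // with partition boundaries.
  public static Set<Integer> identitySourceIds(List<PartField> fields) {
    return fields.stream()
        .filter(f -> "identity".equals(f.transform()))
        .map(PartField::sourceId)
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    // spec: identity(col 1), day(col 2)
    List<PartField> spec = List.of(new PartField(1, "identity"), new PartField(2, "day"));
    Set<Integer> safe = identitySourceIds(spec);
    System.out.println(safe.contains(1)); // filter on col 1 is safe
    System.out.println(safe.contains(2)); // filter on day-partitioned col 2 is not
  }
}
```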
   
   The check that we use in the `ManifestFilterManager` to delete files is that both `inclusiveProjection.eval` and `strictProjection.eval` accept a file. That is, the file can contain rows that match the filter (inclusive projection) and all of its rows are guaranteed to match the filter (strict projection). We will also delete a file when the file's metrics show that the filter is guaranteed to match all of its rows.
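
   A minimal standalone illustration of why both projections matter, using the predicate `ts < cutoff` on a simplified `day(ts)` partitioning over epoch seconds (the `inclusive`/`strict` helpers below are simplified stand-ins for Iceberg's `Projections`, not its real API):

```java
public class ProjectionDemo {
  static final long DAY = 86_400L; // seconds per day

  // Inclusive projection: the day partition MAY contain rows with ts < cutoff.
  public static boolean inclusive(long day, long cutoff) {
    return day * DAY < cutoff; // the partition starts before the cutoff
  }

  // Strict projection: EVERY row in the day partition has ts < cutoff.
  public static boolean strict(long day, long cutoff) {
    return (day + 1) * DAY <= cutoff; // the partition ends at or before the cutoff
  }

  public static void main(String[] args) {
    long cutoff = 5 * DAY + 40_000L; // a cutoff in the middle of day 5
    // day 4 ends before the cutoff: both projections accept,
    // so the whole file can be deleted
    System.out.println(inclusive(4, cutoff) && strict(4, cutoff)); // true
    // day 5 straddles the cutoff: inclusive accepts but strict rejects,
    // so the file may hold rows that must survive the delete
    System.out.println(inclusive(5, cutoff) && strict(5, cutoff)); // false
  }
}
```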



