[GitHub] [iceberg] rdblue commented on a change in pull request #3600: Core: Handle metrics-based deletes

GitBox Tue, 30 Nov 2021 13:10:01 -0800


rdblue commented on a change in pull request #3600:
URL: https://github.com/apache/iceberg/pull/3600#discussion_r759661579




##########
File path: core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
##########
@@ -429,13 +425,44 @@ private ManifestFile filterManifestWithDeletedFiles(
     }
   }
 
-  private Evaluator strictDeleteEvaluator(PartitionSpec spec) {
-    Expression strictExpr = Projections.strict(spec).project(deleteExpression);
-    return new Evaluator(spec.partitionType(), strictExpr);
-  }
+  // an evaluator that checks whether rows in a file may/must match a given expression
+  // this class first partially evaluates the provided expression using the partition tuple
+  // and then checks the remaining part of the expression using metrics evaluators
+  private class ExpressionEvaluator {
+    private final Schema tableSchema;
+    private final ResidualEvaluator residualEvaluator;
+    private final StructLikeMap<Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator>> metricsEvaluators;
+
+    // TODO: support case sensitive flags
+    ExpressionEvaluator(Schema tableSchema, PartitionSpec spec, Expression expr) {
+      this.tableSchema = tableSchema;
+      this.residualEvaluator = ResidualEvaluator.of(spec, expr, true);
+      this.metricsEvaluators = StructLikeMap.create(spec.partitionType());
+    }
 
-  private Evaluator inclusiveDeleteEvaluator(PartitionSpec spec) {
-    Expression inclusiveExpr = Projections.inclusive(spec).project(deleteExpression);
-    return new Evaluator(spec.partitionType(), inclusiveExpr);
+    boolean rowsMightMatch(F file) {
+      Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> evaluators = metricsEvaluators(file);
+      InclusiveMetricsEvaluator inclusiveMetricsEvaluator = evaluators.first();
+      return inclusiveMetricsEvaluator.eval(file);
+    }
+
+    boolean rowsMustMatch(F file) {
+      Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> evaluators = metricsEvaluators(file);
+      StrictMetricsEvaluator strictMetricsEvaluator = evaluators.second();
+      return strictMetricsEvaluator.eval(file);
+    }
+
+    private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluators(F file) {
+      // this logic depends on ResidualEvaluator that behaves in the following way
+      // if strict projection returns true -> the pred would return true -> replace the pred with true
+      // if inclusive projection returns false -> the pred would return false -> replace the pred with false
+      // otherwise, keep the original predicate and try evaluating it using metrics

Review comment:
       I'm not sure that I agree with this comment. The residual evaluator does 
this for every predicate in the expression. It effectively removes any predicate 
that is determined by the strict evaluator and returns the part of the 
expression that needs to be evaluated for rows in the given partition.
   
   For example, the residual of `id = 5 AND ts > '2021-11-30T10:00:00'` for 
partition `(id_bucket=0, ts_day='2021-12-01')` is `id = 5` because every row in 
that partition matches `ts > '2021-11-30T10:00:00'`.
   
   I think the actual logic here is correct. It only uses the predicates that 
are not satisfied by the partition itself. Since the residual evaluator handles 
transform expressions and those are usually determined by the partition tuple, 
those are effectively removed and you can just use metrics. This is a really 
good insight.
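
   To illustrate the residual behavior described above, here is a toy, 
stdlib-only Java sketch. It is not Iceberg's `ResidualEvaluator`: the class 
`ResidualSketch`, the `Pred` interface, the `floorMod`-based bucket function, 
and the bucket count of 5 are all assumptions made for illustration (Iceberg's 
bucket transform hashes values instead). It only handles a flat list of 
AND-ed predicates.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of per-predicate residual evaluation. NOT Iceberg's implementation.
class ResidualSketch {
  enum Outcome { ALWAYS_TRUE, ALWAYS_FALSE, UNDETERMINED }

  // A predicate that tries to decide itself from the partition tuple
  // (id_bucket, ts_day) alone. Hypothetical interface.
  interface Pred {
    Outcome evalOnPartition(int idBucket, LocalDate tsDay);
    String text();
  }

  // Assumption: 5 buckets and a plain modulo "bucket" function.
  static final int NUM_BUCKETS = 5;

  static Pred idEquals(final long value) {
    return new Pred() {
      public Outcome evalOnPartition(int idBucket, LocalDate tsDay) {
        // A bucket can rule a value out, but it can never prove that
        // every row in the partition equals the value.
        boolean mayContain = Math.floorMod(value, (long) NUM_BUCKETS) == idBucket;
        return mayContain ? Outcome.UNDETERMINED : Outcome.ALWAYS_FALSE;
      }
      public String text() { return "id = " + value; }
    };
  }

  static Pred tsGreaterThan(final String isoBound) {
    final LocalDate boundDay = LocalDateTime.parse(isoBound).toLocalDate();
    return new Pred() {
      public Outcome evalOnPartition(int idBucket, LocalDate tsDay) {
        // Strict case: a later day means every ts in it exceeds the bound.
        if (tsDay.isAfter(boundDay)) return Outcome.ALWAYS_TRUE;
        // Inclusive case: an earlier day means no ts in it can exceed the bound.
        if (tsDay.isBefore(boundDay)) return Outcome.ALWAYS_FALSE;
        // Same day: the partition tuple alone cannot decide.
        return Outcome.UNDETERMINED;
      }
      public String text() { return "ts > '" + isoBound + "'"; }
    };
  }

  // Residual of an AND of predicates for one partition: drop predicates the
  // partition proves true, collapse to "false" if any is proven false, and
  // keep the rest for a later metrics or row-level check.
  static String residual(List<Pred> conjuncts, int idBucket, LocalDate tsDay) {
    List<String> kept = new ArrayList<>();
    for (Pred p : conjuncts) {
      Outcome outcome = p.evalOnPartition(idBucket, tsDay);
      if (outcome == Outcome.ALWAYS_FALSE) return "false";
      if (outcome == Outcome.UNDETERMINED) kept.add(p.text());
    }
    return kept.isEmpty() ? "true" : String.join(" AND ", kept);
  }

  public static void main(String[] args) {
    List<Pred> expr = Arrays.asList(idEquals(5), tsGreaterThan("2021-11-30T10:00:00"));
    // Partition (id_bucket=0, ts_day='2021-12-01'); prints: id = 5
    System.out.println(residual(expr, 0, LocalDate.parse("2021-12-01")));
  }
}
```

   In this sketch, only the predicates left in the residual would still need 
to be checked against file metrics, which is the point made above.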




