kbendick commented on a change in pull request #2062:
URL: https://github.com/apache/iceberg/pull/2062#discussion_r554674753



##########
File path: api/src/main/java/org/apache/iceberg/expressions/BoundLiteralPredicate.java
##########
@@ -91,6 +93,8 @@ public String toString() {
         return term() + " != " + literal;
       case STARTS_WITH:
         return term() + " startsWith \"" + literal + "\"";
+      case NOT_STARTS_WITH:
+        return term() + " notStartsWith \"" + literal + "\"";

Review comment:
       Again, I chose to go with the existing style of the code, but it does 
seem that the string representation of these predicates has too many quotes.
   
   Here's an example: `ref(name="data") startsWith ""ab""`. I would expect that 
to be `ref(name="data") startsWith "ab"`.
   
   It gets even weirder when using SQL, as I saw Spark showing `data NOT LIKE 
'"b"%'` in the explain output, though the predicate did properly evaluate to 
the expected `data NOT LIKE 'b%'`.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +329,36 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    @Override
+    public <T> Boolean notStartsWith(BoundReference<T> ref, Literal<T> lit) {
+      int pos = Accessors.toPosition(ref.accessor());
+      PartitionFieldSummary fieldStats = stats.get(pos);
+
+      // values are all null (and stats exist) and literal cannot contain null
+      if (fieldStats.lowerBound() == null) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      ByteBuffer prefixAsBytes = lit.toByteBuffer();
+
+      Comparator<ByteBuffer> comparator = Comparators.unsignedBytes();
+
+      ByteBuffer lower = fieldStats.lowerBound();
+      // truncate lower bound so that its length in bytes is not greater than the length of prefix
+      int lowerLength = Math.min(prefixAsBytes.remaining(), lower.remaining());
+      int lowerCmp = comparator.compare(BinaryUtil.truncateBinary(lower, lowerLength), prefixAsBytes);
+      if (lowerCmp == 0) {
+        ByteBuffer upper = fieldStats.upperBound();
+        // truncate upper bound so that its length in bytes is not greater than the length of prefix
+        int upperLength = Math.min(prefixAsBytes.remaining(), upper.remaining());
+        int upperCmp = comparator.compare(BinaryUtil.truncateBinary(upper, upperLength), prefixAsBytes);
+        if (upperCmp == 0) {
+          return ROWS_CANNOT_MATCH;
+        }
+      }

Review comment:
      I had originally written a utility function, `BinaryUtils.startsWith(ByteBuffer value, ByteBuffer prefix, Comparator<ByteBuffer> cmp)`. However, a number of assumptions that are valid once we've gotten to the evaluators don't necessarily hold in the general case, e.g. that the ByteBuffers are at `position` 0 (which should be the case here). So I decided to inline the definition in the handful of places it's needed, as it's not too complicated.
   
   If we think the utility function would be cleaner, I'd be happy to add it 
back in and also update STARTS_WITH to use it as well (it's inlined in the same 
manner, with different logic on the comparison result).
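
   For illustration, a minimal standalone sketch of what such a utility could look like (the class name `BinaryPrefix` is hypothetical, and a plain unsigned-byte loop stands in for passing Iceberg's `Comparators.unsignedBytes()`):

```java
import java.nio.ByteBuffer;

public class BinaryPrefix {
  // Sketch: true if value begins with prefix, comparing raw bytes as unsigned.
  // duplicate() keeps the caller's buffer positions untouched, which removes
  // the position-0 assumption discussed above.
  public static boolean startsWith(ByteBuffer value, ByteBuffer prefix) {
    ByteBuffer v = value.duplicate();
    ByteBuffer p = prefix.duplicate();
    if (v.remaining() < p.remaining()) {
      return false;  // value is shorter than the prefix
    }
    while (p.hasRemaining()) {
      if (Byte.toUnsignedInt(v.get()) != Byte.toUnsignedInt(p.get())) {
        return false;
      }
    }
    return true;
  }
}
```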

##########
File path: api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
##########
@@ -529,6 +533,47 @@ public void testStringStartsWith() {
     Assert.assertFalse("Should not read: range doesn't match", shouldRead);
   }
 
+  @Test
+  // More interesting test cases are included in TestNotStartsWith
+  public void testStringNotStartsWith() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "a"), true).eval(FILE);
+    Assert.assertTrue("Should read: no stats", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "a"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "aa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "aaa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "1s"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "1str1x"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "ff"), true).eval(FILE_4);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "aB"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "dWX"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "5"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", "3str3x"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    String aboveMax = UnicodeUtil.truncateStringMax(Literal.of("イロハニホヘト"), 
4).value().toString();
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
notStartsWith("required", aboveMax), true).eval(FILE_4);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+  }

Review comment:
      This is similar to the existing `testStringStartsWith` function above. However, the metrics included for the given files make this test rather uninteresting.
   
   I've seen elsewhere in the codebase that we leave the test function in with an empty body and junit's `@Ignore`, with a comment directing to where this is better tested - currently the newly added `TestNotStartsWith`, which has files where `notStartsWith` will not be a match. Possibly we should do that here?

##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -579,6 +579,8 @@ public String or(String leftResult, String rightResult) {
           return pred.ref().name() + " != " + sqlString(pred.literal());
         case STARTS_WITH:
           return pred.ref().name() + " LIKE '" + pred.literal() + "%'";
+        case NOT_STARTS_WITH:
+          return pred.ref().name() + " NOT LIKE '" + pred.literal() + "%'";

Review comment:
       Here's another case where we possibly wind up with too many quotes. I 
can follow up in another PR. I chose to go with the existing format.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/BoundLiteralPredicate.java
##########
@@ -91,6 +93,8 @@ public String toString() {
         return term() + " != " + literal;
       case STARTS_WITH:
         return term() + " startsWith \"" + literal + "\"";
+      case NOT_STARTS_WITH:
+        return term() + " notStartsWith \"" + literal + "\"";

Review comment:
       I opened an issue for this as well, 
https://github.com/apache/iceberg/issues/2059. I can change both STARTS_WITH 
and NOT_STARTS_WITH if we think that there are in fact too many quotes in the 
string representation.

##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestStartsWith.java
##########
@@ -67,8 +67,8 @@ public void testTruncateString() {
     UnboundPredicate<String> projected = trunc.project(COLUMN, boundExpr);
     Evaluator evaluator = new Evaluator(SCHEMA.asStruct(), projected);
 
-    Assert.assertTrue("startsWith(abcde, truncate(abcde,2))  => true",
-        evaluator.eval(TestHelpers.Row.of("abcde")));
+    Assert.assertTrue("startsWith(abcde, truncate(abcdg,2))  => true",
+        evaluator.eval(TestHelpers.Row.of("abcdg")));

Review comment:
      I updated this test away from `abcde`, as it seems much more revealing to test that we're going to read a partition whose value does not actually start with the given string, because of the truncation width used in the partition spec.
   
   I'd be happy to move this to another PR given that it's relatively unrelated, or to remove it if anybody feels strongly about it. It does seem to me to be a much more informative test case than `abcde startsWith abcde`, as that would return true under an identity partition spec as well.

##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -372,6 +372,38 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return ROWS_MIGHT_MATCH;
     }
 
+    @Override
+    public <T> Boolean notStartsWith(BoundReference<T> ref, Literal<T> lit) {
+      Integer id = ref.fieldId();
+
+      if (containsNullsOnly(id)) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      ByteBuffer prefixAsBytes = lit.toByteBuffer();
+
+      Comparator<ByteBuffer> comparator = Comparators.unsignedBytes();
+
+      if (lowerBounds != null && upperBounds != null &&
+              lowerBounds.containsKey(id) && upperBounds.containsKey(id)) {
+        ByteBuffer lower = lowerBounds.get(id);

Review comment:
      I did not add a `null` check on `lower`, as I didn't see one in any of the other metrics-based checks. But I thought I'd bring it up, as this will definitely NPE if it's possible for the column to be in the `upperBounds` and `lowerBounds` maps with a null value. It seems like that shouldn't happen, though.
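
   To illustrate the concern: `containsKey` does not rule out a null value, since a `HashMap` may store a null mapping, so a subsequent method call on the looked-up buffer could still NPE. A minimal standalone demonstration (plain `String` values stand in for the bounds `ByteBuffer`s, and the class name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class NullMapping {
  public static boolean hasNullMapping() {
    Map<Integer, String> lowerBounds = new HashMap<>();
    lowerBounds.put(1, null);  // a present key whose value is null
    // containsKey is true, yet get() returns null, so calling a method on
    // the result without a null check would throw NullPointerException
    return lowerBounds.containsKey(1) && lowerBounds.get(1) == null;
  }
}
```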

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
##########
@@ -373,6 +373,25 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return ROWS_CANNOT_MATCH;
     }
 
+    @Override
+    public <T> Boolean notStartsWith(BoundReference<T> ref, Literal<T> lit) {
+      int id = ref.fieldId();
+
+      Boolean hasNonDictPage = isFallback.get(id);
+      if (hasNonDictPage == null || hasNonDictPage) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      Set<T> dictionary = dict(id, lit.comparator());
+      for (T item : dictionary) {
+        if (!item.toString().startsWith(lit.value().toString())) {
+          return ROWS_MIGHT_MATCH;
+        }

Review comment:
      Here's one more case where we're using `.toString`, and I wonder if we should be using one of the built-in Literal comparators instead.
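
   As a sketch of the comparator-based alternative (the helper name is hypothetical, and `CharSequence::compare` stands in for whatever comparator `lit.comparator()` would supply):

```java
import java.util.Comparator;

public class DictPrefix {
  // Sketch: prefix test expressed as a comparator on a truncated candidate,
  // rather than item.toString().startsWith(...)
  public static boolean startsWith(CharSequence item, CharSequence prefix,
                                   Comparator<CharSequence> cmp) {
    if (item.length() < prefix.length()) {
      return false;  // shorter than the prefix, cannot match
    }
    // compare only the leading prefix.length() characters of the candidate
    return cmp.compare(item.subSequence(0, prefix.length()), prefix) == 0;
  }
}
```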

##########
File path: api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java
##########
@@ -214,6 +214,11 @@ public Expression or(Expression leftResult, Expression rightResult) {
       return ((String) ref.eval(struct)).startsWith((String) lit.value()) ? alwaysTrue() : alwaysFalse();
     }
 
+    @Override
+    public <T> Expression notStartsWith(BoundReference<T> ref, Literal<T> lit) {
+      return ((String) ref.eval(struct)).startsWith((String) lit.value()) ? alwaysFalse() : alwaysTrue();
+    }

Review comment:
      This is another case where I chose to go with the existing style of the surrounding code (e.g. the `startsWith` definition just above this), but I'm not so sure that casting to the Java `String` type is appropriate vs using the built-in comparators.
   
   Comparing the values as ByteBuffers seems safer to me, given Java's UTF-16 default and the fact that the included Literals comparator handles UTF-8 code point differences for characters encoded as 3 vs 4 bytes, but I chose to be consistent and emulate the surrounding code.
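
   A rough sketch of what the byte-level comparison could look like, operating on the UTF-8 encoding of both strings instead of Java's UTF-16 `char`s (this is an illustration with a hypothetical class name, not the Iceberg comparator):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Prefix {
  // Sketch: prefix check on the UTF-8 encoding of both strings, sidestepping
  // UTF-16 surrogate-pair concerns by comparing encoded bytes directly
  public static boolean startsWith(String value, String prefix) {
    byte[] v = value.getBytes(StandardCharsets.UTF_8);
    byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
    if (v.length < p.length) {
      return false;  // encoded value is shorter than the encoded prefix
    }
    return Arrays.equals(Arrays.copyOf(v, p.length), p);
  }
}
```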
   
   

##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestTruncate.java
##########
@@ -75,6 +75,8 @@ public void testTruncateString() {
         "abcde", trunc.apply("abcdefg"));
     Assert.assertEquals("Should not pad strings shorter than length",
         "abc", trunc.apply("abc"));
+    Assert.assertEquals("Should not alter strings equal to length",
+        "abcde", trunc.apply("abcde"));

Review comment:
      Not exactly related, so I'm happy to remove this if anybody objects (or to move it into its own PR).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


