[GitHub] [iceberg] Fokko commented on a diff in pull request #6892: Python: `starts_with` and `not_starts_with` expressions

via GitHub Mon, 27 Feb 2023 06:08:22 -0800


Fokko commented on code in PR #6892:
URL: https://github.com/apache/iceberg/pull/6892#discussion_r1118743477



##########
python/pyiceberg/expressions/parser.py:
##########
@@ -207,7 +210,22 @@ def _(result: ParseResults) -> BooleanExpression:
     return NotIn(result.column, result.literal_set)
 
 
-predicate = (comparison | in_check | null_check | nan_check | boolean).set_results_name("predicate")
+starts_with = column + STARTS_WITH + string

Review Comment:
   ```suggestion
   starts_with = column + LIKE + string
   ```



##########
python/pyiceberg/expressions/parser.py:
##########
@@ -71,6 +73,7 @@
 IN = CaselessKeyword("in")
 NULL = CaselessKeyword("null")
 NAN = CaselessKeyword("nan")
+STARTS_WITH = CaselessKeyword("like")

Review Comment:
   ```suggestion
   LIKE = CaselessKeyword("like")
   ```



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -943,6 +1021,12 @@ def visit_less_than(self, term: BoundTerm[L], literal: Literal[L]) -> List[Tuple
     def visit_less_than_or_equal(self, term: BoundTerm[L], literal: Literal[L]) -> List[Tuple[str, str, Any]]:
         return [(term.ref().field.name, "<=", self._cast_if_necessary(term.ref().field.field_type, literal.value))]
 
+    def visit_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> List[Tuple[str, str, Any]]:
+        return []
+
+    def visit_not_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> List[Tuple[str, str, Any]]:
+        return [(term.ref().field.name, "not starts_with", self._cast_if_necessary(term.ref().field.field_type, literal.value))]

Review Comment:
   This operation isn't available in the DNF logic; I think we should omit it for now.
   ```suggestion
           return []
   ```



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -675,6 +702,57 @@ def visit_less_than_or_equal(self, term: BoundTerm[L], literal: Literal[L]) -> b
 
         return ROWS_MIGHT_MATCH
 
+    def visit_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        # contains_null encodes whether at least one partition value is null,
+        # lowerBound is null if all partition values are null
+        all_null = field.contains_null is True and field.lower_bound is None
+        if all_null or not field.lower_bound or not field.upper_bound:
+            return ROWS_CANNOT_MATCH
+
+        lower = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        # truncate lower bound so that its length is not greater than the length of prefix
+        if lower and lower[: len(str(prefix))] > prefix:
+            return ROWS_CANNOT_MATCH
+
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
+        # truncate upper bound so that its length is not greater than the length of prefix
+        if upper and upper[: len(prefix)] < prefix:
+            return ROWS_CANNOT_MATCH
+
+        return ROWS_MIGHT_MATCH
+
+    def visit_not_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        if field.contains_null or not field.lower_bound or not field.upper_bound:
+            return ROWS_MIGHT_MATCH
+
+        # not_starts_with will match unless all values must start with the prefix. This happens when
+        # the lower and upper bounds both start with the prefix.
+        lower = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
+
+        if lower and upper:

Review Comment:
   ```suggestion
           if lower is not None and upper is not None:
   ```
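
The distinction matters because an empty string is a legitimate bound yet is falsy in Python. A minimal illustration, using plain strings in place of the decoded bounds (both helper names are hypothetical and not part of the PR):

```python
from typing import Optional


def bounds_present_truthy(lower: Optional[str], upper: Optional[str]) -> bool:
    # Truthiness check, as in the original diff: treats "" like a missing bound.
    return bool(lower and upper)


def bounds_present(lower: Optional[str], upper: Optional[str]) -> bool:
    # Explicit None check, as in the suggestion: "" counts as a real bound.
    return lower is not None and upper is not None
```

With `lower=""` and `upper="abc"`, the truthiness check reports the bounds as absent, while the explicit check accepts them.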



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -675,6 +702,57 @@ def visit_less_than_or_equal(self, term: BoundTerm[L], literal: Literal[L]) -> b
 
         return ROWS_MIGHT_MATCH
 
+    def visit_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        # contains_null encodes whether at least one partition value is null,
+        # lowerBound is null if all partition values are null
+        all_null = field.contains_null is True and field.lower_bound is None
+        if all_null or not field.lower_bound or not field.upper_bound:
+            return ROWS_CANNOT_MATCH
+
+        lower = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        # truncate lower bound so that its length is not greater than the length of prefix
+        if lower and lower[: len(str(prefix))] > prefix:

Review Comment:
   ```suggestion
           if lower is not None and lower[: len(str(prefix))] > prefix:
   ```
   
   Should we put `len(str(prefix))` in a variable? `prefix_len`?
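
A rough sketch of how the bounds comparison could read with the length hoisted into `prefix_len`. It assumes both bounds are already decoded to non-null strings; the function name and the boolean constants are stand-ins for illustration, not the PR's code:

```python
ROWS_MIGHT_MATCH = True
ROWS_CANNOT_MATCH = False


def starts_with_bounds_check(lower: str, upper: str, prefix: str) -> bool:
    # Compute the prefix length once and reuse it for both truncations.
    prefix_len = len(prefix)

    # If the truncated lower bound sorts after the prefix, no value can start with it.
    if lower[:prefix_len] > prefix:
        return ROWS_CANNOT_MATCH

    # If the truncated upper bound sorts before the prefix, no value can start with it.
    if upper[:prefix_len] < prefix:
        return ROWS_CANNOT_MATCH

    return ROWS_MIGHT_MATCH
```

Since `prefix` is already a `str` in the diff, the extra `str(...)` inside `len(str(prefix))` is not needed once the variable exists.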



##########
python/pyiceberg/expressions/parser.py:
##########
@@ -207,7 +210,22 @@ def _(result: ParseResults) -> BooleanExpression:
     return NotIn(result.column, result.literal_set)
 
 
-predicate = (comparison | in_check | null_check | nan_check | boolean).set_results_name("predicate")
+starts_with = column + STARTS_WITH + string
+not_starts_with = column + NOT + STARTS_WITH + string

Review Comment:
   ```suggestion
   not_starts_with = column + NOT + LIKE + string
   ```



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -675,6 +702,57 @@ def visit_less_than_or_equal(self, term: BoundTerm[L], literal: Literal[L]) -> b
 
         return ROWS_MIGHT_MATCH
 
+    def visit_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        # contains_null encodes whether at least one partition value is null,
+        # lowerBound is null if all partition values are null
+        all_null = field.contains_null is True and field.lower_bound is None

Review Comment:
   Should we keep this in line with Java?
   ```java
     if (fieldStats.lowerBound() == null) {
       return ROWS_CANNOT_MATCH; // values are all null and literal cannot contain null
     }
   ```
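
One way the Java check might translate to Python, as a sketch only. `FieldStats` is a hypothetical stand-in for the partition field summary, and the helper name is illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldStats:
    # Hypothetical stand-in for the partition field summary used in the diff.
    contains_null: bool
    lower_bound: Optional[bytes]
    upper_bound: Optional[bytes]


def all_values_null(field: FieldStats) -> bool:
    # Mirrors the Java check: a missing lower bound means every value is null.
    # `is None` is used rather than `not field.lower_bound`, so that an empty
    # bytes value is still treated as a present bound.
    return field.lower_bound is None
```

Under this reading, `contains_null` does not need to be consulted for `starts_with`, because a null lower bound already implies that all values are null.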



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -675,6 +702,57 @@ def visit_less_than_or_equal(self, term: BoundTerm[L], literal: Literal[L]) -> b
 
         return ROWS_MIGHT_MATCH
 
+    def visit_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        # contains_null encodes whether at least one partition value is null,
+        # lowerBound is null if all partition values are null
+        all_null = field.contains_null is True and field.lower_bound is None
+        if all_null or not field.lower_bound or not field.upper_bound:
+            return ROWS_CANNOT_MATCH
+
+        lower = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        # truncate lower bound so that its length is not greater than the length of prefix
+        if lower and lower[: len(str(prefix))] > prefix:
+            return ROWS_CANNOT_MATCH
+
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
+        # truncate upper bound so that its length is not greater than the length of prefix
+        if upper and upper[: len(prefix)] < prefix:
+            return ROWS_CANNOT_MATCH
+
+        return ROWS_MIGHT_MATCH
+
+    def visit_not_starts_with(self, term: BoundTerm[L], literal: Literal[L]) -> bool:
+        pos = term.ref().accessor.position
+        field = self.partition_fields[pos]
+        prefix = str(literal.value)
+
+        if field.contains_null or not field.lower_bound or not field.upper_bound:
+            return ROWS_MIGHT_MATCH
+
+        # not_starts_with will match unless all values must start with the prefix. This happens when
+        # the lower and upper bounds both start with the prefix.
+        lower = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
+
+        if lower and upper:
+            # if lower is shorter than the prefix then lower doesn't start with the prefix
+            if len(lower) < len(prefix):

Review Comment:
   Same here: should we put `len(prefix)` in a variable? It looks like we're using it multiple times.
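
For illustration, the rule stated in the diff's comment (the file can be skipped only when both bounds start with the prefix) could be written with a single `prefix_len`. This is a sketch assuming non-null, already-decoded string bounds; the function name and constants are stand-ins, and the quoted hunk ends before the PR's own version of this logic:

```python
ROWS_MIGHT_MATCH = True
ROWS_CANNOT_MATCH = False


def not_starts_with_bounds_check(lower: str, upper: str, prefix: str) -> bool:
    prefix_len = len(prefix)

    # A lower bound shorter than the prefix cannot start with the prefix.
    if len(lower) < prefix_len:
        return ROWS_MIGHT_MATCH

    if lower[:prefix_len] == prefix:
        # An upper bound shorter than the prefix cannot start with the prefix.
        if len(upper) < prefix_len:
            return ROWS_MIGHT_MATCH
        # Both bounds start with the prefix, so every value in between does too.
        if upper[:prefix_len] == prefix:
            return ROWS_CANNOT_MATCH

    return ROWS_MIGHT_MATCH
```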



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


