larry98 commented on code in PR #43256:
URL: https://github.com/apache/arrow/pull/43256#discussion_r1681323476


##########
cpp/src/arrow/compute/expression.cc:
##########
@@ -1242,33 +1242,103 @@ struct Inequality {
                             /*insert_implicit_casts=*/false, &exec_context);
   }
 
+  /// Simplify an is_in predicate against this inequality as a guarantee.
+  Result<Expression> SimplifyIsIn(Expression expr) {
+    const auto& guarantee = *this;
+    auto call = expr.call();
+    auto options = checked_pointer_cast<SetLookupOptions>(call->options);
+
+    auto value_set = options->value_set.make_array();
+    if (!value_set) return expr;
+    if (value_set->length() == 0) return literal(false);
+
+    if (!options->sorted_and_deduped) return expr;

Review Comment:
   If the value set isn't sorted, then we'd have to do a linear scan. For large 
value sets (the use case that I care about), this cost is prohibitive. Our goal 
after all is support fast parquet filtering on `is_in`.
   
   However, I could see this being useful for small value sets. Do you think we 
should add another option, either to `SetLookupOption` or 
`SimplifyWithGuarantee`, to enable simplifying `is_in` via linear scan? Or can 
we punt until someone has a use case and actually wants this feature?



##########
cpp/src/arrow/compute/expression.cc:
##########
@@ -1242,33 +1242,103 @@ struct Inequality {
                             /*insert_implicit_casts=*/false, &exec_context);
   }
 
+  /// Simplify an is_in predicate against this inequality as a guarantee.
+  Result<Expression> SimplifyIsIn(Expression expr) {
+    const auto& guarantee = *this;
+    auto call = expr.call();
+    auto options = checked_pointer_cast<SetLookupOptions>(call->options);
+
+    auto value_set = options->value_set.make_array();
+    if (!value_set) return expr;
+    if (value_set->length() == 0) return literal(false);
+
+    if (!options->sorted_and_deduped) return expr;
+
+    // For now, only simplify when the guarantee is non-nullable.
+    if (guarantee.nullable) return expr;
+
+    auto compare = [&value_set, &guarantee](size_t i) -> 
Result<Comparison::type> {
+      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Scalar> scalar, 
value_set->GetScalar(i));
+      // Nulls compare greater than any non-null value.
+      if (!scalar->is_valid) {
+        return Comparison::GREATER;
+      }

Review Comment:
   I commented in `SetLookupOptions::sorted_and_deduped` that nulls are 
required to be placed at the end. I don't particularly like the interface, 
though, so open to other suggestions on how to handle nulls in the value set.
   
   If the binary search gets `Comparison::NA`, I suppose we would fail 
simplification and immediately return `expr`? If we allow arbitrary null 
placement then this might be weird in that the simplification will sometimes 
succeed and sometimes fail, depending on whether the binary search happens to 
check a null value. But I'm fine with this approach if you think this is right.



##########
cpp/src/arrow/compute/expression.cc:
##########
@@ -1242,33 +1242,103 @@ struct Inequality {
                             /*insert_implicit_casts=*/false, &exec_context);
   }
 
+  /// Simplify an is_in predicate against this inequality as a guarantee.
+  Result<Expression> SimplifyIsIn(Expression expr) {
+    const auto& guarantee = *this;
+    auto call = expr.call();
+    auto options = checked_pointer_cast<SetLookupOptions>(call->options);
+
+    auto value_set = options->value_set.make_array();
+    if (!value_set) return expr;
+    if (value_set->length() == 0) return literal(false);
+
+    if (!options->sorted_and_deduped) return expr;
+
+    // For now, only simplify when the guarantee is non-nullable.

Review Comment:
   Suppose the value set is sorted such that nulls are placed at the end. Then 
simplifying on a `less` or `less_equal` predicate wouldn't work as slicing the 
front half of the value set array would drop the null values.
   
   To support this, I think we'd need the `is_in` predicate itself to "know" 
whether it contains a null separately from storing nulls in the value set, 
perhaps by keeping track using an additional flag in `SetLookupOptions`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to