felipecrv commented on PR #43256:
URL: https://github.com/apache/arrow/pull/43256#issuecomment-2241709072

   > `SimplifyWithGuarantee` is called twice per row group, so sorting or 
checking for sortedness on each call to `SimplifyWithGuarantee` would eliminate 
any performance improvements - the predicate pushdown statistics computations 
would outweigh any gains from decoding less data.
   > 
   > > This list could be sorted-and-deduped only once in the planning phase 
(the "user" code on top of arrow expression).
   
   In a sense, sorting the `value_set` is part of the work that the `is_in` 
kernel should do. But during simplification you perform a bind that augments 
that `Expression::Call` with these extra fields:
   
   
https://github.com/apache/arrow/blob/c3ebdf500e75ca868f50b7d374fc8ce2237756b8/cpp/src/arrow/compute/expression.h#L53-L59
   
   You can encode the "pre-sorted" fact in the `kernel_state` here. Then you 
can skip the check when simplifying if the `Expression::Call` is already 
bound and its kernel state is initialized.
   
   `SetLookupBase` is the base class for the kernel state of `is_in` and 
`index_in`.
   
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
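
As a rough illustration of the idea (not Arrow's actual API), the sketch below 
sorts and de-duplicates the value set once at bind time, stores the result in 
the kernel state, and lets later simplification calls reuse it. All names here 
(`KernelState`, `SortedSetState`, `Call`, `Bind`, `MaySatisfy`, `g_sort_count`) 
are hypothetical stand-ins, and the value set is simplified to a vector of 
`int`.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel state base class.
struct KernelState {
  virtual ~KernelState() = default;
};

// Hypothetical set-lookup state: owning this object means the value set
// has already been sorted and de-duplicated.
struct SortedSetState : KernelState {
  std::vector<int> sorted_values;
};

// Hypothetical stand-in for a bound call expression; only the fields
// relevant to this sketch are modeled.
struct Call {
  std::vector<int> value_set;                 // as supplied by the user
  std::shared_ptr<KernelState> kernel_state;  // null until bound
};

int g_sort_count = 0;  // counts how often the sort actually runs

// Sorts the value set once and records the result in the kernel state.
// A call that is already bound is left untouched.
void Bind(Call& call) {
  if (call.kernel_state) return;  // already bound: "pre-sorted" is known
  auto state = std::make_shared<SortedSetState>();
  state->sorted_values = call.value_set;
  std::sort(state->sorted_values.begin(), state->sorted_values.end());
  state->sorted_values.erase(
      std::unique(state->sorted_values.begin(), state->sorted_values.end()),
      state->sorted_values.end());
  ++g_sort_count;
  call.kernel_state = std::move(state);
}

// Simplification-style check against a [min, max] guarantee (for example,
// row-group statistics): returns true if some member of the value set can
// fall inside the range. Reuses the sorted state instead of re-sorting.
bool MaySatisfy(Call& call, int min, int max) {
  Bind(call);  // no-op after the first call
  assert(call.kernel_state != nullptr);
  const auto& values =
      static_cast<const SortedSetState&>(*call.kernel_state).sorted_values;
  auto it = std::lower_bound(values.begin(), values.end(), min);
  return it != values.end() && *it <= max;
}
```

With this shape, calling `MaySatisfy` repeatedly on the same `Call` (as would 
happen once or twice per row group) pays for the sort only on the first call; 
every later call is a binary search over the stored values.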


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.