felipecrv commented on PR #43256: URL: https://github.com/apache/arrow/pull/43256#issuecomment-2241709072
> `SimplifyWithGuarantee` is called twice per row group, so sorting or checking for sortedness on each call to `SimplifyWithGuarantee` would eliminate any performance improvements - the predicate pushdown statistics computations would outweigh any gains from decoding less data.
>
> > This list could be sorted-and-deduped only once in the planning phase (the "user" code on top of arrow expression).

In a sense, sorting the `value_set` is part of the work that the `is_in` kernel should do. But during simplification you perform a bind that augments that `Expression::Call` with these extra fields:

https://github.com/apache/arrow/blob/c3ebdf500e75ca868f50b7d374fc8ce2237756b8/cpp/src/arrow/compute/expression.h#L53-L59

You can encode the "pre-sorted" fact in the `kernel_state` here. Then, when simplifying, you can skip the check whenever the `Expression::Call` is already bound and its kernel state is initialized.

`SetLookupBase` is the base class for the kernel state of `is_in` and `index_in`:

https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
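
A rough sketch of the idea, using stand-in types rather than the real Arrow classes. `SetLookupStateSketch`, `CallSketch`, `RangeGuarantee`, `Bind` and `SimplifyIsIn` are all hypothetical names invented for illustration, and the min/max guarantee is an assumed simplification of what row-group statistics provide. The point is only that the sort-and-dedupe cost is paid once at bind time and the "pre-sorted" fact travels with the kernel state:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the kernel state of a set-lookup function.
struct SetLookupStateSketch {
  std::vector<int64_t> value_set;  // sorted and deduplicated at bind time
  bool value_set_sorted = false;   // the "pre-sorted" fact
};

// Hypothetical stand-in for a call expression; kernel_state is null until bound.
struct CallSketch {
  std::vector<int64_t> value_set;  // as written by the user, in any order
  std::shared_ptr<SetLookupStateSketch> kernel_state;
};

// Assumed shape of a guarantee derived from statistics: min <= x <= max.
struct RangeGuarantee {
  int64_t min;
  int64_t max;
};

// Bind once: pay the O(n log n) sort-and-dedupe here and record that it happened.
void Bind(CallSketch& call) {
  auto state = std::make_shared<SetLookupStateSketch>();
  state->value_set = call.value_set;
  std::sort(state->value_set.begin(), state->value_set.end());
  state->value_set.erase(
      std::unique(state->value_set.begin(), state->value_set.end()),
      state->value_set.end());
  // Debug-only sanity check; compiled out when NDEBUG is defined.
  assert(std::is_sorted(state->value_set.begin(), state->value_set.end()));
  state->value_set_sorted = true;
  call.kernel_state = std::move(state);
}

// Simplify many times: return the values that can still match under the guarantee.
std::vector<int64_t> SimplifyIsIn(const CallSketch& call, const RangeGuarantee& g) {
  if (call.kernel_state && call.kernel_state->value_set_sorted) {
    // Fast path: no sorting and no sortedness check, just two binary searches.
    const auto& v = call.kernel_state->value_set;
    auto lo = std::lower_bound(v.begin(), v.end(), g.min);
    auto hi = std::upper_bound(lo, v.end(), g.max);
    return std::vector<int64_t>(lo, hi);
  }
  // Unbound call: fall back to a linear filter that preserves the input order.
  std::vector<int64_t> kept;
  std::copy_if(call.value_set.begin(), call.value_set.end(),
               std::back_inserter(kept),
               [&g](int64_t x) { return g.min <= x && x <= g.max; });
  return kept;
}
```

With this shape, calling `Bind` once in the planning phase and then `SimplifyIsIn` once per row group costs two binary searches plus a copy of the surviving values on each call, while an unbound call still simplifies correctly through the linear fallback.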
