larry98 commented on PR #43256: URL: https://github.com/apache/arrow/pull/43256#issuecomment-2243549826
> In a sense, sorting the `value_set` is part of the work that the `is_in` kernel should do. But during simplification you perform a bind that augments that `Expression::Call` with these extra fields > > https://github.com/apache/arrow/blob/c3ebdf500e75ca868f50b7d374fc8ce2237756b8/cpp/src/arrow/compute/expression.h#L53-L59 > > You can encode the "pre-sorted" fact in the `kernel_state` here. Then you can skip checking again when simplifying if the `Expression::Call` value is already bound and with its kernel state initialized. > > `SetLookupBase` is the base class for the kernel state of is_in and index_is_in. https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc Ah I didn't realize that binding initializes the kernel state and constructs the lookup table. Turns out there was a "bug" in my proof of concept where I copy over the post-bind fields with a simplified value set instead of re-binding. But this happens to DTRT for parquet predicate pushdown since the simplified expression is simply thrown away after we check that it is satisfiable, and the original, unsimplified expression is actually used to filter the data. To fix this, I think we'd also need to delay binding until the end of `SimplifyWithGuarantee` once all inequalities have been processed. That way, the total cost of re-binding the simplified expressions across all row groups is still linear in the value set size. Otherwise, thanks for the suggestion! I like this approach, but just to check my understanding of how this would work: - We change `SetLookupState::Init` so that it takes the `SetLookupOptions&` argument without `const`. When initializing, we sort and remove duplicates from the value set and update the options in place. This effectively makes it so that binding an expression can mutate the input in a way that doesn't change semantics. - Inside `Simplify`, if we see an `is_in` call that is unbound, then we immediately fail simplification, since an unbound `is_in` is not guaranteed to have a sorted and unique value set. We cannot bind the expression on the spot since we don't have access to a schema. - Once we compute the simplified value set, we must avoid re-binding so that we don't re-sorted the value set (and re-construct the memo table). We only re-bind once after all inequalities in the guarantee have been processed. @felipecrv Does this sound right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
