larry98 commented on PR #43256:
URL: https://github.com/apache/arrow/pull/43256#issuecomment-2243549826

   > In a sense, sorting the `value_set` is part of the work that the `is_in` 
kernel should do. But during simplification you perform a bind that augments 
that `Expression::Call` with these extra fields
   > 
   > 
https://github.com/apache/arrow/blob/c3ebdf500e75ca868f50b7d374fc8ce2237756b8/cpp/src/arrow/compute/expression.h#L53-L59
   > 
   > You can encode the "pre-sorted" fact in the `kernel_state` here. Then you 
can skip checking again when simplifying if the `Expression::Call` value is 
already bound and with its kernel state initialized.
   > 
   > `SetLookupBase` is the base class for the kernel state of is_in and 
index_is_in. 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
   
   Ah I didn't realize that binding initializes the kernel state and constructs 
the lookup table. Turns out there was a "bug" in my proof of concept where I 
copy over the post-bind fields with a simplified value set instead of 
re-binding. But this happens to DTRT for parquet predicate pushdown since the 
simplified expression is simply thrown away after we check that it is 
satisfiable, and the original, unsimplified expression is actually used to 
filter the data. 
   
   To fix this, I think we'd also need to delay binding until the end of 
`SimplifyWithGuarantee` once all inequalities have been processed. That way, 
the total cost of re-binding the simplified expressions across all row groups 
is still linear in the value set size. 
   
   Otherwise, thanks for the suggestion! I like this approach, but just to 
check my understanding of how this would work:
   
   - We change `SetLookupState::Init` so that it takes the `SetLookupOptions&` 
argument without `const`. When initializing, we sort and remove duplicates from 
the value set and update the options in place. This effectively makes it so 
that binding an expression can mutate the input in a way that doesn't change 
semantics.
   - Inside `Simplify`, if we see an `is_in` call that is unbound, then we 
immediately fail simplification, since an unbound `is_in` is not guaranteed to 
have a sorted and unique value set. We cannot bind the expression on the spot 
since we don't have access to a schema.
   - Once we compute the simplified value set, we must avoid re-binding so that 
we don't re-sorted the value set (and re-construct the memo table). We only 
re-bind once after all inequalities in the guarantee have been processed. 
   
   @felipecrv Does this sound right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to