bkietz commented on PR #43256:
URL: https://github.com/apache/arrow/pull/43256#issuecomment-2341909147

   My main note for when this PR is rebased: I'd like to see benchmarks comparing 
the sort-and-slice simplification to the basic filtering one. My intuition is 
that we'll only prefer sort-and-slice for larger value sets, and it'd be best to 
measure the threshold.
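
   To make the comparison concrete, here is a rough Python sketch of the two 
strategies for a guarantee of the form `lo < a < hi`. It is not Arrow's C++ 
code; the function names are hypothetical and plain lists stand in for the 
value set.

```python
import bisect

def simplify_by_filtering(value_set, lo, hi):
    """Keep values strictly between lo and hi: one O(n) pass, no preprocessing."""
    return [v for v in value_set if lo < v < hi]

def simplify_by_sort_and_slice(sorted_unique, lo, hi):
    """Slice an already sorted, de-duplicated value set using two binary searches."""
    start = bisect.bisect_right(sorted_unique, lo)  # first index with value > lo
    stop = bisect.bisect_left(sorted_unique, hi)    # first index with value >= hi
    return sorted_unique[start:stop]

value_set = [12, 5, 1, 8, 5, 4, 3, 9]
sorted_unique = sorted(set(value_set))  # one-time O(n log n) cost

filtered = simplify_by_filtering(value_set, 3, 9)
sliced = simplify_by_sort_and_slice(sorted_unique, 3, 9)
```

   Filtering pays O(n) per simplification, while sort-and-slice pays the sort 
once and O(log n) afterwards, which is why the crossover point is worth 
measuring rather than guessing.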
   
   > - It sounds like we're ok with storing the memoized sorted and unique 
value set in the kernel state
   
   The issue is less about where state should be stored and more about the 
potential for race conditions when updating KernelState. I'm not opposed to 
adding memoizations to KernelState, but I don't think it will be necessary here.
   
   > - Any thoughts on how to use ArrayStatistics for this optimization?
   
   A user can set the flag explicitly, but mostly: after the first time a value 
set is simplified by this optimization, the optimization itself can set that 
flag (and then subsequent simplifications of the same value set will be able to 
skip sorting).
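
   A minimal Python sketch of that idea, assuming a hypothetical 
`is_sorted_unique` flag carried alongside the values (the class and function 
names are made up for illustration and are not Arrow API):

```python
from dataclasses import dataclass

@dataclass
class ValueSet:
    values: list
    is_sorted_unique: bool = False  # hypothetical stand-in for a statistics flag

sort_calls = 0  # counts how often the sort actually runs

def ensure_sorted_unique(vs):
    """Sort and de-duplicate only if the flag says it has not been done yet."""
    global sort_calls
    if not vs.is_sorted_unique:
        vs.values = sorted(set(vs.values))
        vs.is_sorted_unique = True  # set by the optimization itself after first use
        sort_calls += 1
    return vs.values

vs = ValueSet([9, 4, 4, 1])
first = ensure_sorted_unique(vs)   # sorts and sets the flag
second = ensure_sorted_unique(vs)  # flag already set, sorting is skipped
```

   This sketch is single-threaded; it does not address concurrent updates.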
   
   > - Any concerns with using the SimplificationContext unordered map to defer 
binding
   
   This does add complexity and I'd prefer to avoid it if we can. To reduce the 
cost of binding, I'd prefer pattern matching against the guarantee to extract 
as much of it as SimplifyIsIn can use. For example, Parquet statistics 
frequently produce guarantee expressions like `(a > 3 and a < 9) or a is null`, 
which either filtering or sort-and-slice could use in one go.
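
   As an illustration of that kind of pattern matching, here is a small Python 
sketch that models expressions as nested tuples and pulls the bounds out of a 
guarantee of this shape in a single pass. The tuple encoding and 
`extract_range` are assumptions for the example, not Arrow's Expression API.

```python
def extract_range(guarantee, field):
    """Match `(field > lo and field < hi) [or field is null]`.

    Returns (lo, hi, null_allowed), or None if the guarantee has another shape.
    """
    null_allowed = False
    if guarantee[0] == "or":
        _, left, right = guarantee
        if right != ("is_null", field):
            return None
        null_allowed = True
        guarantee = left
    if guarantee[0] != "and":
        return None
    _, lower, upper = guarantee
    if lower[:2] != (">", field) or upper[:2] != ("<", field):
        return None
    return lower[2], upper[2], null_allowed

# (a > 3 and a < 9) or a is null
guarantee = ("or", ("and", (">", "a", 3), ("<", "a", 9)), ("is_null", "a"))
bounds = extract_range(guarantee, "a")

# The extracted bounds can then drive either strategy; filtering shown here
# on non-null values only.
lo, hi, _ = bounds
simplified = [v for v in [1, 4, 8, 9] if lo < v < hi]
```

   Null handling for the value set itself is left out of this sketch.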

