bkietz commented on PR #43256:
URL: https://github.com/apache/arrow/pull/43256#issuecomment-2341909147

My main note when this PR is rebased is that I'd like to see benchmarks comparing the sort-and-slice simplification to the basic filtering one. My intuition is that we'll only prefer sort-and-slice for larger value sets, and it'd be best to measure the threshold.

> - It sounds like we're ok with storing the memoized sorted and unique value set in the kernel state

The issue is less about where state should be stored and more about the potential for race conditions when updating KernelState. I'm not opposed to adding memoizations to KernelState, but I don't think it will be necessary here.

> - Any thoughts on how to use ArrayStatistics for this optimization?

A user can set the flag explicitly, but mostly: after the first time a value set is simplified by this optimization, the optimization itself can set that flag (and then subsequent simplifications of the same value set will be able to skip sorting).

> - Any concerns with using the SimplificationContext unordered map to defer binding

This does add complexity and I'd prefer to avoid it if we can. One approach to reducing the cost of binding that I'd prefer is pattern matching against the guarantee to extract as much of it as is usable by SimplifyIsIn. For example, Parquet statistics frequently produce guarantee expressions like `(a > 3 and a < 9) or a is null`, which either filtering or sort-and-slice could use in one go.
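
To make the comparison between the two strategies concrete, here is a minimal sketch in Python on plain lists rather than Arrow arrays. The function names and the open-interval guarantee `lo < a < hi` are assumptions for illustration only; this is not the Arrow API or the PR's implementation, and null handling is left out.

```python
from bisect import bisect_left, bisect_right


def simplify_by_filtering(value_set, lo, hi):
    """Hypothetical helper: keep values v with lo < v < hi.

    Linear scan over the whole value set, so O(n) per guarantee,
    but no sorting is ever required.
    """
    return [v for v in value_set if lo < v < hi]


def simplify_by_sort_and_slice(sorted_unique, lo, hi):
    """Hypothetical helper: same result via two binary searches.

    Assumes sorted_unique is already sorted and deduplicated, so each
    guarantee costs O(log n) after a one-time O(n log n) sort.
    """
    start = bisect_right(sorted_unique, lo)  # first index with value > lo
    stop = bisect_left(sorted_unique, hi)    # first index with value >= hi
    return sorted_unique[start:stop]


value_set = [12, 4, 7, 1, 4, 9, 5]

# One-time cost that could be skipped if the value set is known to be sorted.
sorted_unique = sorted(set(value_set))

# Guarantee taken from the example expression: a > 3 and a < 9.
filtered = simplify_by_filtering(value_set, 3, 9)
sliced = simplify_by_sort_and_slice(sorted_unique, 3, 9)
```

Both calls keep the same distinct values. The sketch only shows why the crossover point depends on value set size and on how often the same value set is simplified: the sort is paid once, while the linear scan is paid for every guarantee. Where that crossover actually sits is what the benchmarks would need to measure.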
