larry98 commented on PR #43256:
URL: https://github.com/apache/arrow/pull/43256#issuecomment-2233903221

   Thanks for the review.
   
   To give some context, we are trying to optimize Parquet filtering on 
high-cardinality columns. You can think of the column as a per-row ID, so a 
typical use case might be reading 100K rows out of a file with 10MM rows. We 
express the 100K rows we want to read as an `is_in` predicate.
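
   To make the use case concrete, here is a minimal sketch of building such a 
predicate with the Arrow C++ dataset API. The column name `id`, the `dataset`, 
and the `ids` array of selected row IDs are assumptions for illustration, not 
part of this PR:

   ```cpp
   #include <arrow/api.h>
   #include <arrow/compute/api.h>
   #include <arrow/dataset/api.h>

   namespace cp = arrow::compute;
   namespace ds = arrow::dataset;

   // Read only the rows whose `id` appears in `ids` (~100K values) from a
   // dataset with ~10MM rows, expressing the selection as an is_in filter
   // that the Parquet scanner can push down.
   arrow::Result<std::shared_ptr<arrow::Table>> ReadSelectedRows(
       const std::shared_ptr<ds::Dataset>& dataset,
       const std::shared_ptr<arrow::Array>& ids) {
     // is_in(field("id"), value_set = ids)
     cp::Expression filter = cp::call(
         "is_in", {cp::field_ref("id")},
         std::make_shared<cp::SetLookupOptions>(ids, /*skip_nulls=*/true));

     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
     ARROW_RETURN_NOT_OK(builder->Filter(filter));
     ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
     return scanner->ToTable();
   }
   ```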
   
   > I think it would be better when we encapsulate all "is_in" and sorted 
checking within SimplifyWithGuarantee 🤔
   
   `SimplifyWithGuarantee` is called twice per row group, so sorting the value 
set, or checking it for sortedness, on every call would wipe out any 
performance improvement: the cost of the predicate-pushdown statistics 
computation would outweigh the gains from decoding less data.
   
   > This list could be sorted-and-deduped only once in the planning phase (the 
"user" code on top of arrow expression).
   
   I agree that some kind of planning or preprocessing step is what we 
ultimately want; I'm just unsure what the right interface is. If we put the 
option inside `SetLookupOptions`, then it is up to the end user to do this. If 
we instead add an option to `SimplifyWithGuarantee`, then `ParquetFileFragment` 
could preprocess the user's filter expression to sort the `is_in` value sets 
and invoke `SimplifyWithGuarantee` with the optimization enabled, along the 
lines of the sketch below.
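
   For illustration only, a hypothetical planning-time helper (not part of this 
PR) that sorts and dedupes an `is_in` value set once, using existing compute 
kernels, before the filter is handed to the fragment:

   ```cpp
   #include <arrow/api.h>
   #include <arrow/compute/api.h>

   namespace cp = arrow::compute;

   // Hypothetical preprocessing step: deduplicate and sort an is_in value set
   // once, so that a sortedness-based fast path in SimplifyWithGuarantee could
   // rely on the values being in ascending order.
   arrow::Result<std::shared_ptr<arrow::Array>> SortAndDedupeValueSet(
       const std::shared_ptr<arrow::Array>& value_set) {
     // Drop duplicates first, then sort the remaining values ascending.
     ARROW_ASSIGN_OR_RAISE(auto unique, cp::Unique(value_set));
     ARROW_ASSIGN_OR_RAISE(auto indices,
                           cp::SortIndices(*unique, cp::SortOrder::Ascending));
     ARROW_ASSIGN_OR_RAISE(arrow::Datum sorted, cp::Take(unique, indices));
     return sorted.make_array();
   }
   ```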

