larry98 commented on PR #43256: URL: https://github.com/apache/arrow/pull/43256#issuecomment-2233903221
Thanks for the review. To give some context, we are trying to optimize Parquet filtering on high-cardinality columns. You can think of this column as an ID on the row, so a typical use case might be to read 100K rows out of a file with 10MM rows. We express the 100K rows that we want to read as an `is_in` predicate.

> I think it would be better when we encapsulate all "is_in" and sorted checking within SimplifyWithGuarantee 🤔

`SimplifyWithGuarantee` is called twice per row group, so sorting or checking for sortedness on each call to `SimplifyWithGuarantee` would eliminate any performance improvements: the cost of the predicate-pushdown statistics computation would outweigh the gains from decoding less data.

> This list could be sorted-and-deduped only once in the planning phase (the "user" code on top of arrow expression).

I agree that having some sort of planning or preprocessing step is what we ultimately want; I'm just unsure what the right interface is. If we put the option inside `SetLookupOptions`, then it is up to the end user to do this. If we instead add an option to `SimplifyWithGuarantee`, then `ParquetFileFragment` could preprocess the user's filter expression to sort the `is_in` value sets and invoke `SimplifyWithGuarantee` with the option that turns on the optimization.
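To make the shape of that preprocessing step concrete, here is a minimal sketch (not this PR's implementation) using Arrow's existing C++ compute API: the value set is sorted and deduplicated once during planning, and only `SimplifyWithGuarantee` runs per row group. The helper names, the `id` column, and the schema parameter are hypothetical, and the option that would let `SimplifyWithGuarantee` exploit the sorted set is deliberately omitted, since its interface is exactly what is under discussion.

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/expression.h>

namespace cp = arrow::compute;

// Planning-time step: sort and dedupe the is_in value set exactly once.
arrow::Result<std::shared_ptr<arrow::Array>> SortAndDedupe(
    const std::shared_ptr<arrow::Array>& values) {
  // Unique() drops duplicates; SortIndices() + Take() puts the result in order.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> unique, cp::Unique(values));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> indices,
                        cp::SortIndices(*unique));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum sorted, cp::Take(unique, indices));
  return sorted.make_array();
}

// Per-row-group step: simplify the (already preprocessed) filter against the
// row group's statistics guarantee. Only this part repeats per row group.
arrow::Result<cp::Expression> SimplifyPerRowGroup(
    const std::shared_ptr<arrow::Array>& ids,
    const std::shared_ptr<arrow::Schema>& schema,
    const cp::Expression& row_group_guarantee) {
  ARROW_ASSIGN_OR_RAISE(auto sorted_ids, SortAndDedupe(ids));
  cp::Expression filter = cp::call(
      "is_in", {cp::field_ref("id")},
      cp::SetLookupOptions(/*value_set=*/sorted_ids, /*skip_nulls=*/true));
  // Expressions must be bound before they can be simplified.
  ARROW_ASSIGN_OR_RAISE(filter, filter.Bind(*schema));
  return cp::SimplifyWithGuarantee(filter, row_group_guarantee);
}
```

In this sketch the sort/dedupe sits with the caller (the "planning" code), which is roughly what the `ParquetFileFragment`-driven variant would look like; the alternative keeps the same split but moves the responsibility behind a flag on `SetLookupOptions`.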
