adamfaulkner-at opened a new issue, #13268: URL: https://github.com/apache/datafusion/issues/13268
### Describe the bug

When running a query like

```
SELECT * FROM table WHERE RANDOM() < 0.1;
```

I get different results depending on the value of `"datafusion.execution.parquet.pushdown_filters"`. When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.

I suspect I'm seeing these results because pushdown with `TableProviderFilterPushDown::Inexact` is applying this filter at both the parquet level and a `FilterExec: random() <= 0.1`. This results in the `RANDOM()` filter being evaluated twice, which causes fewer rows to be sampled.

### To Reproduce

This can be reproduced with `datafusion-cli` version 42.2.0:

Without `pushdown_filters`

```
> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 605572   |
+----------+
1 row(s) fetched.
Elapsed 0.043 seconds.
```

With `pushdown_filters` (note that you must re-create the table with the updated setting):

```
> set datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.002 seconds.

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
0 row(s) fetched.
Elapsed 0.007 seconds.

> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 60152    |
+----------+
1 row(s) fetched.
Elapsed 0.045 seconds.
```

### Expected behavior

I would expect that a filter on `RANDOM()` would be applied only once, so that `RANDOM() < 0.1` means that only 10% of all rows will be sampled.

It would be acceptable if `RANDOM()` was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.
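
The 10% versus 1% discrepancy reported above is what two independent evaluations of `RANDOM() < 0.1` would produce, since 0.1 × 0.1 = 0.01. The following is a minimal, std-only sketch of that arithmetic, not DataFusion code: the `Rng` type (a small xorshift64* generator) and `surviving_fraction` are hypothetical names introduced only for illustration.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// (Illustrative only; the seed must be non-zero.)
struct Rng(u64);

impl Rng {
    /// Returns a pseudo-random f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545F4914F6CDD1D);
        // Keep the top 53 bits, which is the precision of an f64 mantissa.
        (bits >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Fraction of `rows` that survive `passes` independent `random() < p` filters.
fn surviving_fraction(rows: u64, p: f64, passes: u32, seed: u64) -> f64 {
    let mut rng = Rng(seed);
    let mut kept = 0u64;
    for _ in 0..rows {
        // A row is kept only if every pass draws a fresh value below `p`.
        if (0..passes).all(|_| rng.next_f64() < p) {
            kept += 1;
        }
    }
    kept as f64 / rows as f64
}

fn main() {
    let once = surviving_fraction(1_000_000, 0.1, 1, 42);
    let twice = surviving_fraction(1_000_000, 0.1, 2, 42);
    println!("filter applied once:  {once:.4}");
    println!("filter applied twice: {twice:.4}");
}
```

With one pass the surviving fraction should land near 0.1, and with two passes near 0.01, which matches the ratio between the two `COUNT(*)` results in the reproduction.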
It feels like the "right" solution is to somehow guarantee that `RANDOM()` always returns the same value for a given row and query evaluation, perhaps by "caching" its values.

### Additional context

In my custom `TableProvider`, I tried using `TableProviderFilterPushDown::Exact` for these filters, and I got the results I expected. However, it seems that this only works because my filter is really simple.
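
One way to picture the "same value for a given row and query evaluation" idea is to derive the value from a per-query seed and a row identifier instead of drawing a fresh number on each call. The sketch below is an assumption-laden illustration, not a proposal for how DataFusion should implement it: `row_random`, `count_kept`, the `query_seed` parameter, and the notion of a stable `row_id` are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a per-row `random()`: the value depends only on
/// a per-query seed and the row's position, so re-evaluation is repeatable.
fn row_random(query_seed: u64, row_id: u64) -> f64 {
    let mut hasher = DefaultHasher::new();
    (query_seed, row_id).hash(&mut hasher);
    // Map the top 53 bits of the hash into [0, 1).
    (hasher.finish() >> 11) as f64 / (1u64 << 53) as f64
}

/// Counts rows that pass `row_random(..) < p` on every one of `passes` evaluations.
fn count_kept(rows: u64, p: f64, passes: u32, query_seed: u64) -> u64 {
    (0..rows)
        .filter(|&row| (0..passes).all(|_| row_random(query_seed, row) < p))
        .count() as u64
}

fn main() {
    let once = count_kept(1_000_000, 0.1, 1, 7);
    let twice = count_kept(1_000_000, 0.1, 2, 7);
    println!("kept after one evaluation:  {once}");
    println!("kept after two evaluations: {twice}");
}
```

Because each row's value is a pure function of `(query_seed, row_id)`, evaluating the filter at the scan and again in a `FilterExec` would keep exactly the same rows. The hard part in a real engine, which this sketch glosses over, is having a stable row identifier available at both evaluation points.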