adamfaulkner-at opened a new issue, #13268:
URL: https://github.com/apache/datafusion/issues/13268

   ### Describe the bug
   
   When running a query like
   ```
   SELECT * FROM table WHERE RANDOM() < 0.1;
   ```
   
   I get different results depending on the value of 
`"datafusion.execution.parquet.pushdown_filters"`. When this setting is turned 
off, I get the results I expect, roughly 10% of the rows in the table. When it 
is turned on, I think I'm seeing 1% of the rows in the table.
   
   I suspect I'm seeing these results because pushdown with 
`TableProviderFilterPushDown::Inexact` is applying this filter at both the 
parquet level and a `FilterExec: random() <= 0.1`. This results in the 
`RANDOM()` filter being evaluated twice, which causes fewer rows to be sampled.
   
   ### To Reproduce
   
   This can be reproduced with `datafusion-cli` version 42.2.0:
   
   Without `pushdown_filters`
   ```
   > create external table data stored as parquet location 
'/Users/adam.faulkner/Downloads/parquet_data/';
   > select COUNT(*) from data WHERE RANDOM() < 0.1;
   +----------+
   | count(*) |
   +----------+
   | 605572   |
   +----------+
   1 row(s) fetched.
   Elapsed 0.043 seconds.
   ```
   
   With `pushdown_filters` (note that you must re-create the table with the 
updated setting):
   ```
   > set datafusion.execution.parquet.pushdown_filters=true;
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > create external table data stored as parquet location 
'/Users/adam.faulkner/Downloads/parquet_data/';
   0 row(s) fetched.
   Elapsed 0.007 seconds.
   
   > select COUNT(*) from data WHERE RANDOM() < 0.1;
   +----------+
   | count(*) |
   +----------+
   | 60152    |
   +----------+
   1 row(s) fetched.
   Elapsed 0.045 seconds.
   ```
   
   ### Expected behavior
   
   I would expect that a filter on `RANDOM()` would be applied only once, so 
that `RANDOM() < 0.1` means that only 10% of all rows will be sampled.
   
   It would be acceptable if `RANDOM()` was no longer eligible for pushdown, 
though I suspect this leaves a negligible amount of performance on the table 
compared to the alternative.
   
   It feels like the "right" solution is to somehow guarantee that `RANDOM()` 
always returns the same value for a given row and query evaluation, perhaps by 
"caching" its values. 
   
   ### Additional context
   
   In my custom TableProvider, I tried using 
``TableProviderFilterPushDown::Exact` for these filters, and I get the results 
that I expect. However, it seems that this is only because my filter is really 
simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to