theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2985522134
> According to PostgreSQL's reference: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#SYSTEM_Option I believe the `SYSTEM` option is equivalent to keeping the entire `RecordBatch` according to the specified probability. The rewrite rule implemented here samples row by row, which follows the behavior of the `BERNOULLI` option. Since df has vectorized execution, evaluating a `random() < x` filter should be efficient, so I think we can apply this implementation to both the `SYSTEM` and `BERNOULLI` options to keep it simple.

@2010YOUY01 I'd like to double-check whether pushing a volatile filter down to the Parquet executor is expected. In the mentioned PR, I disabled the optimisation in the logical plan optimiser that pushes down volatile predicates. But it seems the physical optimiser still pushes this predicate down to the executor. While this helps us with automatic sampling, the results could be wrong. What do you think: should we implement a similar mechanism to mark volatile predicates as unsupported filters?
Before:

```
[2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical plan by LimitedDistinctAggregation:
OutputRequirementExec
  ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
    AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
      AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
        FilterExec: random() < 0.1
          DataSourceExec: file_groups={1 group: [[sample.parquet]]}, file_type=parquet
```

After:

```
[2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical plan by FilterPushdown:
OutputRequirementExec
  ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
    AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
      AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
        DataSourceExec: file_groups={1 group: [[sample.parquet]]}, file_type=parquet, predicate=random() < 0.1
```

Data:

<details>

set datafusion.execution.parquet.pushdown_filters=true;
create external table data stored as parquet location 'sample.parquet';
SELECT count(*) FROM data WHERE random() < 0.1;

</details>
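One way a pushed-down volatile predicate could give wrong results is if it ends up being evaluated more than once per row, for example at the scan and again in a filter that is kept above it. The sketch below is a plain Python simulation of that scenario, not DataFusion code: `sample_once` and `sample_twice` are hypothetical helpers, and the double-evaluation case is an assumption for illustration (the After plan above shows the `FilterExec` removed, so it does not claim this is what happens for this query).

```python
import random


def sample_once(rows, p, rng):
    """Evaluate `random() < p` exactly once per row (Bernoulli sampling)."""
    return [r for r in rows if rng.random() < p]


def sample_twice(rows, p, rng):
    """Evaluate the same volatile predicate twice, with fresh random draws.

    Assumed scenario: one copy of the predicate runs inside the scan and a
    second copy runs in a filter that stays above it.
    """
    scanned = [r for r in rows if rng.random() < p]  # pushed-down copy
    return [r for r in scanned if rng.random() < p]  # retained filter copy


rng = random.Random(42)  # fixed seed so the run is repeatable
rows = range(200_000)
p = 0.1

once = len(sample_once(rows, p, rng)) / len(rows)
twice = len(sample_twice(rows, p, rng)) / len(rows)

print(f"evaluated once:  {once:.4f}  (expected about p = {p})")
print(f"evaluated twice: {twice:.4f}  (expected about p*p = {p * p:.2f})")
```

Because each evaluation draws a new random number, the second case keeps roughly `p * p` of the rows instead of `p`. A deterministic predicate would not show this difference, which is why treating volatile predicates as unsupported for pushdown is a plausible safeguard.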