adriangb commented on PR #13795: URL: https://github.com/apache/datafusion/pull/13795#issuecomment-2548395059
Thank you for filling in the backstory Andrew! Yes that's right I basically have the statistics in a Postgres table (with `file_id` and `row_group` columns) so I end up running a query along the lines of: ```sql select file_id, row_group, <some other metadata> from stats where <pruning predicate> is not false ``` Note that I add `is not false`, will come back to that later. The point is that even for 100k row groups / several thousand files it's very easy for Postgres to scan this table and return the matching row groups (easy as in <1s) while it would be a lot of work to do a LIST on a bucket, download footers, decode, get stats, etc. And this is better than storing the data in postgres, copying it to the application and doing the pruning there because it saves moving the data over the wire. I want to make it fast even for _millions_ of row groups, which requires an index on some of the stats columns (for us in particular temporal or partitioning columns make sense, so something like `start_timestamp_min` gets an index because pretty much every query filters by it). Now about the `is not false`: I found that `PruningPredicate` _can_ return null values: if `_max` or `_min` are null. Consider: ```sql postgres=# create table stats (row_count int, null_count int, min int, max int); CREATE TABLE postgres=# insert into stats values (1, 0, null, null); INSERT 0 1 postgres=# select case when row_count = null_count then false else min > 1 end from stats; case ------ (1 row) ``` Maybe this can never happen with Parquet stats, but it certainly can happen with my setup: you add a new column and all "old" rows for files that don't have that column are null. So I don't think using a CASE statement is enough to guarantee that the predicate can never return null. There must be some further guarantees about Parquet metadata specifically e.g. that the `min`/`max` stats can only be null if all of the rows are null or something. In fact, I *think* there is code to handle that here: https://github.com/apache/datafusion/blob/f4e65d2d9711ed097982d2fbde4191c402c05023/datafusion/physical-optimizer/src/pruning.rs#L727-L730 Regarding Postgres' ability to handle `CASE`, I was a bit surprised as well but indeed it does not use an index with `CASE` expressions: ```sql postgres=# create table stats (row_count int, null_count int, min int, max int); CREATE TABLE postgres=# INSERT INTO stats (row_count, null_count, min, max) INSERT INTO stats (row_count, null_count, min, max) SELECT 1, 0, g.id as min, g.id + 1 as max FROM generate_series(0, 10000000) g(id); INSERT 0 10000001 postgres=# create index on stats (min); CREATE INDEX postgres=# SET enable_seqscan = OFF; SET postgres=# explain select * from stats where case when row_count = null_count then false else min > 599 end; QUERY PLAN -------------------------------------------------------------------------------- Seq Scan on stats (cost=10000000000.00..10000204052.92 rows=4999930 width=16) Filter: CASE WHEN (row_count = null_count) THEN false ELSE (min > 599) END postgres=# explain select * from stats where not row_count = null_count and min > 599; QUERY PLAN --------------------------------------------------------------------------------------- Index Scan using stats_min_idx on stats (cost=0.43..363724.15 rows=9949240 width=16) Index Cond: (min > 599) Filter: (row_count <> null_count) ``` Two other general notes: - It would be nice if PruningPrediacate could take into account which columns are non-nullable and skip the `row_count = null_count` condition for those (will always evaluate to `false`). - I would have flipped the boolean logic so that if the predicate returns `false`(ey) we keep the container. That way `null` and `false` meen `keep` and only `true` means prune. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org