adriangb commented on PR #13795:
URL: https://github.com/apache/datafusion/pull/13795#issuecomment-2548395059

   Thank you for filling in the backstory Andrew!
   
   Yes that's right I basically have the statistics in a Postgres table (with 
`file_id` and `row_group` columns) so I end up running a query along the lines 
of:
   
   ```sql
   select file_id, row_group, <some other metadata>
   from stats
   where <pruning predicate> is not false
   ```
   
   Note that I add `is not false`, will come back to that later.
   
   The point is that even for 100k row groups / several thousand files it's 
very easy for Postgres to scan this table and return the matching row groups 
(easy as in <1s) while it would be a lot of work to do a LIST on a bucket, 
download footers, decode, get stats, etc. And this is better than storing the 
data in postgres, copying it to the application and doing the pruning there 
because it saves moving the data over the wire.
   I want to make it fast even for _millions_ of row groups, which requires an 
index on some of the stats columns (for us in particular temporal or 
partitioning columns make sense, so something like `start_timestamp_min` gets 
an index because pretty much every query filters by it).
   
   Now about the `is not false`: I found that `PruningPredicate` _can_ return 
null values: if `_max` or `_min` are null.
   
   Consider:
   
   ```sql
   postgres=# create table stats (row_count int, null_count int, min int, max 
int);
   CREATE TABLE
   postgres=# insert into stats values (1, 0, null, null);
   INSERT 0 1
   postgres=# select case when row_count = null_count then false else min > 1 
end from stats;
    case
   ------
   
   (1 row)
   ```
   
   Maybe this can never happen with Parquet stats, but it certainly can happen 
with my setup: you add a new column and all "old" rows for files that don't 
have that column are null.
   So I don't think using a CASE statement is enough to guarantee that the 
predicate can never return null. There must be some further guarantees about 
Parquet metadata specifically e.g. that the `min`/`max` stats can only be null 
if all of the rows are null or something.
   In fact, I *think* there is code to handle that here: 
https://github.com/apache/datafusion/blob/f4e65d2d9711ed097982d2fbde4191c402c05023/datafusion/physical-optimizer/src/pruning.rs#L727-L730
   
   Regarding Postgres' ability to handle `CASE`, I was a bit surprised as well 
but indeed it does not use an index with `CASE` expressions:
   
   ```sql
   postgres=# create table stats (row_count int, null_count int, min int, max 
int);
   CREATE TABLE
   postgres=# INSERT INTO stats (row_count, null_count, min, max)
   INSERT INTO stats (row_count, null_count, min, max)
   SELECT 
       1,
       0,
       g.id as min,
       g.id + 1 as max
   FROM generate_series(0, 10000000) g(id);
   INSERT 0 10000001
   postgres=# create index on stats (min);
   CREATE INDEX
   postgres=# SET enable_seqscan = OFF;
   SET
   postgres=# explain select * from stats where case when row_count = 
null_count then false else min > 599 end;
                                      QUERY PLAN
   
--------------------------------------------------------------------------------
    Seq Scan on stats  (cost=10000000000.00..10000204052.92 rows=4999930 
width=16)
      Filter: CASE WHEN (row_count = null_count) THEN false ELSE (min > 599) END
   postgres=# explain select * from stats where not row_count = null_count and 
min > 599;
                                         QUERY PLAN
   
---------------------------------------------------------------------------------------
    Index Scan using stats_min_idx on stats  (cost=0.43..363724.15 rows=9949240 
width=16)
      Index Cond: (min > 599)
      Filter: (row_count <> null_count)
   ```
   
   Two other general notes:
   - It would be nice if PruningPrediacate could take into account which 
columns are non-nullable and skip the `row_count = null_count` condition for 
those (will always evaluate to `false`).
   - I would have flipped the boolean logic so that if the predicate returns 
`false`(ey) we keep the container. That way `null` and `false` meen `keep` and 
only `true` means prune.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to