adriangb commented on PR #13795:
URL: https://github.com/apache/datafusion/pull/13795#issuecomment-2548395059
Thank you for filling in the backstory Andrew!
Yes that's right I basically have the statistics in a Postgres table (with
`file_id` and `row_group` columns) so I end up running a query along the lines
of:
```sql
select file_id, row_group, <some other metadata>
from stats
where <pruning predicate> is not false
```
Note that I add `is not false`, will come back to that later.
The point is that even for 100k row groups / several thousand files it's
very easy for Postgres to scan this table and return the matching row groups
(easy as in <1s) while it would be a lot of work to do a LIST on a bucket,
download footers, decode, get stats, etc. And this is better than storing the
data in postgres, copying it to the application and doing the pruning there
because it saves moving the data over the wire.
I want to make it fast even for _millions_ of row groups, which requires an
index on some of the stats columns (for us in particular temporal or
partitioning columns make sense, so something like `start_timestamp_min` gets
an index because pretty much every query filters by it).
Now about the `is not false`: I found that `PruningPredicate` _can_ return
null values: if `_max` or `_min` are null.
Consider:
```sql
postgres=# create table stats (row_count int, null_count int, min int, max
int);
CREATE TABLE
postgres=# insert into stats values (1, 0, null, null);
INSERT 0 1
postgres=# select case when row_count = null_count then false else min > 1
end from stats;
case
------
(1 row)
```
Maybe this can never happen with Parquet stats, but it certainly can happen
with my setup: you add a new column and all "old" rows for files that don't
have that column are null.
So I don't think using a CASE statement is enough to guarantee that the
predicate can never return null. There must be some further guarantees about
Parquet metadata specifically e.g. that the `min`/`max` stats can only be null
if all of the rows are null or something.
In fact, I *think* there is code to handle that here:
https://github.com/apache/datafusion/blob/f4e65d2d9711ed097982d2fbde4191c402c05023/datafusion/physical-optimizer/src/pruning.rs#L727-L730
Regarding Postgres' ability to handle `CASE`, I was a bit surprised as well
but indeed it does not use an index with `CASE` expressions:
```sql
postgres=# create table stats (row_count int, null_count int, min int, max
int);
CREATE TABLE
postgres=# INSERT INTO stats (row_count, null_count, min, max)
INSERT INTO stats (row_count, null_count, min, max)
SELECT
1,
0,
g.id as min,
g.id + 1 as max
FROM generate_series(0, 10000000) g(id);
INSERT 0 10000001
postgres=# create index on stats (min);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# explain select * from stats where case when row_count =
null_count then false else min > 599 end;
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on stats (cost=10000000000.00..10000204052.92 rows=4999930
width=16)
Filter: CASE WHEN (row_count = null_count) THEN false ELSE (min > 599) END
postgres=# explain select * from stats where not row_count = null_count and
min > 599;
QUERY PLAN
---------------------------------------------------------------------------------------
Index Scan using stats_min_idx on stats (cost=0.43..363724.15 rows=9949240
width=16)
Index Cond: (min > 599)
Filter: (row_count <> null_count)
```
Two other general notes:
- It would be nice if PruningPrediacate could take into account which
columns are non-nullable and skip the `row_count = null_count` condition for
those (will always evaluate to `false`).
- I would have flipped the boolean logic so that if the predicate returns
`false`(ey) we keep the container. That way `null` and `false` meen `keep` and
only `true` means prune.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]