adriangb opened a new issue, #21624:
URL: https://github.com/apache/datafusion/issues/21624

   Our current config of `datafusion.execution.collect_statistics` feels like a 
nuclear option that blindly collects all statistics even if they are not useful 
for the query at hand.
   
   Two cases stand out to me as egregious.
   
   ## Pointless statistics for all columns on wide tables
   
   ```sql
   SELECT *
   FROM t
   ORDER BY ts
   LIMIT 10
   ```
   
   With dynamic filters from the TopK operator there is a real advantage to having file-level statistics: we use them to skip files before ever opening them if we can prove they won't factor into the TopK heap. However, on a wide table we collect and store statistics for *all columns*, even though the only stats used are those for `ts`. For `datafusion-cli` the main waste is storing all of those stats in memory. This is not a small problem: there has been a lot of recent work on caching statistics with bounded memory (e.g. https://github.com/apache/datafusion/issues/19052) precisely to reduce memory consumption from statistics. For a query like this we could cut memory consumption by something like 99% when the table is wide, the other columns are strings (whose stats are larger), etc.
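   To make the waste concrete, here is a minimal sketch (plain Rust; `FileStats` and `can_skip` are hypothetical names for illustration, not DataFusion APIs) of the pruning decision for the query above. The only statistic it can ever consult is the per-file minimum of `ts`, so stats for every other column are dead weight:

   ```rust
   /// Hypothetical file-level statistics: for `ORDER BY ts LIMIT 10` only the
   /// minimum of `ts` is ever consulted, yet today we collect stats for every
   /// column in the table.
   struct FileStats {
       ts_min: i64,
   }

   /// Skip a file when its smallest `ts` already exceeds the current k-th value
   /// in the TopK heap: no row in the file can displace anything in the heap.
   fn can_skip(file: &FileStats, heap_threshold: Option<i64>) -> bool {
       match heap_threshold {
           Some(kth_value) => file.ts_min > kth_value,
           None => false, // heap not yet full: the file must be read
       }
   }
   ```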
   
   For `ListingTable`, memory consumption is the main issue: since stats are collected from parquet thrift footers, you at least have to read all of the stats into memory (there are ideas for skipping decoding, but those are not implemented yet). However, for systems like the one we have at Pydantic (which stores stats in parquet files) or Ducklake (which stores stats in Postgres), it is trivial and beneficial to collect stats for only some columns. So there is also an IO price being paid to collect useless stats.
   
   Obviously for a query like `SELECT * FROM t` there is no point in collecting any stats at all.
   
   ## Collecting stats for files that are never touched
   
   Another issue is that we collect stats even for files the query never ends up touching.
   
   For example:
   
   ```sql
   SELECT val
   FROM t
   WHERE val = 'abc'
   LIMIT 10
   ```
   
   This query can push the limit into the scan such that we may stop after reading the first file. Stats on `val` may be useful to skip entire files, but if there are 100s of files and we can satisfy the limit by e.g. skipping 3 files and reading 2, it was pointless to collect stats for the other 95 files; we never even needed to read their footers.
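   A sketch of why eager collection loses here (plain Rust; `ScanFile` and `scan_with_limit` are hypothetical names, not DataFusion APIs): if stats are fetched lazily per file inside the scan loop, the files after the one that satisfies the limit never pay the stats cost at all.

   ```rust
   /// Hypothetical per-file information; `prunable` stands in for "stats on
   /// `val` prove the file contains no matching rows".
   struct ScanFile {
       prunable: bool,
       matching_rows: usize,
   }

   /// Fetch stats lazily, one file at a time, stopping once the LIMIT is
   /// satisfied. Returns how many files had their stats fetched at all.
   fn scan_with_limit(files: &[ScanFile], limit: usize) -> usize {
       let mut produced = 0;
       let mut stats_fetched = 0;
       for file in files {
           stats_fetched += 1; // only now read this file's footer / stats
           if file.prunable {
               continue; // skipped via stats; data pages never read
           }
           produced += file.matching_rows;
           if produced >= limit {
               break; // limit satisfied: remaining files are never touched
           }
       }
       stats_fetched
   }
   ```

   With 100 files where the first 3 are prunable and the next 2 satisfy the limit, only 5 files ever have their stats fetched; the other 95 stay untouched.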
   
   ## Collecting statistics is buried inside of `ListingTable`
   
   There are no generalized APIs for "I have a source of statistics"; everyone has to build their own and inject the stats into the `PartitionedFile`s they return from `ListingTable`.
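   One possible shape for such an API, sketched in Rust; the trait, struct, and method names below are all hypothetical, not existing DataFusion types:

   ```rust
   /// Which statistics the query actually needs, derived during planning:
   /// only some columns, and possibly only for some bounded number of files.
   struct StatsRequest {
       columns: Vec<String>,
       max_files: Option<usize>,
   }

   /// Minimal per-column statistics (i64 min/max only, for brevity).
   #[derive(Clone)]
   struct ColumnStats {
       min: Option<i64>,
       max: Option<i64>,
   }

   /// Anything that can supply statistics: parquet footers, a stats store in
   /// parquet files (as at Pydantic), or Postgres (as in Ducklake).
   trait StatisticsProvider {
       /// Returns stats per file, per requested column (in request order).
       fn collect(&self, request: &StatsRequest) -> Vec<Vec<ColumnStats>>;
   }
   ```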
   
   ## Proposal
   
   I've started poking at this in 
https://github.com/apache/datafusion/pull/21157 but I think the solution is to 
be lazier about collecting statistics and to decide what stats to collect / 
store based on the needs of the query. 
   
   One idea is that instead of a single "execute query" step we have:
   
   1. Stats collection based on the query's requirements (which columns; all files or up to some limit within each partition / scan queue). I imagine this could be a good place to introduce things like a join operator requesting a cardinality sample for a column (maybe randomly sampling files, row groups, or pages). This is like a mini query in that it requires IO and CPU work, uses memory, etc.
   2. Actual execution.
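   The two steps above might be sketched like this (plain Rust; every name here is hypothetical, not an existing DataFusion API):

   ```rust
   /// Stand-in for a query plan.
   struct Plan {
       order_by: Vec<String>,
   }

   /// What step 1 decides to collect: only the columns the plan can use for
   /// pruning, possibly bounded per partition / scan queue.
   struct StatsNeeds {
       columns: Vec<String>,
       file_limit: Option<usize>,
   }

   /// Step 1: inspect the plan and request only the stats it can actually use.
   /// A real version would walk the plan; here a TopK on `ts` requests `ts` only.
   fn derive_stats_needs(plan: &Plan) -> StatsNeeds {
       StatsNeeds {
           columns: plan.order_by.clone(),
           file_limit: None,
       }
   }

   /// Step 2 would then execute the plan with scans pruned by the collected
   /// stats; the collection itself is like a mini query (IO, CPU, memory).
   fn execute(plan: &Plan) {
       let needs = derive_stats_needs(plan);
       let _stats = needs; // ... collect `needs`, then run the plan ...
   }
   ```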
   
   Getting the right APIs and heuristics for (1) is going to be the hard part. For example, is it the TopK operator that asks for stats on the columns it is sorting on? The stats may not make it back up to the TopK (e.g. if there is a join in the middle). And even then, the TopK operator itself may never use the stats directly: it just wants the scans to use them for pruning. Joins are a bit different: the join operator wants cardinality stats to choose a join order (though that would probably be done by an optimizer as well, making it even less clear what the APIs should be).
   
   Something like this is already happening today, but as noted above it is implicitly buried in `ListingTable`: we do a round of stats collection during planning (a mix of IO and CPU work), but it is ad-hoc, not tied to knowledge of the query or the optimizer's needs, and not thought of as a public API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

