adriangb opened a new issue, #21624: URL: https://github.com/apache/datafusion/issues/21624
Our current config of `datafusion.execution.collect_statistics` feels like a nuclear option: it blindly collects all statistics even if they are not useful for the query at hand. Two cases stand out as egregious.

## Pointless statistics for all columns on wide tables

```sql
SELECT * FROM t ORDER BY ts LIMIT 10
```

With dynamic filters from the TopK operator there is a real advantage to having file-level statistics: we use them to skip files before opening them if we can prove they won't factor into the TopK heap. However, on a wide table we collect and store statistics for *all columns* in the table even though the only stats used are those for `ts`.

For `datafusion-cli` the main waste is storing all of those stats in memory. This is not a small problem: there has been a lot of recent work on caching statistics with bounded memory (e.g. https://github.com/apache/datafusion/issues/19052) in an effort to reduce memory consumption from statistics. For a query like this we could reduce memory consumption by something like 99% if the table is wide, other columns are strings (which have larger stats), etc.

For `ListingTable`, memory consumption is the main issue: since stats are collected from Parquet thrift footers, you at least have to read all of the stats into memory (there are some ideas to skip decoding, but that's not implemented yet). However, for systems like the one we have at Pydantic (which stores stats in Parquet files) or DuckLake (which stores stats in Postgres), it is trivial and beneficial to collect stats for only some columns. So there is also an IO price being paid to collect useless stats.

Obviously, for a query like `SELECT * FROM t` there is no point in collecting any stats at all.

## Collecting stats for files that are never touched

Another issue is collecting stats for files that are never touched. For example:

```sql
SELECT val FROM t WHERE val = 'abc' LIMIT 10
```

This query can push the limit into the scan such that we may stop after reading the first file.
Stats on `val` may be useful to skip entire files, but suppose there are 100 files and we satisfy the limit by, say, skipping 3 files and then reading 2: it was pointless to collect stats for the other 95 files; we never even needed to read their footers.

## Collecting statistics is buried inside of `ListingTable`

There are no generalized APIs for "I have a source of statistics"; everyone has to build their own and inject the stats into the `PartitionedFile`s they return from `ListingTable`.

## Proposal

I've started poking at this in https://github.com/apache/datafusion/pull/21157, but I think the solution is to be lazier about collecting statistics and to decide what stats to collect / store based on the needs of the query. One idea is that instead of a single "execute query" step we have:

1. Stats collection based on the query's requirements (which columns; all files or up to some limit within each partition / scan queue). I imagine this could be a good place to introduce things like a join operator requesting a cardinality sample for a column (randomly sample files, randomly sample row groups, randomly sample pages). This is like a mini query in that it requires IO and CPU work, uses memory, etc.
2. Actual execution.

Getting the right APIs and heuristics for (1) is going to be the hard part. For example, is it the TopK operator that asks for stats on the columns it is sorting on? The stats may not make it back up to the TopK (e.g. if there is a join in the middle). Even then, the TopK operator itself may not use the stats at all; it just wants the scans to use them for pruning. Joins are a bit different: the join operator wants cardinality stats to choose join order (this would probably be done by an optimizer as well, making it even more unclear what the APIs should be).
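To make phase (1) a bit more concrete, here is a toy sketch of what a "stats requirements" structure could look like: operators declare what they need (all type and method names here are hypothetical, not DataFusion's actual API), and the requirements are merged before handing them to a scan. The point is only that requirements are per-column and bounded per-file, not a proposal for the real interface.

```rust
use std::collections::BTreeSet;

/// Hypothetical: what one operator asks for. A TopK would request
/// min/max on its sort column; a join might request a cardinality
/// sample instead.
#[derive(Debug, Clone)]
enum StatsRequest {
    MinMax { column: String },
    CardinalitySample { column: String, sample_files: usize },
}

/// Hypothetical: the merged requirements handed to a scan before
/// execution starts.
#[derive(Debug, Default)]
struct StatsRequirements {
    /// Only these columns need stats, however wide the table is.
    columns: BTreeSet<String>,
    /// Stop collecting stats after this many files, e.g. because a
    /// pushed-down LIMIT means later files may never be read.
    max_files: Option<usize>,
}

impl StatsRequirements {
    fn add(&mut self, req: &StatsRequest) {
        match req {
            StatsRequest::MinMax { column }
            | StatsRequest::CardinalitySample { column, .. } => {
                self.columns.insert(column.clone());
            }
        }
    }
}

fn main() {
    // SELECT * FROM t ORDER BY ts LIMIT 10: only `ts` needs stats,
    // no matter how many other columns the table has.
    let mut reqs = StatsRequirements::default();
    reqs.add(&StatsRequest::MinMax { column: "ts".into() });
    reqs.add(&StatsRequest::CardinalitySample { column: "ts".into(), sample_files: 8 });
    reqs.max_files = Some(100);
    assert_eq!(reqs.columns.len(), 1);
    assert!(reqs.columns.contains("ts"));
}
```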
This is kind of happening already, but as noted above it's implicitly buried in `ListingTable`: we do a run of stats collection during planning (a mix of IO and CPU), but it's ad-hoc, not tied into knowledge of the query or the optimizer's needs, and not thought of as a public API.
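To quantify the savings described above, here is a toy model (hypothetical code, not DataFusion's) of a lazy, limit-aware scan over the `WHERE val = ... LIMIT ...` example: footers are only touched for files the scan actually reaches, stats are only fetched for `val`, and the scan stops as soon as the limit is satisfied.

```rust
/// Toy per-column min/max stats, as a Parquet footer would provide.
#[derive(Debug, Clone, Copy)]
struct ColumnStats { min: i64, max: i64 }

/// One file; we only ever fetch the stats for the column the query
/// filters on.
struct File { rows_matching: usize, val_stats: ColumnStats }

/// Models `SELECT val FROM t WHERE val = x LIMIT limit`:
/// prune files whose [min, max] cannot contain x, stop once the limit
/// is met, and count how many footers were actually read.
fn scan(files: &[File], x: i64, limit: usize) -> (usize, usize) {
    let mut produced = 0;
    let mut footers_read = 0;
    for f in files {
        if produced >= limit {
            break; // later files: no stats IO at all
        }
        footers_read += 1; // stats fetched lazily, only for `val`
        if x < f.val_stats.min || x > f.val_stats.max {
            continue; // pruned without reading any data pages
        }
        produced += f.rows_matching;
    }
    (produced.min(limit), footers_read)
}

fn main() {
    // 100 files: the first 3 are pruned by stats, the next 2 satisfy
    // the limit, so the other 95 footers are never read.
    let mut files = Vec::new();
    for i in 0..100 {
        let stats = if i < 3 {
            ColumnStats { min: 0, max: 10 } // cannot contain 42
        } else {
            ColumnStats { min: 0, max: 100 }
        };
        files.push(File { rows_matching: 5, val_stats: stats });
    }
    let (produced, footers) = scan(&files, 42, 10);
    assert_eq!(produced, 10);
    assert_eq!(footers, 5); // 3 pruned + 2 read; 95 untouched
}
```

Eager collection would have read all 100 footers for all columns; the lazy version touches 5 footers for 1 column.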
