adriangb opened a new issue, #17348: URL: https://github.com/apache/datafusion/issues/17348
**Background** We already have an excellent blog post that I think does a great introduction to the importance of optimizing sort order information in queries: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/ Work on dynamic filters has made this even more interesting: https://datafusion.staged.apache.org/blog/2025/09/01/dynamic-filters/. TopK dynamic filters will perform *significantly better* because we generate a more selective filter earlier on and thus can prune more data. Interestingly, and not touched upon in the article, even without dynamic filters or perfectly sorted inputs TopK operators benefit from partially sorted inputs because they are able to more quickly discard batches as they come in instead of churning the heap, although this effect is probably minor. I think the same sort of thing applies to other sort operators, depending on the algorithms used (see https://www.toptal.com/developers/sorting-algorithms for examples). Thus there are a couple high level aspects of the current sorting I think should be improved, detailed below. **Exact vs. Partial ordering** A lot of the sort machinery currently focuses on what I'll call "exact" ordering. This is when the data within each file and the files themselves can be ordered in such a way that one can scan the table completely skipping any upstream sorting. There is little done about *partial* ordering. In particular, I think about the case where the files themselves may not be ordered internally but we can use the min/max stats to order between files to some extent. Two examples that come to mind: - Timestamp columns. Often if data is appended to files as they come in data will be roughly clustered by time, even if it is not sorted by time. - Values that can be efficiently zone map indexed: consider something like `salary`, depending on the data distribution if we want `ORDER BY salary DESC LIMIT 10` the top 10 outliers might stick out like a sore thumb in the min/max stats -> we can scan files with the largest `max(salary)` first -> we can skip more of the other files. I also think that *exact* ordering is a subset of partial ordering where you just happen to be able to order amongst the files *and* you know that the order within the files matches the desired order. There's even cases where files are *clustered* in such a way that they are not exactly ordered but are statistically much better than randomly ordered, in this case the partial ordering within the files won't let us skip a sort but might make operations more efficient (see point above about some sorts being much faster with partially sorted data). **Order of the Data vs order of the Query** Currently the only way to get some benefit from ordering is to specify a known ordering of the data upfront: ```sql CREATE EXTERNAL TABLE source ( amount INT NOT NULL, price DOUBLE NOT NULL, time TIMESTAMP NOT NULL, ... ) STORED AS CSV WITH ORDER (time ASC) WITH ORDER (amount ASC, price ASC) LOCATION '/path/to/FILE_NAME.csv' OPTIONS ('has_header' 'true'); ``` (taken from the blog post) But what if this order doesn't match the desired order of the query? E.g. if we want `ORDER BY price`. The physical ordering of the file doesn't help us at all here, it's as good as a random scan. We're much better off using min/max stats on `price` to re-order the scan so that the file opening is done in ~ the order the query wants, and we don't use the ordering within the files at all. **Inferred order vs Known order** Related to the point above: users currently have to specify the order of a table. I think this could in many cases be inferred from a combination of metadata on each file (Parquet for example has metadata to specify the ordering of a file) and min/max stats (which give an ordering between files). Using this we can reconstruct *exact* or *partial* ordering without users having to specify a known order upfront. **Proposal** I propose that we move to a system where: 1. The desired ordering of the query is pushed down into the `TableScan` / `TableProvider::scan` so that file opens can be optimized to match the desired order of the query, be that an *exact* order if possible or a *partial* order if not. 2. Our default `ListingTable` should use file metadata (min/max stats and any ordering information) to arrange files to best match the queries desired ordering. The `ExecutionPlan` that `TableProvider::scan` should record if the order is *exact* so that subsequent physical optimizers can remove unecessary upstream sort operations. The goal should be to produce an *exact* order if possible and if not fall back to a best effort *partial* order (which still requires an upstream sort). 3. We should infer ordering information from file metadata instead of requiring users to specify a known ordering upfront. **Implementation** *Refactor `TableProvider::scan` into `TableProvider::scan_with_args`*: we need to make API changes to `scan()` to pass in ordering information. Adding a parameter would be a breaking change. Let's take this opportunity to refactor the method into something we can change over time while minimizing breaking API changes for downstream consumers. https://github.com/apache/datafusion/pull/17336 *Push down sorting information into `TableScan` / `TableProvider`. This requires some careful evaluation of optimizer rules, i.e. what can we push down sorts through and what can we not. I suggest we start with the obviously okay cases and add more complex ones in the future. https://github.com/apache/datafusion/pull/17337 *Connect the two above changes by passing the sort order from `TableScan` into `TableProvider::scan_with_args` in our logical -> physical planning step. *Rewrite our re-partitioning / sorting to first try to create an exact ordering and fall back to partial ordering if not possible, marking the resultant `ExecutionPlan` with which one it has. *On a per-file-format basis extract ordering information from the files*: I think this involves some refactoring to go from e.g. `FileFormat::collect_statistics() -> Statistics` to `FileFormat::collect_metadata() -> StatisticsAndSortInformation` as well as some other tbd work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org