adriangb opened a new issue, #17348:
URL: https://github.com/apache/datafusion/issues/17348

   **Background**
   
   We already have an excellent blog post that I think does a great 
introduction to the importance of optimizing sort order information in queries: 
https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/
   
   Work on dynamic filters has made this even more interesting: 
https://datafusion.staged.apache.org/blog/2025/09/01/dynamic-filters/. TopK 
dynamic filters will perform *significantly better* because we generate a more 
selective filter earlier on and thus can prune more data. Interestingly, and 
not touched upon in the article, even without dynamic filters or perfectly 
sorted inputs TopK operators benefit from partially sorted inputs because they 
are able to more quickly discard batches as they come in instead of churning 
the heap, although this effect is probably minor. I think the same sort of 
thing applies to other sort operators, depending on the algorithms used (see 
https://www.toptal.com/developers/sorting-algorithms for examples).
   
   Thus there are a couple high level aspects of the current sorting I think 
should be improved, detailed below.
   
   **Exact vs. Partial ordering**
   
   A lot of the sort machinery currently focuses on what I'll call "exact" 
ordering. This is when the data within each file and the files themselves can 
be ordered in such a way that one can scan the table completely skipping any 
upstream sorting. There is little done about *partial* ordering. In particular, 
I think about the case where the files themselves may not be ordered internally 
but we can use the min/max stats to order between files to some extent. Two 
examples that come to mind:
   - Timestamp columns. Often if data is appended to files as they come in data 
will be roughly clustered by time, even if it is not sorted by time.
   - Values that can be efficiently zone map indexed: consider something like 
`salary`, depending on the data distribution if we want `ORDER BY salary DESC 
LIMIT 10` the top 10 outliers might stick out like a sore thumb in the min/max 
stats -> we can scan files with the largest `max(salary)` first -> we can skip 
more of the other files.
   
   I also think that *exact* ordering is a subset of partial ordering where you 
just happen to be able to order amongst the files *and* you know that the order 
within the files matches the desired order. There's even cases where files are 
*clustered* in such a way that they are not exactly ordered but are 
statistically much better than randomly ordered, in this case the partial 
ordering within the files won't let us skip a sort but might make operations 
more efficient (see point above about some sorts being much faster with 
partially sorted data).
   
   **Order of the Data vs order of the Query**
   
   Currently the only way to get some benefit from ordering is to specify a 
known ordering of the data upfront:
   
   ```sql
   CREATE EXTERNAL TABLE source (
       amount INT NOT NULL,
       price DOUBLE NOT NULL,
       time TIMESTAMP NOT NULL,
       ...
   )
   STORED AS CSV
   WITH ORDER (time ASC)
   WITH ORDER (amount ASC, price ASC)
   LOCATION '/path/to/FILE_NAME.csv'
   OPTIONS ('has_header' 'true');
   ```
   
   (taken from the blog post)
   
   But what if this order doesn't match the desired order of the query? E.g. if 
we want `ORDER BY price`. The physical ordering of the file doesn't help us at 
all here, it's as good as a random scan. We're much better off using min/max 
stats on `price` to re-order the scan so that the file opening is done in ~ the 
order the query wants, and we don't use the ordering within the files at all.
   
   **Inferred order vs Known order**
   
   Related to the point above: users currently have to specify the order of a 
table. I think this could in many cases be inferred from a combination of 
metadata on each file (Parquet for example has metadata to specify the ordering 
of a file) and min/max stats (which give an ordering between files). Using this 
we can reconstruct *exact* or *partial* ordering without users having to 
specify a known order upfront.
   
   **Proposal**
   
   I propose that we move to a system where:
   1. The desired ordering of the query is pushed down into the `TableScan` / 
`TableProvider::scan` so that file opens can be optimized to match the desired 
order of the query, be that an *exact* order if possible or a *partial* order 
if not.
   2. Our default `ListingTable` should use file metadata (min/max stats and 
any ordering information) to arrange files to best match the queries desired 
ordering. The `ExecutionPlan` that `TableProvider::scan` should record if the 
order is *exact* so that subsequent physical optimizers can remove unecessary 
upstream sort operations. The goal should be to produce an *exact* order if 
possible and if not fall back to a best effort *partial* order (which still 
requires an upstream sort).
   3. We should infer ordering information from file metadata instead of 
requiring users to specify a known ordering upfront.
   
   **Implementation**
   
   *Refactor `TableProvider::scan` into `TableProvider::scan_with_args`*: we 
need to make API changes to `scan()` to pass in ordering information. Adding a 
parameter would be a breaking change. Let's take this opportunity to refactor 
the method into something we can change over time while minimizing breaking API 
changes for downstream consumers. 
https://github.com/apache/datafusion/pull/17336
   *Push down sorting information into `TableScan` / `TableProvider`. This 
requires some careful evaluation of optimizer rules, i.e. what can we push down 
sorts through and what can we not. I suggest we start with the obviously okay 
cases and add more complex ones in the future. 
https://github.com/apache/datafusion/pull/17337
   *Connect the two above changes by passing the sort order from `TableScan` 
into `TableProvider::scan_with_args` in our logical -> physical planning step.
   *Rewrite our re-partitioning / sorting to first try to create an exact 
ordering and fall back to partial ordering if not possible, marking the 
resultant `ExecutionPlan` with which one it has.
   *On a per-file-format basis extract ordering information from the files*: I 
think this involves some refactoring to go from e.g. 
`FileFormat::collect_statistics() -> Statistics` to 
`FileFormat::collect_metadata() -> StatisticsAndSortInformation` as well as 
some other tbd work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to