alamb commented on issue #10572: URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2153478083
Hi @twitu -- thanks for this. Some comments:

I took a quick peek at https://github.com/nautechsystems/nautilus_experiments/tree/efficient-query and it looks like it reads only a single batch out: https://github.com/nautechsystems/nautilus_experiments/blob/a4ceb950de3b4bbc43ec82b64ee1495d077f5116/src/bin/single_row_group.rs#L45

This means you are running the equivalent of `SELECT ... LIMIT 4000` or something similar.

> It seems like order=false and repartition=false seems to be the holy grail of performant, low-memory footprint streaming in sorted order.

I would expect those settings to give the lowest latency (time to first batch).

> However, As you can see, even after specifying the sort order of the file the ORDER BY query still loads the whole file and does some kind of sort operation.

I would expect that it loads the first row group of each file and begins merging them together (e.g. if you did an `EXPLAIN ...` on your SQL you would see a `SortPreservingMerge` in the physical plan).

@suremarc and @matthewmturner and others have been working on optimizing a similar case -- see https://github.com/apache/datafusion/issues/7490 for example. We have other items tracked in https://github.com/apache/datafusion/issues/10313 (notably we have a way to avoid opening all the files if we know the data is already sorted and doesn't overlap: https://github.com/apache/datafusion/issues/10316).

cc @NGA-TRAN
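
To illustrate the "load the first row group of each file and begin merging" expectation above, here is a minimal, standard-library-only sketch of a lazy k-way merge over already-sorted inputs. It is a conceptual illustration only, not DataFusion's `SortPreservingMerge` implementation: the names `SortedMerge` and `merge_first_n` are hypothetical, each "file" is modelled as an in-memory `Vec<i64>` of sorted timestamps, and rows are merged one at a time rather than in Arrow batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Lazily merges already-sorted inputs into one sorted stream.
/// Only the current head of each input is held in the heap.
struct SortedMerge<I: Iterator<Item = i64>> {
    inputs: Vec<I>,
    // Min-heap of (value, index of the input it came from).
    heap: BinaryHeap<Reverse<(i64, usize)>>,
}

impl<I: Iterator<Item = i64>> SortedMerge<I> {
    fn new(mut inputs: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        // Prime the heap with the first element of every input.
        for (idx, input) in inputs.iter_mut().enumerate() {
            if let Some(value) = input.next() {
                heap.push(Reverse((value, idx)));
            }
        }
        SortedMerge { inputs, heap }
    }
}

impl<I: Iterator<Item = i64>> Iterator for SortedMerge<I> {
    type Item = i64;

    fn next(&mut self) -> Option<i64> {
        // Emit the smallest head, then refill from the same input.
        let Reverse((value, idx)) = self.heap.pop()?;
        if let Some(next) = self.inputs[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(value)
    }
}

/// Hypothetical helper: merge sorted "files" and stop after `n` rows,
/// analogous to reading a single batch from an ordered stream.
fn merge_first_n(files: Vec<Vec<i64>>, n: usize) -> Vec<i64> {
    let inputs: Vec<_> = files.into_iter().map(|f| f.into_iter()).collect();
    SortedMerge::new(inputs).take(n).collect()
}

fn main() {
    let files: Vec<Vec<i64>> = vec![
        vec![1, 4, 7, 10],
        vec![2, 5, 8, 11],
        vec![3, 6, 9, 12],
    ];
    // Only the first few elements of each input are ever consumed.
    println!("{:?}", merge_first_n(files, 5)); // prints [1, 2, 3, 4, 5]
}
```

Because the merge is pull-based, stopping after the first `n` rows (the `LIMIT`-like case) touches only a prefix of each input, which is why time to first batch can be low when the inputs are known to be sorted.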