twitu commented on issue #10572: URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2154211869
I also added a query explanation, and here are the results.

```
plan: Analyze
  Sort: data.ts_init ASC NULLS LAST
    Projection: data.bid, data.ask, data.bid_size, data.ask_size, data.ts_event, data.ts_init
      TableScan: data,
```

```
plan: Analyze
  Projection: data.bid, data.ask, data.bid_size, data.ask_size, data.ts_event, data.ts_init
    TableScan: data,
```

You'll see that the queries with `ORDER BY` have a Sort expression in the plan. It's not clear to me why the plan still has a sort even though the sort order is specified in the configuration. I hope the optimizations you've mentioned will take this into account.

```
file_sort_order: vec![vec![Expr::Sort(Sort {
    expr: Box::new(col("ts_init")),
    asc: true,
    nulls_first: true,
})]],
```

You'll see in the experiments repo that I've implemented binaries for both reading a single row group and reading the whole file. The queries are the same; the only difference is how the resulting stream is consumed. I don't think this is equivalent to adding a `LIMIT` clause, because as far as the query is concerned I'm reading the whole file. It is only the consumer that decides to stop after reading one row group.

> > It seems like order=false and repartition=false seems to be the holy grail of performant, low-memory footprint streaming in sorted order.
>
> I would expect those settings will be the lowest latency (time to first batch)

Surprisingly, these settings also give the lowest latency and memory footprint when reading the full file, as shown in the above table.

If you need an additional contributor for any of the above-mentioned issues, I'm happy to help :smile:
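To illustrate why a consumer that stops early still benefits from a plan without a Sort, here is a minimal, standard-library-only sketch. It is not DataFusion code: `Batch`, `scan`, `passthrough`, `blocking_sort`, and `reads_for_first_batch` are hypothetical names made up for this example, and the "row groups" are just small vectors of `ts_init` values. It assumes the sort is a blocking operator that drains its input before emitting anything, and it counts how many row groups are read before the first batch reaches the consumer.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for one row group: a batch of `ts_init` values.
type Batch = Vec<u64>;

/// Lazily yields batches and counts how many were actually read from the source.
fn scan(batches: Vec<Batch>, reads: Rc<Cell<usize>>) -> impl Iterator<Item = Batch> {
    batches.into_iter().map(move |b| {
        reads.set(reads.get() + 1);
        b
    })
}

/// No sort step: trusts that the input is already ordered and forwards it as-is.
fn passthrough(input: impl Iterator<Item = Batch>) -> impl Iterator<Item = Batch> {
    input
}

/// Blocking sort (an assumption of this sketch): drains the whole input,
/// sorts all rows, then re-chunks them into batches of `batch_size`.
fn blocking_sort(
    input: impl Iterator<Item = Batch>,
    batch_size: usize,
) -> impl Iterator<Item = Batch> {
    let mut rows: Vec<u64> = input.flatten().collect();
    rows.sort_unstable();
    let chunks: Vec<Batch> = rows.chunks(batch_size.max(1)).map(|c| c.to_vec()).collect();
    chunks.into_iter()
}

/// Pulls only the first batch, like a consumer that stops after one row group,
/// and reports how many row groups had to be read to produce it.
fn reads_for_first_batch(with_sort: bool) -> (Batch, usize) {
    let reads = Rc::new(Cell::new(0));
    let data = vec![vec![1, 2], vec![3, 4], vec![5, 6]]; // already sorted input
    let source = scan(data, Rc::clone(&reads));
    let first = if with_sort {
        blocking_sort(source, 2).next()
    } else {
        passthrough(source).next()
    };
    (first.unwrap_or_default(), reads.get())
}

fn main() {
    let (batch, reads) = reads_for_first_batch(false);
    println!("no sort: first batch {:?} after {} row group read(s)", batch, reads);
    let (batch, reads) = reads_for_first_batch(true);
    println!("sort:    first batch {:?} after {} row group read(s)", batch, reads);
}
```

In this toy model both variants hand the consumer the same first batch, but the sorted variant has to read every row group first. A `LIMIT` would change what the query asks for, whereas here the query is unchanged and only the consumer stops pulling.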
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org