alamb commented on issue #10572: URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2170404296
Hi @twitu -- I am very sorry for the delay in responding -- I have been traveling for sever

> You'll see that the queries with ORDER BY have a Sort expression in the plan. It's not clear to me why despite specifying the sort order in the configuration the plan still has a sort. I hope the optimizations you've mentioned will take this into account.

One thing that might be going on is that the null ordering doesn't seem to match. In your plan the sort is putting nulls last:

```
Sort: data.ts_init ASC NULLS LAST
```

but in your code you specify nulls first:

```rust
file_sort_order: vec![vec![Expr::Sort(Sort {
    expr: Box::new(col("ts_init")),
    asc: true,
    nulls_first: true,
})]],
```

> I don't think this is equivalent to adding a LIMIT clause because for the purpose of the query I'm reading the whole file. It is only that the consumer decides to stop after reading one row group.

DataFusion is a streaming engine, so if you open a parquet file, read one batch, and stop, the entire file will not be read (the batches are basically created on demand).

There are certain "pipeline breaking" operators that do require reading the entire input, such as `Sort` and `GroupHashAggregate`, which is why I think you are seeing the entire file read when your query has a sort.

> If you need an additional contributor in any of the above mentioned issues, I'm happy to help 😄

We are always looking for contributors -- anything you can do to help others would be most appreciated. For example, perhaps you can add an example to `datafusion-examples` (https://github.com/apache/datafusion/tree/main/datafusion-examples) showing how to use a pre-sorted input file to avoid sorting during the query (assuming that you can actually get that working).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
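The difference between a streaming consumer and a "pipeline breaking" operator described in the comment above can be sketched with a small standard-library model. This is not DataFusion code and does not use its API: `Batch`, `LazySource`, and the helper functions are hypothetical names made up for illustration, and the "file" is just a counter that records how many batches were actually produced.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a record batch: just a vector of timestamps.
type Batch = Vec<i64>;

/// A lazy source that builds each batch only when the consumer asks for it
/// and counts how many batches were actually produced.
struct LazySource {
    next: usize,
    total: usize,
    produced: Rc<Cell<usize>>,
}

impl Iterator for LazySource {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        if self.next == self.total {
            return None;
        }
        let start = (self.next * 3) as i64;
        self.next += 1;
        self.produced.set(self.produced.get() + 1);
        Some(vec![start, start + 1, start + 2])
    }
}

/// Creates a source with `total` batches plus a shared counter of batches read.
fn source(total: usize) -> (LazySource, Rc<Cell<usize>>) {
    let produced = Rc::new(Cell::new(0));
    let src = LazySource { next: 0, total, produced: Rc::clone(&produced) };
    (src, produced)
}

/// Streaming consumer: pulls one batch and stops.
fn first_batch(mut input: impl Iterator<Item = Batch>) -> Option<Batch> {
    input.next()
}

/// Pipeline-breaking consumer: a sort has to see every row before it can
/// emit the first one, so it drains the whole input.
fn sorted_first_row(input: impl Iterator<Item = Batch>) -> Option<i64> {
    let mut rows: Vec<i64> = input.flatten().collect();
    rows.sort();
    rows.first().copied()
}

/// Number of batches read when the consumer stops after one batch.
fn batches_read_streaming(total: usize) -> usize {
    let (src, produced) = source(total);
    let _ = first_batch(src);
    produced.get()
}

/// Number of batches read when a sort sits between source and consumer.
fn batches_read_with_sort(total: usize) -> usize {
    let (src, produced) = source(total);
    let _ = sorted_first_row(src);
    produced.get()
}

fn main() {
    println!("streaming: {} batch(es) read", batches_read_streaming(10));
    println!("with sort: {} batch(es) read", batches_read_with_sort(10));
}
```

In this model the streaming consumer touches only one of the ten batches, while the sorting consumer touches all ten even though it returns a single row. That mirrors why a plan that still contains a `Sort` reads the whole file.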
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org