alamb commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2170404296

   Hi @twitu  -- I am very sorry for the delay in responding -- I have been 
traveling for sever
   
   > You'll see that the queries with ORDER BY have a Sort expression in the 
plan. It's not clear to me why despite specifying the sort order in the 
configuration the plan still has a sort. I hope the optimizations you've 
mentioned will take this into account.
   
   One thing that might be going on is that the null ordering doesn't seem to 
match between your plan and your code.
   
   In your plan the sort is putting nulls last:
   ```
         Sort: data.ts_init ASC NULLS LAST
   ```
   
   but in your code you specify NULLS FIRST:
   
   ```rust
           file_sort_order: vec![vec![Expr::Sort(Sort {
               expr: Box::new(col("ts_init")),
               asc: true,
               nulls_first: true,
           })]],
   ```
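
   To illustrate why that mismatch would matter, here is a minimal sketch using 
only the standard library. `SortKey` and `satisfies` are hypothetical names made 
up for this comment, not DataFusion's types or its actual ordering check. The 
assumption is that a declared file ordering can only replace a sort when the 
column, the direction, and the null placement all agree:

```rust
/// Hypothetical, simplified stand-in for one sort key.
/// This is NOT DataFusion's own type, just an illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub asc: bool,
    pub nulls_first: bool,
}

impl SortKey {
    pub fn new(column: &str, asc: bool, nulls_first: bool) -> Self {
        Self {
            column: column.to_string(),
            asc,
            nulls_first,
        }
    }
}

/// Returns true when the ordering declared for the file already provides the
/// ordering the query asks for. In this sketch the required keys must be a
/// prefix of the declared keys, and each key must match on column, direction,
/// and null placement.
pub fn satisfies(declared: &[SortKey], required: &[SortKey]) -> bool {
    required.len() <= declared.len()
        && declared.iter().zip(required).all(|(d, r)| d == r)
}
```

   With the values from this issue, `satisfies(&[SortKey::new("ts_init", true, true)], &[SortKey::new("ts_init", true, false)])` 
is false, so a sort would still be needed. If this is indeed the cause, declaring 
`nulls_first: false` in `file_sort_order` would make the two agree.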
   
   > I don't think this is equivalent to adding a LIMIT clause because for the 
purpose of the query I'm reading the whole file. It is only that the consumer 
decides to stop after reading one row group.
   
   DataFusion is a streaming engine, so if you open a parquet file, read one 
batch, and stop, then the entire file will not be read (the batches are 
basically created on demand).
   
   There are certain "pipeline breaking" operators that do require reading the 
entire input, such as `Sort` and `GroupHashAggregate`, which is why I think you 
are seeing the entire file read when your query has a sort.
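
   The difference can be sketched with plain Rust iterators. This is a toy 
model, not DataFusion code: `scan`, `first_batch`, and `sorted_first_row` are 
hypothetical names, and the counter only exists to show how many batches the 
"scan" was asked to produce:

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a file scan: yields `total` batches lazily and
/// counts how many were actually produced.
pub fn scan(total: usize, produced: Rc<Cell<usize>>) -> impl Iterator<Item = Vec<i64>> {
    (0..total).map(move |i| {
        produced.set(produced.get() + 1);
        // Each "batch" holds two values; later batches hold smaller values,
        // so the input is deliberately not sorted ascending.
        let base = ((total - i) * 10) as i64;
        vec![base, base + 1]
    })
}

/// Streaming consumer: pulls a single batch and stops.
pub fn first_batch(mut input: impl Iterator<Item = Vec<i64>>) -> Option<Vec<i64>> {
    input.next()
}

/// Pipeline-breaking consumer: it has to see every batch before it can emit
/// even the first sorted row.
pub fn sorted_first_row(input: impl Iterator<Item = Vec<i64>>) -> Option<i64> {
    let mut all: Vec<i64> = input.flatten().collect();
    all.sort();
    all.first().copied()
}
```

   Calling `first_batch(scan(100, counter))` leaves the counter at 1, while 
`sorted_first_row(scan(100, counter))` drives it to 100, even though the caller 
only wanted one row in both cases.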
   
   > If you need an additional contributor in any of the above mentioned 
issues, I'm happy to help 😄
   
   
   We are always looking for contributors -- anything you can do to help others 
would be most appreciated. For example, perhaps you can add an example to 
`datafusion-examples` 
https://github.com/apache/datafusion/tree/main/datafusion-examples showing how 
to use a pre-sorted input file to avoid sorting during the query (assuming that 
you can actually get that working).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

