twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2380598250

   # Sorted row groups not read in-order when FILTER clause is used
   
   I'm seeing another issue with datafusion streaming sorted records from disk 
with a filter query in the SQL. The code and instructions to reproduce the 
issue are here[^1].To briefly recount the context.
   
   I have a parquet file is sorted on the `ts_init` column. We want to stream 
data from the file in ascending order of `ts_init`. However, we do not want to 
sort the data in-memory since it is already sorted.
   
   * Parquet file sorted on a column
   * Stream row groups in constant memory
   * No additional sorting should be done
   
   To achieve this I use the following datafusion configuration.
   
   ```rust
       let session_cfg =
           
SessionConfig::new().set_str("datafusion.optimizer.repartition_file_scans", 
"false");
       let session_ctx = SessionContext::new_with_config(session_cfg);
       let parquet_options = ParquetReadOptions::<'_> {
           skip_metadata: Some(false),
           file_sort_order: vec![vec![datafusion_expr::SortExpr {
               expr: col("ts_init"),
               asc: true,
               nulls_first: false,
           }]],
           ..Default::default()
       };
   ```
   
   * set `datafusion.optimizer.repartition_file_scans` to false
   * Add sort order in parquet read options
   * Don't use `SORT BY` clause in query
   
   This works well when there is no filter clause in the query. The below 
commands will pass. `SELECT * FROM data`
   
   ```
   cargo run 127-groups.parquet > 127-groups-rust.csv
   python check_invariant.py 127-groups-rust.csv
   ```
   
   However, when there is a filter clause in the query. The row groups are not 
read in-order causing the ascending order check to FAIL. The row groups are 
read out of order. `SELECT * FROM data where ts_init >= 1701388832486000000 AND 
ts_init <= 1701392194001000000`
   
   ```
   cargo run 127-groups.parquet filter > 127-groups-rust.csv
   python check_invariant.py 127-groups-rust.csv #WILL FAIL
   ```
   
   [^1]: 
https://github.com/nautechsystems/nautilus_experiments/tree/datafusion-filter-bug


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to