twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2380598250
# Sorted row groups not read in-order when FILTER clause is used
I'm seeing another issue with datafusion streaming sorted records from disk
with a filter query in the SQL. The code and instructions to reproduce the
issue are here[^1].To briefly recount the context.
I have a parquet file is sorted on the `ts_init` column. We want to stream
data from the file in ascending order of `ts_init`. However, we do not want to
sort the data in-memory since it is already sorted.
* Parquet file sorted on a column
* Stream row groups in constant memory
* No additional sorting should be done
To achieve this I use the following datafusion configuration.
```rust
let session_cfg =
SessionConfig::new().set_str("datafusion.optimizer.repartition_file_scans",
"false");
let session_ctx = SessionContext::new_with_config(session_cfg);
let parquet_options = ParquetReadOptions::<'_> {
skip_metadata: Some(false),
file_sort_order: vec![vec![datafusion_expr::SortExpr {
expr: col("ts_init"),
asc: true,
nulls_first: false,
}]],
..Default::default()
};
```
* set `datafusion.optimizer.repartition_file_scans` to false
* Add sort order in parquet read options
* Don't use `SORT BY` clause in query
This works well when there is no filter clause in the query. The below
commands will pass. `SELECT * FROM data`
```
cargo run 127-groups.parquet > 127-groups-rust.csv
python check_invariant.py 127-groups-rust.csv
```
However, when there is a filter clause in the query. The row groups are not
read in-order causing the ascending order check to FAIL. The row groups are
read out of order. `SELECT * FROM data where ts_init >= 1701388832486000000 AND
ts_init <= 1701392194001000000`
```
cargo run 127-groups.parquet filter > 127-groups-rust.csv
python check_invariant.py 127-groups-rust.csv #WILL FAIL
```
[^1]:
https://github.com/nautechsystems/nautilus_experiments/tree/datafusion-filter-bug
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]