AntoinePrv opened a new issue, #19654:
URL: https://github.com/apache/datafusion/issues/19654
### Is your feature request related to a problem or challenge?
`DataFrame::limit`'s offset option is not used to skip rows when reading a
Parquet file.
Using the reproducer below with `datafusion 51.0` on a MacBook Pro M3, the
larger the offset, the longer it takes to terminate. With a ~50 GB Parquet
file, offsetting near the end takes ~20 s to complete.
Equivalent Polars code (included under Additional context) can collect the
same slice instantly.
<details>
<summary>Rust offset in large file</summary>
```rust
use std::env;
use std::process;
use std::time::Instant;

use datafusion::prelude::*;
use datafusion::error::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 4 {
        eprintln!("Usage: {} <filename> <offset> <count>", args[0]);
        process::exit(1);
    }
    let filename = &args[1];
    let offset: usize = match args[2].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: offset must be a positive integer");
            process::exit(1);
        }
    };
    let count: usize = match args[3].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: count must be a positive integer");
            process::exit(1);
        }
    };
    println!("Filename: {}", filename);
    println!("Offset: {}", offset);
    println!("Count: {}", count);

    // Configure parquet options.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true)
        .set_bool("datafusion.execution.parquet.enable_page_index", true);
    let ctx = SessionContext::new_with_config(config);

    // Start timer.
    let start = Instant::now();

    // Read parquet, then apply limit(offset, Some(count)).
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?;
    let df = df.limit(offset, Some(count))?;

    // Collect to Arrow RecordBatches (like to_arrow_table).
    let batches = df.collect().await?;

    // Stop timer.
    let duration = start.elapsed();

    // `batches` is Vec<arrow::record_batch::RecordBatch>.
    println!("collected {} record batches", batches.len());
    println!("Execution time: {:?}", duration);
    Ok(())
}
```
</details>
### Describe the solution you'd like
A significant speedup: ideally the offset would be pushed down into the Parquet scan so that rows before the offset are skipped rather than decoded. A sketch of how footer metadata could locate the target row groups follows.
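
For intuition: the Parquet footer already stores a row count per row group, so the row groups overlapping the requested `[offset, offset + count)` window can be located without decoding any row before the offset. Below is a minimal sketch using the `parquet` crate directly, not DataFusion internals; the helper name `row_groups_for_window` is made up for illustration.

<details>
<summary>Sketch: locating row groups from footer metadata</summary>

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Illustrative helper: indices of the row groups that overlap rows
/// [offset, offset + count), computed from footer metadata alone.
fn row_groups_for_window(
    path: &str,
    offset: i64,
    count: i64,
) -> Result<Vec<usize>, Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let (start, end) = (offset, offset + count);
    let mut first_row = 0i64; // global index of the current row group's first row
    let mut hits = Vec::new();
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        let next = first_row + rg.num_rows();
        // Keep only row groups intersecting [start, end).
        if next > start && first_row < end {
            hits.push(i);
        }
        first_row = next;
        if first_row >= end {
            break;
        }
    }
    Ok(hits)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path and offsets are illustrative, matching the reproducers above.
    let hits = row_groups_for_window("large.parquet", 99_999_744, 512)?;
    println!("row groups to decode: {:?}", hits);
    Ok(())
}
```
</details>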
### Describe alternatives you've considered
I also tried the SQL API, although only from Python.
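
For reference, here is a hedged Rust sketch of the same query through the SQL interface; the table name `t` is illustrative, and the path and offsets are copied from the Python example below.

<details>
<summary>SQL equivalent (Rust sketch)</summary>

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the file as table `t` (name is illustrative), then apply
    // the offset through SQL's LIMIT ... OFFSET clause.
    ctx.register_parquet("t", "large.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT * FROM t LIMIT 512 OFFSET 99999744").await?;
    let batches = df.collect().await?;
    println!("collected {} record batches", batches.len());
    Ok(())
}
```
</details>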
### Additional context
<details>
<summary>Python offset in large file</summary>
```python
import datafusion as dn

config = (
    dn.SessionConfig()
    .set("datafusion.execution.parquet.pushdown_filters", "true")
    .set("datafusion.execution.parquet.reorder_filters", "true")
    .set("datafusion.execution.parquet.enable_page_index", "true")
)
ctx = dn.SessionContext(config)
df = ctx.read_parquet("large.parquet").limit(count=512, offset=99999744)
df.to_arrow_table()
```
</details>
<details>
<summary>Polars reproducer</summary>
```python
import os
os.environ["POLARS_MAX_THREADS"] = "1"
os.environ["RAYON_NUM_THREADS"] = "1"
import polars as pl
df = pl.scan_parquet("large.parquet")
df.slice(length=512, offset=99999744).collect().to_arrow()
```
</details>