AntoinePrv opened a new issue, #19654:
URL: https://github.com/apache/datafusion/issues/19654

   ### Is your feature request related to a problem or challenge?
   
   The `DataFrame::limit` offset option is not used to skip rows when reading a Parquet file.
   
   Using the reproducer below with `datafusion 51.0` on a MacBook Pro M3, the larger the offset, the longer the query takes to terminate. With a ~50 GB Parquet file, offsetting near the end takes ~20 s to complete.
   Similar Polars code (see Additional context) collects it almost instantly.
   
   <details>
   <summary>Rust offset in large file</summary>
   
   ```rust
   use std::env;
   use std::process;
   use std::time::Instant;
   use datafusion::prelude::*;
   use datafusion::error::Result;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let args: Vec<String> = env::args().collect();
   
       if args.len() != 4 {
           eprintln!("Usage: {} <filename> <offset> <count>", args[0]);
           process::exit(1);
       }
   
       let filename = &args[1];
   
       let offset: usize = match args[2].parse() {
           Ok(n) => n,
           Err(_) => {
               eprintln!("Error: offset must be a positive integer");
               process::exit(1);
           }
       };
   
       let count: usize = match args[3].parse() {
           Ok(n) => n,
           Err(_) => {
               eprintln!("Error: count must be a positive integer");
               process::exit(1);
           }
       };
   
       println!("Filename: {}", filename);
       println!("Offset: {}", offset);
       println!("Count: {}", count);
   
       // configure parquet options
       let config = SessionConfig::new()
           .set_bool("datafusion.execution.parquet.pushdown_filters", true)
           .set_bool("datafusion.execution.parquet.reorder_filters", true)
           .set_bool("datafusion.execution.parquet.enable_page_index", true);
   
       let ctx = SessionContext::new_with_config(config);
   
       // Start timer
       let start = Instant::now();
   
       // read parquet, apply offset then limit (offset, Some(count))
       let df = ctx
           .read_parquet(filename, ParquetReadOptions::default())
           .await?;
       let df = df.limit(offset, Some(count))?;
   
       // collect to Arrow RecordBatches (like to_arrow_table)
       let batches = df.collect().await?;
   
       // Stop timer
       let duration = start.elapsed();
   
       // `batches` is Vec<arrow::record_batch::RecordBatch>
       println!("collected {} record batches", batches.len());
       println!("Execution time: {:?}", duration);
   
       Ok(())
   }
   ```
   
   </details>
   
   
   
   ### Describe the solution you'd like
   
   A significant speedup: ideally the offset is pushed down into the Parquet scan so that rows before it are skipped rather than decoded and discarded.
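   
   For what it's worth, the `parquet` crate that DataFusion reads through already exposes this at the reader level via `ArrowReaderBuilder::with_offset` / `with_limit`. Below is a minimal sketch reading the same slice directly with those (the file name and values are taken from the reproducers; this is an illustration of the reader capability, not a proposed fix):
   
   ```rust
   use std::fs::File;
   
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("large.parquet")?;
       // Skip the first 99_999_744 rows and decode at most 512. The reader can
       // skip row groups and pages that lie entirely before the offset instead
       // of decoding and discarding them.
       let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
           .with_offset(99_999_744)
           .with_limit(512)
           .build()?;
       for batch in reader {
           println!("read {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```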
   
   ### Describe alternatives you've considered
   
   I also tried the SQL API, although only from Python.
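   
   For reference, a minimal Rust sketch of the SQL form (the table name `large` is illustrative; the values mirror the Python snippet in the Additional context):
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       // Register the Parquet file as a table so it can be queried with SQL.
       ctx.register_parquet("large", "large.parquet", ParquetReadOptions::default())
           .await?;
       // LIMIT/OFFSET in SQL lowers to the same logical Limit plan as DataFrame::limit.
       let batches = ctx
           .sql("SELECT * FROM large LIMIT 512 OFFSET 99999744")
           .await?
           .collect()
           .await?;
       println!("collected {} record batches", batches.len());
       Ok(())
   }
   ```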
   
   ### Additional context
   
   <details>
   <summary>Python offset in large file</summary>
   
   ```python
   import datafusion as dn
   
   config = (
       dn.SessionConfig()
       .set("datafusion.execution.parquet.pushdown_filters", "true")
       .set("datafusion.execution.parquet.reorder_filters", "true")
       .set("datafusion.execution.parquet.enable_page_index", "true")
       .set("datafusion.execution.parquet.pushdown_filters", "true")
   )
   
   ctx = dn.SessionContext(config)
   df = ctx.read_parquet("large.parquet").limit(count=512, offset=99999744)
   df.to_arrow_table()
   ```
   
   </details>
   
   <details>
   <summary>Polars reproducer</summary>
   
   ```python
   import os
   
   os.environ["POLARS_MAX_THREADS"] = "1"
   os.environ["RAYON_NUM_THREADS"] = "1"
   
   import polars as pl
   
   df = pl.scan_parquet("large.parquet")
   df.slice(length=512, offset=99999744).collect().to_arrow()
   
   ```
   
   </details>

