[I] DataFusion performance problem (or optimization opportunity?) [arrow-datafusion]

via GitHub Wed, 07 Feb 2024 05:35:31 -0800


alamb opened a new issue, #9148:
URL: https://github.com/apache/arrow-datafusion/issues/9148


   ### Describe the bug
   
   Reported on Discord by @mispp: https://discord.com/channels/885562378132000778/1166447479609376850/1204163621433639003
   
   > ok people, a performance question if i may... I pulled a ~400mb parquet file from new york taxi drives- for testing. have a simple aggregation that is supposed to sum up a column called trip_time. no group by column is done and it is all performed via dataframe
   > this operation lasts for ~2s
   > is this expected?
   
   
   > i saw a video https://youtu.be/NVKujPxwSBA?t=1589 that showed datafusion processed some gigabytes in less than a second
   
   The task here is to reproduce the reported performance and determine whether something is wrong or whether there is something we could improve.
   
   
   
   
   ### To Reproduce
   
   Original report: https://gist.github.com/mispp/229fdad7d70c8ab974a8f72f4bdfc43c
   
   Dataset: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
   
   Cargo.toml
   ```toml
   [package]
   name = "perf-issue"
   version = "0.1.0"
   edition = "2021"
   
   [dependencies]
   tokio = { version = "1", features = ["full"] }
   serde = { version = "1", features = ["derive"] }
   serde_json = "1.0"
   datafusion = "34"
   arrow-schema = "*"
   ```
   
   Program:
   ```rust
   use std::time::SystemTime;
   
   use datafusion::{
       common::Column,
       execution::{context::SessionContext, options::ParquetReadOptions},
       logical_expr::Expr
   };
   
   
   #[tokio::main]
   async fn main() {
       let start = SystemTime::now();
   
       let _ctx = SessionContext::new();
       let _read_options = ParquetReadOptions {
           file_extension: ".parquet",
           table_partition_cols: vec!(),
           parquet_pruning: Some(true),
           skip_metadata: Some(false),
           schema: None,
           file_sort_order: vec![]
       };
   
   
   
       let analysis_expressions: Vec<Expr> = [
           datafusion::logical_expr::expr_fn::sum(Expr::Column(Column::from_name("trip_time")))
       ].to_vec();
       let group_expressions: Vec<Expr> = [].to_vec();
   
       println!("just before df -> {}", start.elapsed().unwrap().as_millis());
   
       let df = _ctx.read_parquet("./fhvhv_tripdata_2023-01.parquet", _read_options).await.unwrap();
       println!("reading df -> {}", start.elapsed().unwrap().as_millis());
   
       let df_aggregated = df.aggregate(group_expressions, analysis_expressions).unwrap().collect().await;
       println!("df aggregation -> {}", start.elapsed().unwrap().as_millis());
   
       println!("results -> {:?}", df_aggregated);
   
   }
   ```
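
The program above prints the cumulative time since `start` at each step, so the cost of an individual stage has to be worked out by subtraction. When reproducing the numbers it may help to record each stage separately. The following is a minimal sketch using only the standard library; `StageTimer` and its methods are hypothetical names introduced here for illustration and are not part of DataFusion or of the original report.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the original report): records how long
/// each named stage took, instead of the cumulative time since start.
pub struct StageTimer {
    last: Instant,
    stages: Vec<(String, Duration)>,
}

impl StageTimer {
    /// Start timing; the first stage begins now.
    pub fn new() -> Self {
        Self {
            last: Instant::now(),
            stages: Vec::new(),
        }
    }

    /// Close the current stage under `name`, return its duration,
    /// and start timing the next stage.
    pub fn lap(&mut self, name: &str) -> Duration {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last);
        self.last = now;
        self.stages.push((name.to_string(), elapsed));
        elapsed
    }

    /// All recorded stages, in the order they were closed.
    pub fn stages(&self) -> &[(String, Duration)] {
        &self.stages
    }

    /// Sum of all recorded stage durations.
    pub fn total(&self) -> Duration {
        self.stages.iter().map(|(_, d)| *d).sum()
    }
}
```

In the reproduction program, one would create the timer at the top of `main`, call `timer.lap("read_parquet")` after the `read_parquet(...).await` line and `timer.lap("aggregate")` after `collect().await`. `Instant` is monotonic, which makes it a better fit for measuring elapsed time than `SystemTime`.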
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


