alamb opened a new issue, #9148: URL: https://github.com/apache/arrow-datafusion/issues/9148
### Describe the bug

Reported in Discord by @mispp: https://discord.com/channels/885562378132000778/1166447479609376850/1204163621433639003

> ok people, a performance question if i may... I pulled a ~400mb parquet file from new york taxi drives - for testing. have a simple aggregation that is supposed to sum up a column called trip_time. no group by column is done and it is all performed via dataframe
> this operation lasts for ~2s
> is this expected?
> i saw a video https://youtu.be/NVKujPxwSBA?t=1589 that showed datafusion processed some gigabytes in less than a second

So basically the task here is to reproduce the reported performance and see whether anything is wrong or whether there is anything we could improve.

### To Reproduce

Original report: https://gist.github.com/mispp/229fdad7d70c8ab974a8f72f4bdfc43c

Dataset: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet

Cargo.toml:

```toml
[package]
name = "perf-issue"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1.0"
datafusion = "34"
arrow-schema = "*"
```

Program:

```rust
use std::time::SystemTime;

use datafusion::{
    common::Column,
    execution::{context::SessionContext, options::ParquetReadOptions},
    logical_expr::Expr,
};

#[tokio::main]
async fn main() {
    let start = SystemTime::now();

    let ctx = SessionContext::new();
    let read_options = ParquetReadOptions {
        file_extension: ".parquet",
        table_partition_cols: vec![],
        parquet_pruning: Some(true),
        skip_metadata: Some(false),
        schema: None,
        file_sort_order: vec![],
    };

    // SUM(trip_time) with no GROUP BY columns.
    let analysis_expressions: Vec<Expr> = vec![datafusion::logical_expr::expr_fn::sum(
        Expr::Column(Column::from_name("trip_time")),
    )];
    let group_expressions: Vec<Expr> = vec![];

    println!("just before df -> {}", start.elapsed().unwrap().as_millis());

    let df = ctx
        .read_parquet("./fhvhv_tripdata_2023-01.parquet", read_options)
        .await
        .unwrap();
    println!("reading df -> {}", start.elapsed().unwrap().as_millis());

    let df_aggregated = df
        .aggregate(group_expressions, analysis_expressions)
        .unwrap()
        .collect()
        .await;
    println!("df aggregation -> {}", start.elapsed().unwrap().as_millis());
    println!("results -> {:?}", df_aggregated);
}
```

### Expected behavior

_No response_

### Additional context

_No response_
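As a cross-check (not part of the original report), the same aggregation can be run through the SQL interface, which should end up with essentially the same physical plan and timing. Below is a minimal sketch; the table name `trips` is arbitrary and the file path is assumed to be the same one used above. Any timing comparison should be done with `cargo run --release`, since debug builds of DataFusion are typically several times slower and could plausibly account for a multi-second result on a ~400 MB file.

```rust
use std::time::Instant;

use datafusion::error::Result;
use datafusion::execution::context::SessionContext;
use datafusion::execution::options::ParquetReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register the same parquet file as a table so it can be queried via SQL.
    ctx.register_parquet(
        "trips",
        "./fhvhv_tripdata_2023-01.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Time only the aggregation, not the metadata read above.
    let start = Instant::now();
    let batches = ctx
        .sql("SELECT SUM(trip_time) FROM trips")
        .await?
        .collect()
        .await?;
    println!("sql aggregation -> {} ms", start.elapsed().as_millis());
    println!("results -> {:?}", batches);

    Ok(())
}
```

Running the same query against the file with `datafusion-cli` is another quick way to separate library behavior from build-profile effects on the reporter's machine.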
