vincev commented on issue #4973: URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1618619316
I am adding this comment here instead of opening a new issue as it looks the above PRs may fix/improve but please let me know if you would like a new issue. I am experimenting with using datafusion instead of Polars as query engine for dply-rs and I have noticed that group by median is more than ~20x slower (I run these a few times the timing is consistent): The Polars version: ``` $ time ./target/release/polars-median shape: (6, 3) ┌──────────────┬──────────────┬──────────┐ │ payment_type ┆ total_amount ┆ n │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ u32 │ ╞══════════════╪══════════════╪══════════╡ │ 5 ┆ 5.275 ┆ 6 │ │ 3 ┆ 7.7 ┆ 194323 │ │ 4 ┆ -6.8 ┆ 244364 │ │ 0 ┆ 23.0 ┆ 1368303 │ │ 2 ┆ 13.3 ┆ 7763339 │ │ 1 ┆ 16.56 ┆ 30085763 │ └──────────────┴──────────────┴──────────┘ real 0m0.551s user 0m1.654s sys 0m0.253s ``` datafusion version: ``` $ time ./target/release/datafusion-median +--------------+--------------+----------+ | payment_type | total_amount | n | +--------------+--------------+----------+ | 5 | 5.275 | 6 | | 3 | 7.7 | 194323 | | 4 | -6.8 | 244364 | | 0 | 23.0 | 1368303 | | 2 | 13.3 | 7763339 | | 1 | 16.56 | 30085763 | +--------------+--------------+----------+ real 0m22.681s user 1m54.036s sys 0m3.001s ``` This is using one year of data from the [nyctaxi](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) dataset. The Datafusion code: ```rust use anyhow::Result; use datafusion::prelude::*; #[tokio::main] async fn main() -> Result<()> { let ctx = SessionContext::new(); let df = ctx .read_parquet("./data/*.parquet", Default::default()) .await?; df.aggregate( vec![col("payment_type")], vec![ median(col("total_amount")).alias("total_amount"), count(lit(1)).alias("n"), ], )? .sort(vec![col("n").sort(true, false)])? .show() .await?; Ok(()) } ``` The Polars code: ```rust use anyhow::Result; use polars::prelude::*; fn main() -> Result<()> { let lazy = LazyFrame::scan_parquet("./data/*.parquet", ScanArgsParquet::default())?; let df = lazy .groupby([col("payment_type")]) .agg([col("total_amount").median(), col("total_amount").count().alias("n")]) .sort("n", Default::default()) .collect()?; println!("{}", df); Ok(()) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
