vincev commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1618619316

   I am adding this comment here instead of opening a new issue because it 
looks like the above PRs may fix or improve this, but please let me know if you 
would prefer a new issue.
   
   I am experimenting with using DataFusion instead of Polars as the query 
engine for dply-rs, and I have noticed that a group-by median is more than 20x 
slower in DataFusion (I ran each version a few times and the timings are 
consistent):
   
   The Polars version:
   
   ```
   $ time ./target/release/polars-median
   shape: (6, 3)
   ┌──────────────┬──────────────┬──────────┐
   │ payment_type ┆ total_amount ┆ n        │
   │ ---          ┆ ---          ┆ ---      │
   │ i64          ┆ f64          ┆ u32      │
   ╞══════════════╪══════════════╪══════════╡
   │ 5            ┆ 5.275        ┆ 6        │
   │ 3            ┆ 7.7          ┆ 194323   │
   │ 4            ┆ -6.8         ┆ 244364   │
   │ 0            ┆ 23.0         ┆ 1368303  │
   │ 2            ┆ 13.3         ┆ 7763339  │
   │ 1            ┆ 16.56        ┆ 30085763 │
   └──────────────┴──────────────┴──────────┘
   
   real    0m0.551s
   user    0m1.654s
   sys     0m0.253s
   ```
   The DataFusion version:
   
   ```
   $ time ./target/release/datafusion-median
   +--------------+--------------+----------+
   | payment_type | total_amount | n        |
   +--------------+--------------+----------+
   | 5            | 5.275        | 6        |
   | 3            | 7.7          | 194323   |
   | 4            | -6.8         | 244364   |
   | 0            | 23.0         | 1368303  |
   | 2            | 13.3         | 7763339  |
   | 1            | 16.56        | 30085763 |
   +--------------+--------------+----------+
   
   real    0m22.681s
   user    1m54.036s
   sys     0m3.001s
   ```
   
   This is using one year of data from the 
[nyctaxi](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) dataset.
   
   The DataFusion code:
   
   ```rust
   use anyhow::Result;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       let df = ctx
           .read_parquet("./data/*.parquet", Default::default())
           .await?;
       df.aggregate(
           vec![col("payment_type")],
           vec![
               median(col("total_amount")).alias("total_amount"),
               count(lit(1)).alias("n"),
           ],
       )?
       .sort(vec![col("n").sort(true, false)])?
       .show()
       .await?;
       Ok(())
   }
   ```
   
   The Polars code:
   
   ```rust
   use anyhow::Result;
   use polars::prelude::*;
   
   fn main() -> Result<()> {
       let lazy = LazyFrame::scan_parquet("./data/*.parquet", ScanArgsParquet::default())?;
       let df = lazy
           .groupby([col("payment_type")])
           .agg([
               col("total_amount").median(),
               col("total_amount").count().alias("n"),
           ])
           .sort("n", Default::default())
           .collect()?;
       println!("{}", df);
   
       Ok(())
   }
   ```
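
   For reference, the result both programs compute can be sketched in plain 
Rust without either engine. This is only an illustration: the sample rows are 
made up, `median_of` and `grouped_median` are hypothetical helpers, and it is 
not how DataFusion or Polars implement the aggregation. It does show that an 
exact median has to keep every value of a group before it can pick the middle 
one.

   ```rust
   use std::collections::HashMap;

   /// Hypothetical helper: exact median of `values`, averaging the two middle
   /// elements when the count is even. Returns None for an empty slice.
   fn median_of(values: &mut [f64]) -> Option<f64> {
       let n = values.len();
       if n == 0 {
           return None;
       }
       let mid = n / 2;
       // Partial selection instead of a full sort: after this call the element
       // at `mid` is in its sorted position and `lower` holds the smaller ones.
       let (lower, upper, _) = values.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
       let upper = *upper;
       if n % 2 == 1 {
           Some(upper)
       } else {
           // Even count: the other middle value is the largest of `lower`.
           let lower_max = lower.iter().copied().fold(f64::NEG_INFINITY, f64::max);
           Some((lower_max + upper) / 2.0)
       }
   }

   /// Hypothetical helper: group (key, value) rows by key and return
   /// (key, median, count) per group, sorted by count ascending.
   fn grouped_median(rows: &[(i64, f64)]) -> Vec<(i64, f64, usize)> {
       // Every value of every group is buffered until all rows are seen.
       let mut groups: HashMap<i64, Vec<f64>> = HashMap::new();
       for &(key, value) in rows {
           groups.entry(key).or_default().push(value);
       }
       let mut out: Vec<(i64, f64, usize)> = groups
           .into_iter()
           .filter_map(|(key, mut values)| {
               let n = values.len();
               median_of(&mut values).map(|m| (key, m, n))
           })
           .collect();
       out.sort_by_key(|&(key, _, n)| (n, key));
       out
   }

   fn main() {
       // Made-up (payment_type, total_amount) rows, not taken from the dataset.
       let rows = [
           (1, 16.0),
           (2, 13.0),
           (1, 18.0),
           (5, 5.5),
           (2, 5.0),
           (1, 10.0),
       ];
       for (payment_type, total_amount, n) in grouped_median(&rows) {
           println!("{payment_type}\t{total_amount}\t{n}");
       }
   }
   ```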
   