GitHub user mispp closed a discussion: Performance issue when loading 6.5gb 
parquet file into memory

Is a huge performance difference between polars and datafusion expected when
loading a ~6.5 GB Parquet file into memory?
Datafusion's `.cache()` method takes ~2 minutes. Loading the same data with
polars takes ~15 s.

> polars start -> 2023-07-10T22:43:25.690623200+02:00
> polars end -> 2023-07-10T22:43:40.854580400+02:00
> datafusion start -> 2023-07-10T22:43:41.363312400+02:00
> datafusion end -> 2023-07-10T22:45:32.949019300+02:00
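
For reference, the elapsed times implied by these timestamps can be checked with a small std-only sketch. The helper names `nanos_since_midnight` and `elapsed_nanos` are made up for illustration, and the parsing assumes the exact fixed-width layout shown above, with start and end on the same day and at the same UTC offset.

```rust
/// Parse the time-of-day part of a timestamp such as
/// "2023-07-10T22:43:25.690623200+02:00" into nanoseconds since midnight.
/// Assumption: fixed-width layout with nine fractional digits; the date
/// and the UTC offset are ignored.
fn nanos_since_midnight(ts: &str) -> u64 {
    // Bytes 11..29 hold "HH:MM:SS.fffffffff".
    let time = &ts[11..29];
    let hours: u64 = time[0..2].parse().unwrap();
    let minutes: u64 = time[3..5].parse().unwrap();
    let seconds: u64 = time[6..8].parse().unwrap();
    let nanos: u64 = time[9..18].parse().unwrap();
    (hours * 3600 + minutes * 60 + seconds) * 1_000_000_000 + nanos
}

/// Elapsed nanoseconds between two timestamps on the same day.
fn elapsed_nanos(start: &str, end: &str) -> u64 {
    nanos_since_midnight(end) - nanos_since_midnight(start)
}

fn main() {
    let polars = elapsed_nanos(
        "2023-07-10T22:43:25.690623200+02:00",
        "2023-07-10T22:43:40.854580400+02:00",
    );
    let datafusion = elapsed_nanos(
        "2023-07-10T22:43:41.363312400+02:00",
        "2023-07-10T22:45:32.949019300+02:00",
    );
    // prints "polars: 15.16 s"
    println!("polars: {:.2} s", polars as f64 / 1e9);
    // prints "datafusion: 111.59 s"
    println!("datafusion: {:.2} s", datafusion as f64 / 1e9);
}
```

That works out to roughly 15 s for polars and just under 112 s for datafusion, consistent with the rounded figures above.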

A minimal working example is below.

Both are run with plain `cargo run` (a debug build), in case the lack of
`--release` makes a difference.
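
Because the build profile may matter here, the following is a minimal sketch of a timing helper that also reports whether the binary is a debug build. It uses only the standard library; `timed` is a hypothetical helper, not part of polars or datafusion, and the summing closure is only a stand-in workload.

```rust
use std::time::{Duration, Instant};

/// Run `f`, print how long it took, and return its result with the duration.
/// `timed` is an illustrative helper, not an API of polars or datafusion.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    let elapsed = start.elapsed();
    println!("{label} took {elapsed:?}");
    (value, elapsed)
}

fn main() {
    // `debug_assertions` is enabled by default for `cargo run` and
    // disabled by default for `cargo run --release`.
    if cfg!(debug_assertions) {
        println!("note: debug build, timings may not be representative");
    }

    // Stand-in workload; a real comparison would wrap the Parquet loads.
    let (sum, _elapsed) = timed("sum", || (0..10_000_000u64).sum::<u64>());
    println!("sum = {sum}");
}
```

Running the same binary once with `cargo run` and once with `cargo run --release` shows how much of a measured gap comes from the build profile rather than from the libraries.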

Code:
```
use std::io::Error;

use datafusion::prelude::*;
use polars::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Time the polars load first, then the datafusion load.
    let _ = _dataframe2();

    let _ = _datafusion().await;

    Ok(())
}

/// Reads the Parquet file with datafusion and times `cache()`, which
/// materializes the DataFrame as an in-memory table.
pub async fn _datafusion() {
    let _ctx = SessionContext::new();

    let _read_options = ParquetReadOptions {
        file_extension: ".parquet",
        table_partition_cols: vec![],
        parquet_pruning: None,
        skip_metadata: None,
    };
    let _df = _ctx
        .read_parquet("/mnt/d/Projects/testdf/data/test_data.parquet", _read_options)
        .await
        .unwrap();

    println!("datafusion start -> {:?}", chrono::offset::Local::now());

    let _cached = _df.cache().await;

    println!("datafusion end -> {:?}", chrono::offset::Local::now());
}

/// Reads the whole Parquet file eagerly with polars and times the load.
pub fn _dataframe2() -> Result<String, PolarsError> {
    let mut file =
        std::fs::File::open("/mnt/d/Projects/testdf/data/test_data.parquet").unwrap();

    println!("polars start -> {:?}", chrono::offset::Local::now());

    let _df = ParquetReader::new(&mut file).finish().unwrap();

    println!("polars end -> {:?}", chrono::offset::Local::now());

    Ok("done".to_string())
}
```


Cargo.toml
```
[package]
name = "testdf"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
parquet = "40.0.0"
polars = { version = "0.30.0", features = [
    "lazy", "temporal", "describe", "json", "parquet", "dtype-datetime",
    "dtype-categorical", "sql", "streaming", "serde-lazy", "ipc",
    "dynamic_groupby", "sort_multiple", "rows", "dataframe_arithmetic",
    "partition_by",
] }
serde = "1.0.163"
serde_json = "1.0.96"
connectorx = { version = "0.3.1", features = ["src_postgres", "dst_arrow", "dst_arrow2"] }
datafusion = "27.0.0"
tokio = "1.0"
chrono = "0.4.26"
```


GitHub link: https://github.com/apache/datafusion/discussions/6908
