[GitHub] [arrow-datafusion] alamb opened a new issue, #6983: [DataFrame] Parallel Load into dataframe

via GitHub Sun, 16 Jul 2023 05:51:55 -0700


alamb opened a new issue, #6983:
URL: https://github.com/apache/arrow-datafusion/issues/6983


   ### Is your feature request related to a problem or challenge?
   
   When loading data into a DataFusion `DataFrame` via 
[SessionContext::read_parquet](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_parquet),
 only a single core is used, even when many cores are 
available.
   
   This leads to slower performance, as reported by @mispp on 
https://github.com/apache/arrow-datafusion/discussions/6908
   
   # Reproducer
   
   Create data using
   
   ```shell
   cd datafusion/benchmarks
   ./bench.sh data tpch10
   ```
   
   Then load the file into a `DataFrame` and time the call to `cache()`:
   
   ```rust
   use std::{io::Error, time::Instant};
   use datafusion::prelude::*;
   use chrono;
   
   const FILENAME: &str =
       "/Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet";
   
   #[tokio::main]
   async fn main() -> Result<(), Error> {
       env_logger::init();
       {
           let _ = _datafusion().await;
       }
   
       Ok(())
   }
   
   pub async fn _datafusion() {
       let _ctx = SessionContext::new();
   
       let _read_options = ParquetReadOptions {
           file_extension: ".parquet",
           table_partition_cols: vec![],
           parquet_pruning: None,
           skip_metadata: None,
       };
       let _df = _ctx.read_parquet(FILENAME, _read_options).await.unwrap();
   
       let start = Instant::now();
       println!("datafusion start -> {:?}", chrono::offset::Local::now());
   
       let _cached = _df.cache().await;
       let elapsed = Instant::now() - start;
       println!(
           "datafusion end -> {:?} {elapsed:?}",
           chrono::offset::Local::now()
       );
   }
   ```
   
   Cargo.toml
   ```toml
   
   # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
   
   [package]
   name = "perf_test"
   version = "0.1.0"
   edition = "2021"
   
   [dependencies]
   env_logger = "0.10.0"
   
   parquet = "40.0.0"
   serde = "1.0.163"
   serde_json = "1.0.96"
   datafusion = "27.0.0"
   tokio = "1.0"
   chrono = "0.4.26"
   ```
   
   ### Describe the solution you'd like
   
   I would like DataFusion to read the Parquet file in parallel, using the 
`target_partitions` config parameter:
   
   
https://docs.rs/datafusion/latest/datafusion/config/struct.ExecutionOptions.html#structfield.target_partitions
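
To make the request concrete, here is a minimal sketch of one way a single file could be divided into units of work, one per target partition. It uses only the standard library; the function name `split_byte_ranges` is hypothetical, and this is not DataFusion's implementation. A real reader would also need to map each range onto the Parquet row groups it covers.

```rust
use std::ops::Range;

/// Hypothetical helper (not a DataFusion API): split a file of `file_len`
/// bytes into at most `target_partitions` contiguous, non-overlapping byte
/// ranges that together cover the whole file.
fn split_byte_ranges(file_len: u64, target_partitions: usize) -> Vec<Range<u64>> {
    if file_len == 0 || target_partitions == 0 {
        return Vec::new();
    }
    let n = target_partitions as u64;
    // Ceiling division, so the number of ranges never exceeds `n`.
    let chunk = (file_len + n - 1) / n;
    let mut ranges = Vec::new();
    let mut start = 0u64;
    while start < file_len {
        // The last range may be shorter than `chunk`.
        let end = (start + chunk).min(file_len);
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

For example, `split_byte_ranges(1000, 4)` returns the ranges `0..250`, `250..500`, `500..750` and `750..1000`, each of which could in principle be scanned by a separate core.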
   
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

