alamb opened a new issue, #6983: URL: https://github.com/apache/arrow-datafusion/issues/6983
### Is your feature request related to a problem or challenge?

When loading data into a DataFusion DataFrame via [SessionContext::read_parquet](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_parquet), only a single core is used even when many cores are available. This leads to slower performance, as reported by @mispp on https://github.com/apache/arrow-datafusion/discussions/6908

# Reproducer

Create data using

```shell
cd datafusion/benchmarks
./bench.sh data tpch10
```

Then load the data:

```rust
use std::{io::Error, time::Instant};
use datafusion::prelude::*;
use chrono;

const FILENAME: &str = "/Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet";

#[tokio::main]
async fn main() -> Result<(), Error> {
    env_logger::init();
    {
        let _ = _datafusion().await;
    }
    Ok(())
}

pub async fn _datafusion() {
    let _ctx = SessionContext::new();
    let _read_options = ParquetReadOptions {
        file_extension: ".parquet",
        table_partition_cols: vec![],
        parquet_pruning: None,
        skip_metadata: None,
    };
    let _df = _ctx.read_parquet(FILENAME, _read_options).await.unwrap();

    // Time only the step that loads the file into memory.
    let start = Instant::now();
    println!("datafusion start -> {:?}", chrono::offset::Local::now());
    let _cached = _df.cache().await;
    let elapsed = Instant::now() - start;
    println!("datafusion end -> {:?} {elapsed:?}", chrono::offset::Local::now());
}
```

Cargo.toml

```toml
[package]
name = "perf_test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
env_logger = "0.10.0"
parquet = "40.0.0"
serde = "1.0.163"
serde_json = "1.0.96"
datafusion = "27.0.0"
tokio = "1.0"
chrono = "0.4.26"
```

### Describe the solution you'd like

I would like DataFusion to read the parquet file in parallel, using the `target_partitions` config parameter: https://docs.rs/datafusion/latest/datafusion/config/struct.ExecutionOptions.html#structfield.target_partitions

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
