alamb commented on PR #14918: URL: https://github.com/apache/datafusion/pull/14918#issuecomment-2694766133
I still could not reproduce any improvement with this PR, FWIW. I still think it is a good change, so I merged it in, but it might be cool to find some benchmark results that show the improvement.

<details><summary>Details</summary>
<p>

```rust
use std::sync::Arc;
use std::time::Instant;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let object_store_url = ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
    let object_store = object_store::http::HttpBuilder::new()
        .with_url(object_store_url.as_str())
        .build()
        .unwrap();
    ctx.register_object_store(object_store_url.as_ref(), Arc::new(object_store));

    // urls are like
    // https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet
    //let base_url = ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
    let paths: Vec<ListingTableUrl> = (1..100)
        .map(|i| format!("https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{i}.parquet"))
        .map(|url| ListingTableUrl::parse(&url).unwrap())
        .collect();

    let listing_options = ListingOptions::new(Arc::new(ParquetFormat::new()))
        .with_collect_stat(true);

    let start = Instant::now();
    println!("Creating table / reading statistics....");
    let config = ListingTableConfig::new_with_multi_paths(paths)
        .with_listing_options(listing_options)
        .infer_schema(&ctx.state()).await?;
    let listing_table = ListingTable::try_new(config).unwrap();
    let df = ctx.read_table(Arc::new(listing_table))?;
    println!("Done in {:?}", Instant::now() - start);

    println!("running query");
    let start = Instant::now();
    let batches = df.limit(0, Some(10))?.collect().await.unwrap();
    println!("Got {} batches in {:?}", batches.len(), Instant::now() - start);
    Ok(())
}
```

</p>
</details>

Some testing numbers (the results vary wildly):

On this branch:

```
Creating table / reading statistics....
Done in 250.333042ms
running query
Got 1 batches in 1.943637416s
hello world!
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/rust_playground$ cargo run --release
    Finished `release` profile [optimized] target(s) in 0.21s
     Running `target/release/rust_playground`
Creating table / reading statistics....
Done in 174.578ms
running query
Got 1 batches in 1.62131175s
hello world!
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/rust_playground$ cargo run --release
    Finished `release` profile [optimized] target(s) in 0.12s
     Running `target/release/rust_playground`
Creating table / reading statistics....
Done in 191.24325ms
running query
Got 1 batches in 1.257049458s
hello world!
```

On main:

```
Creating table / reading statistics....
Done in 165.25ms
running query
Got 1 batches in 819.607625ms
hello world!
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/rust_playground$ cargo run --release
    Finished `release` profile [optimized] target(s) in 0.20s
     Running `target/release/rust_playground`
Creating table / reading statistics....
Done in 165.120666ms
running query
Got 1 batches in 1.036410625s
hello world!
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/rust_playground$ cargo run --release
    Finished `release` profile [optimized] target(s) in 0.10s
     Running `target/release/rust_playground`
Creating table / reading statistics....
Done in 198.459166ms
running query
Got 1 batches in 831.307041ms
hello world!
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org