Re: [PR] Add clickbench parquet based queries to sql_planner benchmark [datafusion]

via GitHub Fri, 25 Oct 2024 07:37:32 -0700


alamb commented on code in PR #13103:
URL: https://github.com/apache/datafusion/pull/13103#discussion_r1816775576



##########
datafusion/core/benches/sql_planner.rs:
##########
@@ -299,6 +329,49 @@ fn criterion_benchmark(c: &mut Criterion) {
             }
         })
     });
+
+    // -- clickbench --
+
+    let queries_file =
+        File::open("../../benchmarks/queries/clickbench/queries.sql").unwrap();
+    let extended_file =
+        File::open("../../benchmarks/queries/clickbench/extended.sql").unwrap();
+
+    let clickbench_queries: Vec<String> = BufReader::new(queries_file)
+        .lines()
+        .chain(BufReader::new(extended_file).lines())
+        .map(|l| l.expect("Could not parse line"))
+        .collect_vec();
+
+    let clickbench_ctx = register_clickbench_hits_table();
+
+    for (i, sql) in clickbench_queries.iter().enumerate() {

Review Comment:
   Since the physical planning benchmark *also* includes logical planning, 
and most use cases involve both logical and physical planning, I think the 
logical-planning-only benchmarks are largely redundant.
   
   However, I realize this PR just follows the existing pattern. Maybe we 
should remove all the "logical planning" benchmarks 🤔 



##########
datafusion/core/benches/sql_planner.rs:
##########
@@ -15,22 +15,28 @@
 // specific language governing permissions and limitations
 // under the License.
 
+extern crate arrow;
 #[macro_use]
 extern crate criterion;
-extern crate arrow;
 extern crate datafusion;
 
 mod data_utils;
+
 use crate::criterion::Criterion;
 use arrow::datatypes::{DataType, Field, Fields, Schema};
 use datafusion::datasource::MemTable;
 use datafusion::execution::context::SessionContext;
+use itertools::Itertools;
+use std::fs::File;
+use std::io::{BufRead, BufReader};
 use std::sync::Arc;
 use test_utils::tpcds::tpcds_schemas;
 use test_utils::tpch::tpch_schemas;
 use test_utils::TableDef;
 use tokio::runtime::Runtime;
 
+const CLICKBENCH_DATA_PATH: &str = "../../benchmarks/data/hits_partitioned/";

Review Comment:
   I think this assumes the benchmark is run from `datafusion/core` (which is 
what cargo does).
   
   However, that means that when I tried to run the benchmark binary directly 
from the repository root, it failed like this:
   
   ```
   (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ target/release/deps/sql_planner-d64e21551189f776 --bench physical_plan_clickbench_q1
   
   Gnuplot not found, using plotters backend
   thread 'main' panicked at datafusion/core/benches/sql_planner.rs:121:38:
   benchmarks/data/hits_partitioned/ could not be loaded. Please run 'benchmarks/bench.sh data clickbench_partitioned' prior to running this benchmark: Os { code: 2, kind: NotFound, message: "No such file or directory" }
   ```
   
   Any chance you could make the benchmark check both locations?
   - `../../benchmarks/data/hits_partitioned`
   - `benchmarks/data/hits_partitioned`
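
A minimal sketch of what checking both locations could look like, using only the standard library. The names `CLICKBENCH_DATA_CANDIDATES` and `find_existing_dir` are hypothetical and not part of this PR; the two paths are the ones listed above, resolved relative to the current working directory.

```rust
use std::path::{Path, PathBuf};

// Hypothetical constant: candidate data locations, tried in order.
// The first matches a run from `datafusion/core` (what cargo does),
// the second a run from the repository root.
const CLICKBENCH_DATA_CANDIDATES: [&str; 2] = [
    "../../benchmarks/data/hits_partitioned",
    "benchmarks/data/hits_partitioned",
];

/// Hypothetical helper: returns the first candidate that exists as a
/// directory, or `None` if none of them do.
fn find_existing_dir(candidates: &[&str]) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|c| Path::new(c))
        .find(|p| p.is_dir())
        .map(|p| p.to_path_buf())
}

fn main() {
    match find_existing_dir(&CLICKBENCH_DATA_CANDIDATES) {
        Some(path) => println!("using clickbench data at {}", path.display()),
        None => eprintln!(
            "clickbench data not found in any of {:?}. Please run \
             'benchmarks/bench.sh data clickbench_partitioned' first",
            CLICKBENCH_DATA_CANDIDATES
        ),
    }
}
```

The same lookup could presumably be reused for the `queries.sql` and `extended.sql` paths, which are also relative to `datafusion/core`.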



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

