2010YOUY01 commented on code in PR #13996:
URL: https://github.com/apache/datafusion/pull/13996#discussion_r1903253001

##########
benchmarks/bench.sh:
##########
@@ -80,6 +80,9 @@ clickbench_1: ClickBench queries against a single parquet file
 clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
 clickbench_extended: ClickBench \"inspired\" queries against a single parquet (DataFusion specific)
 external_aggr: External aggregation benchmark
+h2o_small: h2oai benchmark with small dataset (1e7 rows), default file format is parquet
+h2o_medium: h2oai benchmark with medium dataset (1e8 rows), default file format is parquet
+h2o_big: h2oai benchmark with large dataset (1e9 rows), default file format is parquet

Review Comment:
The benchmark results in https://duckdb.org/2023/04/14/h2oai.html were run on a CSV dataset; perhaps we can include an `h2o_medium_csv` option in this entry point?

##########
benchmarks/src/h2o.rs:
##########
@@ -0,0 +1,263 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::util::{BenchmarkRun, CommonOpt};
+use datafusion::{error::Result, prelude::SessionContext};
+use datafusion_common::{exec_datafusion_err, instant::Instant, DataFusionError};
+use std::path::{Path, PathBuf};
+use structopt::StructOpt;
+use datafusion::prelude::ParquetReadOptions;
+
+/// Run the H2O benchmark
+#[derive(Debug, StructOpt, Clone)]
+#[structopt(verbatim_doc_comment)]
+pub struct RunOpt {
+    #[structopt(short, long)]
+    query: Option<usize>,
+
+    /// Common options
+    #[structopt(flatten)]
+    common: CommonOpt,
+
+    /// Path to queries.sql (single file)
+    /// default value is the groupby.sql file in the h2o benchmark
+    #[structopt(
+        parse(from_os_str),
+        short = "r",
+        long = "queries-path",
+        default_value = "benchmarks/queries/h2o/groupby.sql"
+    )]
+    queries_path: PathBuf,

Review Comment:
Perhaps we can remove this query path option? I think these queries should be static; unlike large datasets, they are unlikely to be placed elsewhere.

##########
benchmarks/bench.sh:
##########
@@ -541,6 +583,128 @@ run_imdb() {
     $CARGO_COMMAND --bin imdb -- benchmark datafusion --iterations 5 --path "${IMDB_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}"
 }
+data_h2o() {
+    # Default values for size and data format
+    SIZE=${1:-"SMALL"}
+    DATA_FORMAT=${2:-"PARQUET"}
+
+    # Ensure the Python version is 3.10 or higher
+    REQUIRED_PYTHON="python3.10"
+    if ! command -v $REQUIRED_PYTHON &> /dev/null
+    then
+        echo "$REQUIRED_PYTHON could not be found. Please install Python 3.10 or higher."

Review Comment:
```suggestion
        echo "$REQUIRED_PYTHON could not be found. Please install Python 3.10."
```
Looks like this script requires exact 3.10
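
On the Python check in `data_h2o()`: the quoted code looks up the `python3.10` binary by name, so it only accepts that exact interpreter. If the intent really were "3.10 or higher", the check would need to compare version numbers instead. Below is a minimal sketch of that alternative, not what the PR does; `version_at_least` and `find_python` are hypothetical helper names, and the candidate interpreter list is an assumption.

```shell
# Hypothetical helper (not part of bench.sh): succeed when a
# "MAJOR.MINOR[.PATCH]" version string is >= the required MAJOR.MINOR.
version_at_least() {
    version="$1"
    req_major="$2"
    req_minor="$3"
    major="${version%%.*}"      # text before the first dot
    minor="${version#*.}"       # text after the first dot
    minor="${minor%%.*}"        # drop any patch component
    if [ "$major" -gt "$req_major" ]; then
        return 0
    fi
    [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]
}

# Hypothetical helper: print the first interpreter on PATH that reports
# Python 3.10 or newer. The candidate names are an assumption.
find_python() {
    for candidate in python3 python3.12 python3.11 python3.10; do
        command -v "$candidate" > /dev/null 2>&1 || continue
        found="$("$candidate" -c 'import sys; print("%d.%d" % sys.version_info[:2])')" || continue
        if version_at_least "$found" 3 10; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if PYTHON_BIN="$(find_python)"; then
    echo "Using $PYTHON_BIN"
else
    echo "No Python 3.10 or higher found on PATH." >&2
fi
```

If the data generator genuinely depends on 3.10 specifically, the exact-name lookup in the PR is the simpler choice and only the error message needs to change, as in the suggestion above.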
