alamb commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123656890

##########
benchmarks/README.md:
##########
@@ -143,13 +145,13 @@ h2o groupby query 1 took 1669 ms
 
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
 
-## Parquet filter pushdown benchmarks
+## Parquet benchmarks
 
-This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
-a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
+The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
 
 ```base
-cargo run --release --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
+cargo run --release --bin parquet -- --path ./data --scale-factor 1.0

Review Comment:
🤔 I tried this command and got:

```shell
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run --release --bin parquet -- --path ./data --scale-factor 1.0
error: no bin target named `parquet`.

Available bin targets:
    print_config_docs
    tpch
    h2o
    nyctaxi
```

I think it is because the file was moved out of `benchmarks/src/bin` and into `benchmarks/parquet.rs`

##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");

Review Comment:
what would you think about adding some way to pick which benchmark to run here.
Something like

Run the sort benchmark like
```shell
cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
```

Run the filter benchmark like
```shell
cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
```

##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");
+    run_sort_benchmarks(opt, &test_file).await?;
 
     Ok(())
 }
 
-async fn run_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+async fn run_sort_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+    use datafusion::physical_expr::expressions::col;
+    let scan_options_matrix = vec![
+        ParquetScanOptions {
+            pushdown_filters: false,
+            reorder_filters: false,
+            enable_page_index: false,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: true,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: false,
+        },
+    ];
+    let schema = test_file.schema();
+    let sort_cases = vec![
+        (
+            "sort utf8",
+            vec![PhysicalSortExpr {
+                expr: col("request_method", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort int",
+            vec![PhysicalSortExpr {
+                expr: col("request_bytes", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort decimal",
+            vec![
+                // sort decimal
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort integer tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("response_bytes", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort utf8 tuple",
+            vec![
+                // sort utf8 tuple
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("host", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("pod", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("image", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort mixed tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+    ];
+    for (title, expr) in sort_cases {
+        println!("Executing '{title}' (sorting by: {expr:?})");
+        for scan_options in &scan_options_matrix {

Review Comment:
since there are no filters in this test, I don't think it is necessary to try the various different filter options -- just the different sorts would be great.

##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
I am not sure the `lit(true)` filter will be removed at this level. What do you think about making `TestParquetFile::create_scan` take `Option<Expr>` instead of passing in lit true?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
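[Editor's note] The suggestion to pick which benchmark to run via a `sort`/`filter` subcommand could be sketched with plain argument matching. This is a hypothetical illustration, not the PR's actual code: `dispatch` is an invented helper, the real benchmark would more likely add structopt/clap subcommands to its existing `Opt` struct, and only the `sort`/`filter` names come from the review comment.

```rust
use std::env;

// Hypothetical sketch: dispatch on an optional first CLI argument to decide
// which benchmark family to run. Everything here is a stand-in for the real
// `run_sort_benchmarks` / `run_filter_benchmarks` calls.
fn dispatch(subcommand: Option<&str>) -> String {
    match subcommand {
        Some("sort") => "running sort benchmarks".to_string(),
        Some("filter") => "running filter benchmarks".to_string(),
        Some(other) => format!("unknown benchmark: {other}"),
        // No subcommand given: keep the PR's current behavior of running both.
        None => "running filter and sort benchmarks".to_string(),
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    println!("{}", dispatch(args.get(1).map(String::as_str)));
}
```

With a scheme like this, `cargo run --release --bin parquet -- sort --path ./data` would select only the sort benchmarks, while omitting the subcommand preserves today's run-everything behavior.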
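[Editor's note] The `Option<Expr>` signature suggested in the last comment can be sketched as below. `Expr` and `TestParquetFile` here are simplified stand-ins, not DataFusion's real types, and `create_scan` returns a `String` instead of an execution plan so the example stays self-contained.

```rust
// Simplified stand-in for a filter expression (DataFusion's real `Expr`
// has many more variants).
#[derive(Debug)]
enum Expr {
    Literal(bool),
}

struct TestParquetFile;

impl TestParquetFile {
    // With `Option<Expr>`, a caller that needs no predicate (the sort
    // benchmark) passes `None`, rather than passing `lit(true)` and hoping
    // the always-true filter is optimized away before the scan.
    fn create_scan(&self, filter: Option<Expr>) -> String {
        match filter {
            Some(expr) => format!("scan with filter {expr:?}"),
            None => "scan with no filter".to_string(),
        }
    }
}

fn main() {
    let file = TestParquetFile;
    // Filter benchmark: push a real predicate down to the scan.
    println!("{}", file.create_scan(Some(Expr::Literal(true))));
    // Sort benchmark: no filter at all.
    println!("{}", file.create_scan(None));
}
```

The design point is that `None` makes "no filtering" explicit in the API, so the sort benchmark measures sorting alone instead of also exercising (or depending on the removal of) a trivially-true pushdown predicate.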
