alamb commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123656890

##########
benchmarks/README.md:
##########
@@ -143,13 +145,13 @@ h2o groupby query 1 took 1669 ms
 
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
 
-## Parquet filter pushdown benchmarks
+## Parquet benchmarks
 
-This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
-a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
+The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
 
 ```base
-cargo run --release --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
+cargo run --release --bin parquet -- --path ./data --scale-factor 1.0

Review Comment:
🤔 I tried this command and got:

```shell
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run --release --bin parquet -- --path ./data --scale-factor 1.0
error: no bin target named `parquet`.

Available bin targets:
    print_config_docs
    tpch
    h2o
    nyctaxi
```

I think it is because the file was moved out of `benchmarks/src/bin` and into `benchmarks/parquet.rs`

##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");

Review Comment:
what would you think about adding some way to pick which benchmark to run here.
Something like

Run the sort benchmark like
```shell
cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
```

Run the filter benchmark like
```shell
cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
```

##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");
+    run_sort_benchmarks(opt, &test_file).await?;
 
     Ok(())
 }
 
-async fn run_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+async fn run_sort_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+    use datafusion::physical_expr::expressions::col;
+    let scan_options_matrix = vec![
+        ParquetScanOptions {
+            pushdown_filters: false,
+            reorder_filters: false,
+            enable_page_index: false,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: true,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: false,
+        },
+    ];
+    let schema = test_file.schema();
+    let sort_cases = vec![
+        (
+            "sort utf8",
+            vec![PhysicalSortExpr {
+                expr: col("request_method", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort int",
+            vec![PhysicalSortExpr {
+                expr: col("request_bytes", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort decimal",
+            vec![
+                // sort decimal
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort integer tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("response_bytes", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort utf8 tuple",
+            vec![
+                // sort utf8 tuple
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("host", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("pod", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("image", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort mixed tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+    ];
+    for (title, expr) in sort_cases {
+        println!("Executing '{title}' (sorting by: {expr:?})");
+        for scan_options in &scan_options_matrix {

Review Comment:
since there are no filters in this test, I don't think it is necessary to try the various different filter options -- just the different sorts would be great.

##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
I am not sure the `lit(true)` filter will be removed at this level. What do you think about making `TestParquetFile::create_scan` take `Option<Expr>` instead of passing in lit true?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
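[Editor's note] The suggestion to pick which benchmark to run via a `sort`/`filter` subcommand could be sketched with plain argument matching. This is a hypothetical illustration, not the PR's actual code: `dispatch` is an invented helper, the real benchmark would more likely add structopt/clap subcommands to its existing `Opt` struct, and only the `sort`/`filter` names come from the review comment.

```rust
use std::env;

// Hypothetical sketch: dispatch on an optional first CLI argument to decide
// which benchmark family to run. Everything here is a stand-in for the real
// `run_sort_benchmarks` / `run_filter_benchmarks` calls.
fn dispatch(subcommand: Option<&str>) -> String {
    match subcommand {
        Some("sort") => "running sort benchmarks".to_string(),
        Some("filter") => "running filter benchmarks".to_string(),
        Some(other) => format!("unknown benchmark: {other}"),
        // No subcommand given: keep the PR's current behavior of running both.
        None => "running filter and sort benchmarks".to_string(),
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    println!("{}", dispatch(args.get(1).map(String::as_str)));
}
```

With a scheme like this, `cargo run --release --bin parquet -- sort --path ./data` would select only the sort benchmarks, while omitting the subcommand preserves today's run-everything behavior.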
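[Editor's note] The `Option<Expr>` signature suggested in the last comment can be sketched as below. `Expr` and `TestParquetFile` here are simplified stand-ins, not DataFusion's real types, and `create_scan` returns a `String` instead of an execution plan so the example stays self-contained.

```rust
// Simplified stand-in for a filter expression (DataFusion's real `Expr`
// has many more variants).
#[derive(Debug)]
enum Expr {
    Literal(bool),
}

struct TestParquetFile;

impl TestParquetFile {
    // With `Option<Expr>`, a caller that needs no predicate (the sort
    // benchmark) passes `None`, rather than passing `lit(true)` and hoping
    // the always-true filter is optimized away before the scan.
    fn create_scan(&self, filter: Option<Expr>) -> String {
        match filter {
            Some(expr) => format!("scan with filter {expr:?}"),
            None => "scan with no filter".to_string(),
        }
    }
}

fn main() {
    let file = TestParquetFile;
    // Filter benchmark: push a real predicate down to the scan.
    println!("{}", file.create_scan(Some(Expr::Literal(true))));
    // Sort benchmark: no filter at all.
    println!("{}", file.create_scan(None));
}
```

The design point is that `None` makes "no filtering" explicit in the API, so the sort benchmark measures sorting alone instead of also exercising (or depending on the removal of) a trivially-true pushdown predicate.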
