2010YOUY01 commented on code in PR #21821:
URL: https://github.com/apache/datafusion/pull/21821#discussion_r3188033253


##########
benchmarks/src/hj.rs:
##########
@@ -303,6 +301,86 @@ const HASH_QUERIES: &[HashJoinQuery] = &[
         build_size: "100K_(20%_dups)",
         probe_size: "60M",
     },
+    // RightSemi Join benchmarks with Int32 keys
+    // Q16: RightSemi, Small build (25 rows), 100% Hit rate

Review Comment:
   Let's also document the fanout here: if we changed the join type to an inner join, it tells us how many matches each probe row would find on average.
   
   This can be calculated automatically by running `explain analyze` on the query after changing the join type to `inner join`; it will show up in `HashJoinExec`'s metrics.
   
   Later we should also ensure those queries cover different fanouts.
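
   For these synthetic workloads the fanout can also be derived analytically, without running `explain analyze`. Here is a small standalone sketch (the `expected_fanout` helper is hypothetical, not part of the PR): build keys are generated as `i % key_mod`, so every distinct build key occurs `build_rows / key_mod` times, and each probe row that hits the build keyspace matches that many build rows under an inner join.

   ```rust
   /// Hypothetical helper (not part of the PR): average inner-join fanout
   /// for a build side whose keys are generated as `key = i % key_mod`.
   /// Each distinct key occurs build_rows / key_mod times, so a probe row
   /// that hits the build keyspace matches that many build rows.
   fn expected_fanout(build_rows: usize, build_key_mod: usize) -> f64 {
       build_rows as f64 / build_key_mod as f64
   }

   fn main() {
       // e.g. a 100K-row build side over 1K distinct keys: each matching
       // probe row would see ~100 build-side matches under an inner join.
       let fanout = expected_fanout(100_000, 1_000);
       assert!((fanout - 100.0).abs() < 1e-9);
       println!("fanout = {fanout}");
   }
   ```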



##########
benchmarks/src/hj.rs:
##########


Review Comment:
   Perhaps we should disable the `datafusion.optimizer.join_reordering` configuration here, in case the optimizer swaps the join sides 🤔



##########
datafusion/physical-plan/benches/hash_join_semi_anti.rs:
##########
@@ -0,0 +1,334 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Criterion benchmarks for Hash Join with RightSemi/RightAnti
+//!
+//! These benchmarks measure the hash join kernel for semi/anti joins
+//! with Int32 keys, which can use roaring bitmap optimization.

Review Comment:
   ```suggestion
   //! with Int32 keys.
   //! Useful for Right Semi/Anti joins, where probing can short-circuit
   //! once the first match is found.
   ```



##########
datafusion/physical-plan/benches/hash_join_semi_anti.rs:
##########
@@ -0,0 +1,334 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Criterion benchmarks for Hash Join with RightSemi/RightAnti
+//!
+//! These benchmarks measure the hash join kernel for semi/anti joins
+//! with Int32 keys, which can use roaring bitmap optimization.
+
+use std::sync::Arc;
+
+use arrow::array::{Int32Array, RecordBatch, StringArray};
+use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use criterion::{BenchmarkId, Criterion, criterion_group, criterion_main};
+use datafusion_common::{JoinType, NullEquality};
+use datafusion_execution::TaskContext;
+use datafusion_physical_expr::expressions::col;
+use datafusion_physical_plan::collect;
+use datafusion_physical_plan::joins::{HashJoinExec, PartitionMode, utils::JoinOn};
+use datafusion_physical_plan::test::TestMemoryExec;
+use tokio::runtime::Runtime;
+
+/// Build RecordBatches with Int32 keys (for roaring optimization).
+///
+/// Schema: (key: Int32, data: Int32, payload: Utf8)
+///
+/// `key_mod` controls distinct key count: key = row_index % key_mod.
+/// `key_offset` shifts keys to control hit rate.
+fn build_batches(
+    num_rows: usize,
+    key_mod: usize,
+    key_offset: i32,
+    schema: &SchemaRef,
+) -> Vec<RecordBatch> {
+    let keys: Vec<i32> = (0..num_rows)
+        .map(|i| ((i % key_mod) as i32) + key_offset)
+        .collect();
+    let data: Vec<i32> = (0..num_rows).map(|i| i as i32).collect();
+    let payload: Vec<String> = data.iter().map(|d| format!("val_{d}")).collect();
+
+    let batch = RecordBatch::try_new(
+        Arc::clone(schema),
+        vec![
+            Arc::new(Int32Array::from(keys)),
+            Arc::new(Int32Array::from(data)),
+            Arc::new(StringArray::from(payload)),
+        ],
+    )
+    .unwrap();
+
+    let batch_size = 8192;
+    let mut batches = Vec::new();
+    let mut offset = 0;
+    while offset < batch.num_rows() {
+        let len = (batch.num_rows() - offset).min(batch_size);
+        batches.push(batch.slice(offset, len));
+        offset += len;
+    }
+    batches
+}
+
+fn make_exec(
+    batches: &[RecordBatch],
+    schema: &SchemaRef,
+) -> Arc<dyn datafusion_physical_plan::ExecutionPlan> {
+    TestMemoryExec::try_new_exec(&[batches.to_vec()], Arc::clone(schema), None).unwrap()
+}
+
+fn schema() -> SchemaRef {
+    Arc::new(Schema::new(vec![
+        Field::new("key", DataType::Int32, false),
+        Field::new("data", DataType::Int32, false),
+        Field::new("payload", DataType::Utf8, false),
+    ]))
+}
+
+fn do_hash_join(
+    left: Arc<dyn datafusion_physical_plan::ExecutionPlan>,
+    right: Arc<dyn datafusion_physical_plan::ExecutionPlan>,
+    join_type: JoinType,
+    rt: &Runtime,
+) -> usize {
+    let on: JoinOn = vec![(
+        col("key", &left.schema()).unwrap(),
+        col("key", &right.schema()).unwrap(),
+    )];
+    let join = HashJoinExec::try_new(
+        left,
+        right,
+        on,
+        None,
+        &join_type,
+        None,
+        PartitionMode::CollectLeft,
+        NullEquality::NullEqualsNothing,
+        false,
+    )
+    .unwrap();
+
+    let task_ctx = Arc::new(TaskContext::default());
+    rt.block_on(async {
+        let batches = collect(Arc::new(join), task_ctx).await.unwrap();
+        batches.iter().map(|b| b.num_rows()).sum()
+    })
+}
+
+/// Build batches with sparse keys (key = row_index % key_mod * multiplier + key_offset).
+/// The `multiplier` controls density: 1 = 100%, 2 = 50%, 10 = 10%.
+fn build_batches_sparse(
+    num_rows: usize,
+    key_mod: usize,
+    key_offset: i32,
+    multiplier: i32,
+    schema: &SchemaRef,
+) -> Vec<RecordBatch> {
+    let keys: Vec<i32> = (0..num_rows)
+        .map(|i| ((i % key_mod) as i32) * multiplier + key_offset)
+        .collect();
+    let data: Vec<i32> = (0..num_rows).map(|i| i as i32).collect();
+    let payload: Vec<String> = data.iter().map(|d| format!("val_{d}")).collect();
+
+    let batch = RecordBatch::try_new(
+        Arc::clone(schema),
+        vec![
+            Arc::new(Int32Array::from(keys)),
+            Arc::new(Int32Array::from(data)),
+            Arc::new(StringArray::from(payload)),
+        ],
+    )
+    .unwrap();
+
+    let batch_size = 8192;
+    let mut batches = Vec::new();
+    let mut offset = 0;
+    while offset < batch.num_rows() {
+        let len = (batch.num_rows() - offset).min(batch_size);
+        batches.push(batch.slice(offset, len));
+        offset += len;
+    }
+    batches
+}
+
+fn bench_hash_join_semi_anti(c: &mut Criterion) {
+    let rt = Runtime::new().unwrap();
+    let s = schema();
+
+    let mut group = c.benchmark_group("hash_join_semi_anti");
+
+    // Build side: 100K rows, Probe side: 1M rows
+    let build_rows = 100_000;
+    let probe_rows = 1_000_000;
+
+    // =========================================================================
+    // RightSemi Join benchmarks
+    // =========================================================================
+
+    // RightSemi - 100% Density, 100% hit rate

Review Comment:
   The `density` parameter is a bit hard to interpret; could you add a comment to make the workload easier to understand?
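
   To make the point concrete, here is a small standalone sketch (the `density` helper is hypothetical, not part of the PR) mirroring how `build_batches_sparse` spreads keys: with `multiplier = 2` the build side only populates every second key, so a dense probe side over the same range can match at most ~50% of its rows.

   ```rust
   use std::collections::HashSet;

   /// Hypothetical helper mirroring build_batches_sparse: fraction of
   /// dense probe keys 0..key_mod that exist in a sparse build keyspace
   /// generated as key = (i % key_mod) * multiplier.
   fn density(key_mod: usize, multiplier: i32) -> f64 {
       let build_keys: HashSet<i32> =
           (0..key_mod as i32).map(|i| i * multiplier).collect();
       let hits = (0..key_mod as i32)
           .filter(|k| build_keys.contains(k))
           .count();
       hits as f64 / key_mod as f64
   }

   fn main() {
       assert_eq!(density(1000, 1), 1.0); // every probe key can match
       assert_eq!(density(1000, 2), 0.5); // build keys are even only
       assert_eq!(density(1000, 10), 0.1); // build keys are multiples of 10
   }
   ```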



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

