This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push: new 4818966fa0 Apply pre-selection and computation skipping to short-circuit optimization (#15694) 4818966fa0 is described below commit 4818966fa0950a6a471d1ea9e4b8c5808878207e Author: LB7666 <acking...@tencent.com> AuthorDate: Fri Apr 18 01:06:13 2025 +0800 Apply pre-selection and computation skipping to short-circuit optimization (#15694) * Enhance short-circuit evaluation for binary expressions - Delay evaluation of the right-hand side (RHS) unless necessary. - Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first. - Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS. - Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations. * refactor: rename test_check_short_circuit to test_get_short_circuit_result and update assertions - Renamed the test function for clarity. - Updated assertions to use get_short_circuit_result instead of check_short_circuit. - Added additional test cases for AND and OR operations with expected results. * fix: enhance short-circuit evaluation logic in get_short_circuit_result function for null - Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array. - Adjusted test case to reflect the change in expected output. * feat: add debug logging for binary expression evaluation and short-circuit checks * fix: improve short-circuit evaluation logic in BinaryExpr to ensure RHS is only evaluated when necessary * fix: restrict short-circuit evaluation to logical operators in get_short_circuit_result function * add more println!("==> "); * fix: remove duplicate data type checks for left and right operands in BinaryExpr evaluation * feat: add debug prints for dictionary values and keys in binary expression tests * Tests pass * fix: remove redundant short-circuit evaluation check in BinaryExpr and enhance documentation for get_short_circuit_result * refactor: remove unnecessary debug prints and streamline short-circuit evaluation in BinaryExpr * test: enhance short-circuit evaluation tests for nullable and scalar values in BinaryExpr * add benchmark * refactor: improve short-circuit logic in BinaryExpr for logical operators - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function. - Updated handling of Boolean data types to return `None` for null values. - Simplified short-circuit checks for AND/OR operations by consolidating logic. - Enhanced readability and maintainability of the code by restructuring match statements. * refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations * Revert "refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations" This reverts commit a62df471e10ea4e53c7438c768ed273d1d344d22. * bench: add benchmark for OR operation with all false values in short-circuit evaluation * refactor: add ShortCircuitStrategy enum to optimize short-circuit evaluation in BinaryExpr - Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic. - Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators. - Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS. - Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations. * refactor: simplify short-circuit evaluation logic in check_short_circuit function * datafusion_expr::lit as expr_lit * refactor: optimize short-circuit evaluation in check_short_circuit function - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques. * refactor: add count_boolean_values helper function and optimize check_short_circuit logic - Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance. - Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations. - Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism. * Revert "refactor: add count_boolean_values helper function and optimize check_short_circuit logic" This reverts commit e2b9f777222759a530bd539ab0453c100b2699b4. * optimise evaluate * optimise evaluate 2 * refactor op:AND, lhs all false op:OR, lhs all true to be faster * fix clippy warning * refactor: optimize short-circuit evaluation logic in check_short_circuit function * fix clippy warning * add pre selection * add some comments * [WIP] fix pre-selection result * fix: Error in calculating the ratio * fix: Correct typo in pre-selection threshold constant and improve pre-selection scatter function documentation * fix doctest error * fix cargo doc * fix cargo doc * test: Add unit tests for pre_selection_scatter function --------- Co-authored-by: Siew Kam Onn <kos...@gmail.com> --- datafusion/physical-expr/benches/binary_op.rs | 17 +- datafusion/physical-expr/src/expressions/binary.rs | 509 ++++++++++++++++++--- 2 files changed, 458 insertions(+), 68 deletions(-) diff --git a/datafusion/physical-expr/benches/binary_op.rs b/datafusion/physical-expr/benches/binary_op.rs index 59a602df05..216d8a520e 100644 --- a/datafusion/physical-expr/benches/binary_op.rs +++ b/datafusion/physical-expr/benches/binary_op.rs @@ -126,14 +126,25 @@ fn generate_boolean_cases<const TEST_ALL_FALSE: bool>( )); } + // Scenario 7: Test all true or all false in AND/OR + // This situation won't cause a short circuit, but it can skip the bool calculation + if TEST_ALL_FALSE { + let all_true = vec![true; len]; + cases.push(("all_true_in_and".to_string(), BooleanArray::from(all_true))); + } else { + let all_false = vec![false; len]; + cases.push(("all_false_in_or".to_string(), BooleanArray::from(all_false))); + } + cases } /// Benchmarks AND/OR operator short-circuiting by evaluating complex regex conditions. /// -/// Creates 6 test scenarios per operator: +/// Creates 7 test scenarios per operator: /// 1. All values enable short-circuit (all_true/all_false) /// 2. 2-6 Single true/false value at different positions to measure early exit +/// 3. Test all true or all false in AND/OR /// /// You can run this benchmark with: /// ```sh @@ -203,7 +214,7 @@ fn benchmark_binary_op_in_short_circuit(c: &mut Criterion) { // Each scenario when the test operator is `and` { - for (name, batch) in batches_and { + for (name, batch) in batches_and.into_iter() { c.bench_function(&format!("short_circuit/and/{}", name), |b| { b.iter(|| expr_and.evaluate(black_box(&batch)).unwrap()) }); @@ -211,7 +222,7 @@ fn benchmark_binary_op_in_short_circuit(c: &mut Criterion) { } // Each scenario when the test operator is `or` { - for (name, batch) in batches_or { + for (name, batch) in batches_or.into_iter() { c.bench_function(&format!("short_circuit/or/{}", name), |b| { b.iter(|| expr_or.evaluate(black_box(&batch)).unwrap()) }); diff --git a/datafusion/physical-expr/src/expressions/binary.rs b/datafusion/physical-expr/src/expressions/binary.rs index 84374f4a29..6c68d11e2c 100644 --- a/datafusion/physical-expr/src/expressions/binary.rs +++ b/datafusion/physical-expr/src/expressions/binary.rs @@ -29,7 +29,9 @@ use arrow::compute::kernels::boolean::{and_kleene, not, or_kleene}; use arrow::compute::kernels::cmp::*; use arrow::compute::kernels::comparison::{regexp_is_match, regexp_is_match_scalar}; use arrow::compute::kernels::concat_elements::concat_elements_utf8; -use arrow::compute::{cast, ilike, like, nilike, nlike}; +use arrow::compute::{ + cast, filter_record_batch, ilike, like, nilike, nlike, SlicesIterator, +}; use arrow::datatypes::*; use arrow::error::ArrowError; use datafusion_common::cast::as_boolean_array; @@ -358,11 +360,24 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { use arrow::compute::kernels::numeric::*; + // Evaluate left-hand side expression. let lhs = self.left.evaluate(batch)?; - // Optimize for short-circuiting `Operator::And` or `Operator::Or` operations and return early. - if check_short_circuit(&lhs, &self.op) { - return Ok(lhs); + // Check if we can apply short-circuit evaluation. + match check_short_circuit(&lhs, &self.op) { + ShortCircuitStrategy::None => {} + ShortCircuitStrategy::ReturnLeft => return Ok(lhs), + ShortCircuitStrategy::ReturnRight => { + let rhs = self.right.evaluate(batch)?; + return Ok(rhs); + } + ShortCircuitStrategy::PreSelection(selection) => { + // The function `evaluate_selection` was not called for filtering and calculation, + // as it takes into account cases where the selection contains null values. + let batch = filter_record_batch(batch, selection)?; + let right_ret = self.right.evaluate(&batch)?; + return pre_selection_scatter(selection, right_ret); + } } let rhs = self.right.evaluate(batch)?; @@ -405,23 +420,19 @@ impl PhysicalExpr for BinaryExpr { let result_type = self.data_type(input_schema)?; - // Attempt to use special kernels if one input is scalar and the other is an array - let scalar_result = match (&lhs, &rhs) { - (ColumnarValue::Array(array), ColumnarValue::Scalar(scalar)) => { - // if left is array and right is literal(not NULL) - use scalar operations - if scalar.is_null() { - None - } else { - self.evaluate_array_scalar(array, scalar.clone())?.map(|r| { - r.and_then(|a| to_result_type_array(&self.op, a, &result_type)) - }) + // If the left-hand side is an array and the right-hand side is a non-null scalar, try the optimized kernel. + if let (ColumnarValue::Array(array), ColumnarValue::Scalar(ref scalar)) = + (&lhs, &rhs) + { + if !scalar.is_null() { + if let Some(result_array) = + self.evaluate_array_scalar(array, scalar.clone())? + { + let final_array = result_array + .and_then(|a| to_result_type_array(&self.op, a, &result_type)); + return final_array.map(ColumnarValue::Array); } } - (_, _) => None, // default to array implementation - }; - - if let Some(result) = scalar_result { - return result.map(ColumnarValue::Array); } // if both arrays or both literals - extract arrays and continue execution @@ -811,58 +822,199 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { + None, + ReturnLeft, + ReturnRight, + PreSelection(&'a BooleanArray), +} + +/// Based on the results calculated from the left side of the short-circuit operation, +/// if the proportion of `true` is less than 0.2 and the current operation is an `and`, +/// the `RecordBatch` will be filtered in advance. +const PRE_SELECTION_THRESHOLD: f32 = 0.2; + /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation based on the left-hand side (lhs) result. /// -/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes unnecessary: -/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be `false` regardless of rhs. -/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be `true` regardless of rhs. -/// -/// Returns `true` if short-circuiting is possible, `false` otherwise. -/// +/// Short-circuiting occurs under these circumstances: +/// - For `AND`: +/// - if LHS is all false => short-circuit → return LHS +/// - if LHS is all true => short-circuit → return RHS +/// - if LHS is mixed and true_count/sum_count <= [`PRE_SELECTION_THRESHOLD`] -> pre-selection +/// - For `OR`: +/// - if LHS is all true => short-circuit → return LHS +/// - if LHS is all false => short-circuit → return RHS /// # Arguments -/// * `arg` - The left-hand side (lhs) columnar value (array or scalar) +/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar) +/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar) /// * `op` - The logical operator (`AND` or `OR`) /// /// # Implementation Notes /// 1. Only works with Boolean-typed arguments (other types automatically return `false`) /// 2. Handles both scalar values and array values -/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from arrow-rs. -/// `bool_or`/`bool_and` maybe a better choice too,for detailed discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418) -fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { - let data_type = arg.data_type(); - match (data_type, op) { - (DataType::Boolean, Operator::And) => { - match arg { - ColumnarValue::Array(array) => { - if let Ok(array) = as_boolean_array(&array) { - return array.false_count() == array.len(); - } +/// 3. For arrays, uses optimized bit counting techniques for boolean arrays +fn check_short_circuit<'a>( + lhs: &'a ColumnarValue, + op: &Operator, +) -> ShortCircuitStrategy<'a> { + // Quick reject for non-logical operators,and quick judgment when op is and + let is_and = match op { + Operator::And => true, + Operator::Or => false, + _ => return ShortCircuitStrategy::None, + }; + + // Non-boolean types can't be short-circuited + if lhs.data_type() != DataType::Boolean { + return ShortCircuitStrategy::None; + } + + match lhs { + ColumnarValue::Array(array) => { + // Fast path for arrays - try to downcast to boolean array + if let Ok(bool_array) = as_boolean_array(array) { + // Arrays with nulls can't be short-circuited + if bool_array.null_count() > 0 { + return ShortCircuitStrategy::None; } - ColumnarValue::Scalar(scalar) => { - if let ScalarValue::Boolean(Some(value)) = scalar { - return !value; + + let len = bool_array.len(); + if len == 0 { + return ShortCircuitStrategy::None; + } + + let true_count = bool_array.values().count_set_bits(); + if is_and { + // For AND, prioritize checking for all-false (short circuit case) + // Uses optimized false_count() method provided by Arrow + + // Short circuit if all values are false + if true_count == 0 { + return ShortCircuitStrategy::ReturnLeft; + } + + // If no false values, then all must be true + if true_count == len { + return ShortCircuitStrategy::ReturnRight; + } + + // determine if we can pre-selection + if true_count as f32 / len as f32 <= PRE_SELECTION_THRESHOLD { + return ShortCircuitStrategy::PreSelection(bool_array); + } + } else { + // For OR, prioritize checking for all-true (short circuit case) + // Uses optimized true_count() method provided by Arrow + + // Short circuit if all values are true + if true_count == len { + return ShortCircuitStrategy::ReturnLeft; + } + + // If no true values, then all must be false + if true_count == 0 { + return ShortCircuitStrategy::ReturnRight; } } } - false } - (DataType::Boolean, Operator::Or) => { - match arg { - ColumnarValue::Array(array) => { - if let Ok(array) = as_boolean_array(&array) { - return array.true_count() == array.len(); - } - } - ColumnarValue::Scalar(scalar) => { - if let ScalarValue::Boolean(Some(value)) = scalar { - return *value; - } + ColumnarValue::Scalar(scalar) => { + // Fast path for scalar values + if let ScalarValue::Boolean(Some(is_true)) = scalar { + // Return Left for: + // - AND with false value + // - OR with true value + if (is_and && !is_true) || (!is_and && *is_true) { + return ShortCircuitStrategy::ReturnLeft; + } else { + return ShortCircuitStrategy::ReturnRight; } } - false } - _ => false, } + + // If we can't short-circuit, indicate that normal evaluation should continue + ShortCircuitStrategy::None +} + +/// Creates a new boolean array based on the evaluation of the right expression, +/// but only for positions where the left_result is true. +/// +/// This function is used for short-circuit evaluation optimization of logical AND operations: +/// - When left_result has few true values, we only evaluate the right expression for those positions +/// - Values are copied from right_array where left_result is true +/// - All other positions are filled with false values +/// +/// # Parameters +/// - `left_result` Boolean array with selection mask (typically from left side of AND) +/// - `right_result` Result of evaluating right side of expression (only for selected positions) +/// +/// # Returns +/// A combined ColumnarValue with values from right_result where left_result is true +/// +/// # Example +/// Initial Data: { 1, 2, 3, 4, 5 } +/// Left Evaluation +/// (Condition: Equal to 2 or 3) +/// ↓ +/// Filtered Data: {2, 3} +/// Left Bitmap: { 0, 1, 1, 0, 0 } +/// ↓ +/// Right Evaluation +/// (Condition: Even numbers) +/// ↓ +/// Right Data: { 2 } +/// Right Bitmap: { 1, 0 } +/// ↓ +/// Combine Results +/// Final Bitmap: { 0, 1, 0, 0, 0 } +/// +/// # Note +/// Perhaps it would be better to modify `left_result` directly without creating a copy? +/// In practice, `left_result` should have only one owner, so making changes should be safe. +/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`]. +fn pre_selection_scatter( + left_result: &BooleanArray, + right_result: ColumnarValue, +) -> Result<ColumnarValue> { + let right_boolean_array = match &right_result { + ColumnarValue::Array(array) => array.as_boolean(), + ColumnarValue::Scalar(_) => return Ok(right_result), + }; + + let result_len = left_result.len(); + + let mut result_array_builder = BooleanArray::builder(result_len); + + // keep track of current position we have in right boolean array + let mut right_array_pos = 0; + + // keep track of how much is filled + let mut last_end = 0; + SlicesIterator::new(left_result).for_each(|(start, end)| { + // the gap needs to be filled with false + if start > last_end { + result_array_builder.append_n(start - last_end, false); + } + + // copy values from right array for this slice + let len = end - start; + right_boolean_array + .slice(right_array_pos, len) + .iter() + .for_each(|v| result_array_builder.append_option(v)); + + right_array_pos += len; + last_end = end; + }); + + // Fill any remaining positions with false + if last_end < result_len { + result_array_builder.append_n(result_len - last_end, false); + } + let boolean_result = result_array_builder.finish(); + + Ok(ColumnarValue::Array(Arc::new(boolean_result))) } fn concat_elements(left: Arc<dyn Array>, right: Arc<dyn Array>) -> Result<ArrayRef> { @@ -919,10 +1071,14 @@ pub fn similar_to( mod tests { use super::*; use crate::expressions::{col, lit, try_cast, Column, Literal}; + use datafusion_expr::lit as expr_lit; use datafusion_common::plan_datafusion_err; use datafusion_physical_expr_common::physical_expr::fmt_sql; + use crate::planner::logical2physical; + use arrow::array::BooleanArray; + use datafusion_expr::col as logical_col; /// Performs a binary operation, applying any type coercion necessary fn binary_op( left: Arc<dyn PhysicalExpr>, @@ -4895,9 +5051,7 @@ mod tests { #[test] fn test_check_short_circuit() { - use crate::planner::logical2physical; - use datafusion_expr::col as logical_col; - use datafusion_expr::lit; + // Test with non-nullable arrays let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Int32, false), Field::new("b", DataType::Int32, false), @@ -4911,20 +5065,245 @@ mod tests { .unwrap(); // op: AND left: all false - let left_expr = logical2physical(&logical_col("a").eq(lit(2)), &schema); + let left_expr = logical2physical(&logical_col("a").eq(expr_lit(2)), &schema); let left_value = left_expr.evaluate(&batch).unwrap(); - assert!(check_short_circuit(&left_value, &Operator::And)); + assert!(matches!( + check_short_circuit(&left_value, &Operator::And), + ShortCircuitStrategy::ReturnLeft + )); + // op: AND left: not all false - let left_expr = logical2physical(&logical_col("a").eq(lit(3)), &schema); + let left_expr = logical2physical(&logical_col("a").eq(expr_lit(3)), &schema); let left_value = left_expr.evaluate(&batch).unwrap(); - assert!(!check_short_circuit(&left_value, &Operator::And)); + let ColumnarValue::Array(array) = &left_value else { + panic!("Expected ColumnarValue::Array"); + }; + let ShortCircuitStrategy::PreSelection(value) = + check_short_circuit(&left_value, &Operator::And) + else { + panic!("Expected ShortCircuitStrategy::PreSelection"); + }; + let expected_boolean_arr: Vec<_> = + as_boolean_array(array).unwrap().iter().collect(); + let boolean_arr: Vec<_> = value.iter().collect(); + assert_eq!(expected_boolean_arr, boolean_arr); + // op: OR left: all true - let left_expr = logical2physical(&logical_col("a").gt(lit(0)), &schema); + let left_expr = logical2physical(&logical_col("a").gt(expr_lit(0)), &schema); let left_value = left_expr.evaluate(&batch).unwrap(); - assert!(check_short_circuit(&left_value, &Operator::Or)); + assert!(matches!( + check_short_circuit(&left_value, &Operator::Or), + ShortCircuitStrategy::ReturnLeft + )); + // op: OR left: not all true - let left_expr = logical2physical(&logical_col("a").gt(lit(2)), &schema); + let left_expr: Arc<dyn PhysicalExpr> = + logical2physical(&logical_col("a").gt(expr_lit(2)), &schema); let left_value = left_expr.evaluate(&batch).unwrap(); - assert!(!check_short_circuit(&left_value, &Operator::Or)); + assert!(matches!( + check_short_circuit(&left_value, &Operator::Or), + ShortCircuitStrategy::None + )); + + // Test with nullable arrays and null values + let schema_nullable = Arc::new(Schema::new(vec![ + Field::new("c", DataType::Boolean, true), + Field::new("d", DataType::Boolean, true), + ])); + + // Create arrays with null values + let c_array = Arc::new(BooleanArray::from(vec![ + Some(true), + Some(false), + None, + Some(true), + None, + ])) as ArrayRef; + let d_array = Arc::new(BooleanArray::from(vec![ + Some(false), + Some(true), + Some(false), + None, + Some(true), + ])) as ArrayRef; + + let batch_nullable = RecordBatch::try_new( + Arc::clone(&schema_nullable), + vec![Arc::clone(&c_array), Arc::clone(&d_array)], + ) + .unwrap(); + + // Case: Mixed values with nulls - shouldn't short-circuit for AND + let mixed_nulls = logical2physical(&logical_col("c"), &schema_nullable); + let mixed_nulls_value = mixed_nulls.evaluate(&batch_nullable).unwrap(); + assert!(matches!( + check_short_circuit(&mixed_nulls_value, &Operator::And), + ShortCircuitStrategy::None + )); + + // Case: Mixed values with nulls - shouldn't short-circuit for OR + assert!(matches!( + check_short_circuit(&mixed_nulls_value, &Operator::Or), + ShortCircuitStrategy::None + )); + + // Test with all nulls + let all_nulls = Arc::new(BooleanArray::from(vec![None, None, None])) as ArrayRef; + let null_batch = RecordBatch::try_new( + Arc::new(Schema::new(vec![Field::new("e", DataType::Boolean, true)])), + vec![all_nulls], + ) + .unwrap(); + + let null_expr = logical2physical(&logical_col("e"), &null_batch.schema()); + let null_value = null_expr.evaluate(&null_batch).unwrap(); + + // All nulls shouldn't short-circuit for AND or OR + assert!(matches!( + check_short_circuit(&null_value, &Operator::And), + ShortCircuitStrategy::None + )); + assert!(matches!( + check_short_circuit(&null_value, &Operator::Or), + ShortCircuitStrategy::None + )); + + // Test with scalar values + // Scalar true + let scalar_true = ColumnarValue::Scalar(ScalarValue::Boolean(Some(true))); + assert!(matches!( + check_short_circuit(&scalar_true, &Operator::Or), + ShortCircuitStrategy::ReturnLeft + )); // Should short-circuit OR + assert!(matches!( + check_short_circuit(&scalar_true, &Operator::And), + ShortCircuitStrategy::ReturnRight + )); // Should return the RHS for AND + + // Scalar false + let scalar_false = ColumnarValue::Scalar(ScalarValue::Boolean(Some(false))); + assert!(matches!( + check_short_circuit(&scalar_false, &Operator::And), + ShortCircuitStrategy::ReturnLeft + )); // Should short-circuit AND + assert!(matches!( + check_short_circuit(&scalar_false, &Operator::Or), + ShortCircuitStrategy::ReturnRight + )); // Should return the RHS for OR + + // Scalar null + let scalar_null = ColumnarValue::Scalar(ScalarValue::Boolean(None)); + assert!(matches!( + check_short_circuit(&scalar_null, &Operator::And), + ShortCircuitStrategy::None + )); + assert!(matches!( + check_short_circuit(&scalar_null, &Operator::Or), + ShortCircuitStrategy::None + )); + } + + /// Test for [pre_selection_scatter] + /// Since [check_short_circuit] ensures that the left side does not contain null and is neither all_true nor all_false, as well as not being empty, + /// the following tests have been designed: + /// 1. Test sparse left with interleaved true/false + /// 2. Test multiple consecutive true blocks + /// 3. Test multiple consecutive true blocks + /// 4. Test single true at first position + /// 5. Test single true at last position + /// 6. Test nulls in right array + /// 7. Test scalar right handling + #[test] + fn test_pre_selection_scatter() { + fn create_bool_array(bools: Vec<bool>) -> BooleanArray { + BooleanArray::from(bools.into_iter().map(Some).collect::<Vec<_>>()) + } + // Test sparse left with interleaved true/false + { + // Left: [T, F, T, F, T] + // Right: [F, T, F] (values for 3 true positions) + let left = create_bool_array(vec![true, false, true, false, true]); + let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![ + false, true, false, + ]))); + + let result = pre_selection_scatter(&left, right).unwrap(); + let result_arr = result.into_array(left.len()).unwrap(); + + let expected = create_bool_array(vec![false, false, true, false, false]); + assert_eq!(&expected, result_arr.as_boolean()); + } + // Test multiple consecutive true blocks + { + // Left: [F, T, T, F, T, T, T] + // Right: [T, F, F, T, F] + let left = + create_bool_array(vec![false, true, true, false, true, true, true]); + let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![ + true, false, false, true, false, + ]))); + + let result = pre_selection_scatter(&left, right).unwrap(); + let result_arr = result.into_array(left.len()).unwrap(); + + let expected = + create_bool_array(vec![false, true, false, false, false, true, false]); + assert_eq!(&expected, result_arr.as_boolean()); + } + // Test single true at first position + { + // Left: [T, F, F] + // Right: [F] + let left = create_bool_array(vec![true, false, false]); + let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![false]))); + + let result = pre_selection_scatter(&left, right).unwrap(); + let result_arr = result.into_array(left.len()).unwrap(); + + let expected = create_bool_array(vec![false, false, false]); + assert_eq!(&expected, result_arr.as_boolean()); + } + // Test single true at last position + { + // Left: [F, F, T] + // Right: [F] + let left = create_bool_array(vec![false, false, true]); + let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![false]))); + + let result = pre_selection_scatter(&left, right).unwrap(); + let result_arr = result.into_array(left.len()).unwrap(); + + let expected = create_bool_array(vec![false, false, false]); + assert_eq!(&expected, result_arr.as_boolean()); + } + // Test nulls in right array + { + // Left: [F, T, F, T] + // Right: [None, Some(false)] (with null at first position) + let left = create_bool_array(vec![false, true, false, true]); + let right_arr = BooleanArray::from(vec![None, Some(false)]); + let right = ColumnarValue::Array(Arc::new(right_arr)); + + let result = pre_selection_scatter(&left, right).unwrap(); + let result_arr = result.into_array(left.len()).unwrap(); + + let expected = BooleanArray::from(vec![ + Some(false), + None, // null from right + Some(false), + Some(false), + ]); + assert_eq!(&expected, result_arr.as_boolean()); + } + // Test scalar right handling + { + // Left: [T, F, T] + // Right: Scalar true + let left = create_bool_array(vec![true, false, true]); + let right = ColumnarValue::Scalar(ScalarValue::Boolean(Some(true))); + + let result = pre_selection_scatter(&left, right).unwrap(); + assert!(matches!(result, ColumnarValue::Scalar(_))); + } } } --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org For additional commands, e-mail: commits-h...@datafusion.apache.org