kosiew commented on code in PR #21315:
URL: https://github.com/apache/datafusion/pull/21315#discussion_r3121924591
##########
datafusion/functions-aggregate-common/src/min_max.rs:
##########
@@ -760,37 +813,40 @@ pub fn min_batch(values: &ArrayRef) ->
Result<ScalarValue> {
min_binary_view
)
}
- DataType::Struct(_) => min_max_batch_generic(values,
Ordering::Greater)?,
- DataType::List(_) => min_max_batch_generic(values, Ordering::Greater)?,
- DataType::LargeList(_) => min_max_batch_generic(values,
Ordering::Greater)?,
- DataType::FixedSizeList(_, _) => {
- min_max_batch_generic(values, Ordering::Greater)?
- }
- DataType::Dictionary(_, _) => {
- let values = values.as_any_dictionary().values();
- min_batch(values)?
- }
+ DataType::Struct(_)
+ | DataType::List(_)
+ | DataType::LargeList(_)
+ | DataType::FixedSizeList(_, _)
+ | DataType::Dictionary(_, _) => min_max_batch_generic(values,
Ordering::Greater)?,
_ => min_max_batch!(values, min),
})
}
-/// Generic min/max implementation for complex types
-fn min_max_batch_generic(array: &ArrayRef, ordering: Ordering) ->
Result<ScalarValue> {
- if array.len() == array.null_count() {
- return ScalarValue::try_from(array.data_type());
- }
- let mut extreme = ScalarValue::try_from_array(array, 0)?;
- for i in 1..array.len() {
- let current = ScalarValue::try_from_array(array, i)?;
- if current.is_null() {
- continue;
+/// Finds the min/max by scanning logical rows via
`ScalarValue::try_from_array`.
+///
+/// This path is required for dictionary arrays because comparing
+/// `dictionary.values()` is not semantically correct: it can include
+/// unreferenced values and ignore null key positions.
+fn min_max_batch_generic(values: &ArrayRef, ordering: Ordering) ->
Result<ScalarValue> {
+ let mut index = 0;
Review Comment:
> it is not the responsibility of this function that dictionary.values()
isn't semantically correct
I amended the comment and locked this down with a new test.
> ..why this function is being changed....the original version already works
as intended.
The changed `min_max_batch_generic` shape is a secondary improvement
(independent of the bug fix): it first finds the first non-null row and only
then enters comparison.
The two-phase version is easier to follow because it separates setup from
comparison.
1. Phase 1 finds the first non-null value.
2. Phase 2 compares remaining non-null values against that baseline.
After phase 1, `extreme` is always non-null, so the loop only needs to:
1. Skip null `current` values.
2. Compare/update for non-null values.
The old single-loop mixed these concerns in every iteration (`current` null,
`extreme` null, or compare), which made the logic more branchy and harder to
read.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]