kosiew commented on code in PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002421418

##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -203,6 +203,121 @@ impl Distribution {
         };
         Ok(dt)
     }
+
+    /// Merges two distributions into a single distribution that represents their combined statistics.
+    /// This creates a more general distribution that approximates the mixture of the input distributions.
+    pub fn merge(&self, other: &Self) -> Result<Self> {
+        let range_a = self.range()?;
+        let range_b = other.range()?;
+
+        // Determine data type and create combined range
+        let combined_range = range_a.union(&range_b)?;
+
+        // Calculate weights for the mixture distribution
+        let (weight_a, weight_b) = match (range_a.cardinality(), range_b.cardinality()) {
+            (Some(ca), Some(cb)) => {
+                let total = (ca + cb) as f64;
+                ((ca as f64) / total, (cb as f64) / total)
+            }
+            _ => (0.5, 0.5), // Equal weights if cardinalities not available
+        };
+
+        // Get the original statistics
+        let mean_a = self.mean()?;
+        let mean_b = other.mean()?;
+        let median_a = self.median()?;
+        let median_b = other.median()?;
+        let var_a = self.variance()?;
+        let var_b = other.variance()?;
+
+        // Always use Float64 for intermediate calculations to avoid truncation
+        // I assume that the target type is always numeric
+        // Todo: maybe we can keep all `ScalarValue` as `Float64` in `Distribution`?
+        let calc_type = DataType::Float64;
+
+        // Create weight scalars using Float64 to avoid truncation
+        let weight_a_scalar = ScalarValue::from(weight_a);
+        let weight_b_scalar = ScalarValue::from(weight_b);
+
+        // Calculate combined mean
+        let combined_mean = if mean_a.is_null() || mean_b.is_null() {
+            if mean_a.is_null() {
+                mean_b.clone()
+            } else {
+                mean_a.clone()
+            }
+        } else {
+            // Cast to Float64 for calculation
+            let mean_a_f64 = mean_a.cast_to(&calc_type)?;
+            let mean_b_f64 = mean_b.cast_to(&calc_type)?;
+
+            // Calculate weighted mean
+            mean_a_f64
+                .mul_checked(&weight_a_scalar)?
+                .add_checked(&mean_b_f64.mul_checked(&weight_b_scalar)?)?
+        };
+
+        // Calculate combined median
+        let combined_median = if median_a.is_null() || median_b.is_null() {
+            if median_a.is_null() {
+                median_b
+            } else {
+                median_a
+            }
+        } else {
+            // Cast to Float64 for calculation
+            let median_a_f64 = median_a.cast_to(&calc_type)?;
+            let median_b_f64 = median_b.cast_to(&calc_type)?;
+
+            // Calculate weighted median
+            median_a_f64
+                .mul_checked(&weight_a_scalar)?
+                .add_checked(&median_b_f64.mul_checked(&weight_b_scalar)?)?

Review Comment:
   Medians are not linear statistics. This currently calculates the combined median as a weighted average, which might not represent the true median of the combined distribution. Consider adding a comment discussing this approximation and any potential impact on downstream results.

   For instance, consider two distributions where one is symmetric and the other is highly skewed. The weighted average of their medians may not represent the true central tendency of the merged distribution because the skewness can cause the overall median to shift in a non-linear fashion.

   Potential Impact on Downstream Results:

   - Accuracy: Downstream processes that rely on the combined median might be misled by this approximation, especially in cases where the data's distribution shapes differ significantly.
   - Interpretability: Users expecting an exact median might misinterpret the results, leading to potential errors in statistical analysis or decision-making.
   - Statistical Validity: For critical applications, the inaccuracy of a weighted median approximation might necessitate alternative methods, such as reconstructing the combined distribution's CDF and computing the median directly from it.


##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -203,6 +203,121 @@ impl Distribution {
         };
         Ok(dt)
     }
+
+    /// Merges two distributions into a single distribution that represents their combined statistics.
+    /// This creates a more general distribution that approximates the mixture of the input distributions.
+    pub fn merge(&self, other: &Self) -> Result<Self> {
+        let range_a = self.range()?;
+        let range_b = other.range()?;
+
+        // Determine data type and create combined range
+        let combined_range = range_a.union(&range_b)?;
+
+        // Calculate weights for the mixture distribution
+        let (weight_a, weight_b) = match (range_a.cardinality(), range_b.cardinality()) {
+            (Some(ca), Some(cb)) => {
+                let total = (ca + cb) as f64;
+                ((ca as f64) / total, (cb as f64) / total)

Review Comment:
   Should we add a safeguard or a comment regarding the possibility of zero cardinalities? For example, what should be done if both distributions are empty, or if one distribution has a cardinality of zero? If `ca + cb` is zero, `total` is `0.0` and both weights evaluate to NaN.
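
   On the zero-cardinality question, one possible shape for a safeguard is sketched below. It is only an illustration: `mixture_weights` is a hypothetical helper, not an existing DataFusion function, and it assumes the cardinalities arrive as `Option<u64>`. Falling back to equal weights when the total is zero is one choice among several; returning an error may suit the PR better.

```rust
/// Hypothetical helper (not part of DataFusion): derive mixture weights from
/// two optional cardinalities. Falls back to equal weights when either count
/// is unavailable or when the combined count is zero.
fn mixture_weights(ca: Option<u64>, cb: Option<u64>) -> (f64, f64) {
    match (ca, cb) {
        (Some(a), Some(b)) => {
            // Convert before adding so that very large counts cannot overflow u64.
            let total = a as f64 + b as f64;
            if total > 0.0 {
                (a as f64 / total, b as f64 / total)
            } else {
                // Both sides are empty: 0.0 / 0.0 would produce NaN weights.
                (0.5, 0.5)
            }
        }
        // At least one cardinality is unknown.
        _ => (0.5, 0.5),
    }
}
```

   With this sketch, a single empty side gets weight `0.0`, so it contributes nothing to the weighted statistics, and only the both-empty case needs the explicit fallback.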
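
   On the median approximation raised in the first comment, here is a small standalone sketch of the CDF-based alternative. It does not use DataFusion's `Distribution` API; `uniform_cdf` and `mixture_median` are hypothetical names, and the sketch assumes both component CDFs are continuous, non-decreasing, and available as closures over `f64`.

```rust
/// CDF of a uniform distribution on [lo, hi] (assumes lo < hi).
fn uniform_cdf(x: f64, lo: f64, hi: f64) -> f64 {
    ((x - lo) / (hi - lo)).clamp(0.0, 1.0)
}

/// Median of a two-component mixture, found by bisecting the mixture CDF
/// `w_a * cdf_a(x) + w_b * cdf_b(x)` over the bracket [lo, hi].
/// Assumes the bracket covers both supports and that w_a + w_b == 1.
fn mixture_median(
    cdf_a: impl Fn(f64) -> f64,
    cdf_b: impl Fn(f64) -> f64,
    w_a: f64,
    w_b: f64,
    mut lo: f64,
    mut hi: f64,
) -> f64 {
    for _ in 0..100 {
        let mid = 0.5 * (lo + hi);
        if w_a * cdf_a(mid) + w_b * cdf_b(mid) < 0.5 {
            // Less than half of the mass lies at or below `mid`: move right.
            lo = mid;
        } else {
            // At least half of the mass lies at or below `mid`: move left.
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}
```

   For example, mixing U(0, 1) and U(0, 10) with equal weights gives a weighted average of medians of 0.5 * 0.5 + 0.5 * 5 = 2.75, while the mixture CDF 0.5 * x + 0.5 * x / 10 reaches 0.5 at x = 10/11, roughly 0.909. The gap appears even though both inputs are symmetric.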