gabotechs commented on code in PR #23239:
URL: https://github.com/apache/datafusion/pull/23239#discussion_r3497193679


##########
datafusion/physical-optimizer/src/ensure_requirements/enforce_distribution.rs:
##########
@@ -1245,15 +1180,58 @@ pub fn ensure_distribution(
                     child = add_merge_on_top(child, removed_fetch);
                 }
                 Distribution::HashPartitioned(exprs) => {
+                    let child_partitions =
+                        child.plan.output_partitioning().partition_count();
+                    let hash_satisfied = child
+                        .plan
+                        .output_partitioning()
+                        .satisfaction(
+                            &requirement,
+                            child.plan.equivalence_properties(),
+                            allow_subset_satisfy_partitioning,
+                        )
+                        .is_satisfied();
+                    let range_satisfied_for_aggregate =
+                        aggregate_can_reuse_range_partitioning(&plan)
+                            && allow_subset_satisfy_partitioning
+                            && range_partitioning_satisfies_key_partitioning(
+                                child.plan.output_partitioning(),
+                                exprs,
+                                child.plan.equivalence_properties(),
+                                allow_subset_satisfy_partitioning,
+                            );
+
+                    // When subset satisfaction is enabled, preserve an
+                    // already-satisfying partitioning. Otherwise, hash
+                    // repartition may also increase parallelism.
+                    let needs_hash_repartition = if 
allow_subset_satisfy_partitioning {
+                        !hash_satisfied
+                    } else {
+                        !hash_satisfied || target_partitions > child_partitions
+                    };
+                    let should_add_hash_repartition = hash_necessary
+                        && needs_hash_repartition
+                        && !range_satisfied_for_aggregate;
+

Review Comment:
   The `bool` threading in this file is pretty mind-bending, although it seems 
like it's just how it is, I cannot think of a way of simplifying it...
   
   This PR actually manages that complexity relatively well. I'll think a bit 
more about it and see if there's a way we can make it easier, but I don't think 
this is a blocker as long as we have good test coverage.



##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -29,18 +29,16 @@ set datafusion.explain.physical_plan_only = true;
 ##########
 # TEST 1: Aggregate on Range Partition Column
 # Scanning range_key preserves source Range partitioning metadata.
-# Planning still inserts Hash repartitioning today; later optimizer PRs can
-# use this baseline to show when the repartition is removed.
+# Planning does not need Hash repartitioning because Range partitioning
+# colocates rows with equal range_key values.
 ##########
 

Review Comment:
   It seems like we can add a bit more coverage to this file?
   
   For example, adding some positive and negative tests that play with subset 
satisfaction + range partitioning, and trying to get all the boolean code paths 
in `enforce_distribution.rs` to execute. 



##########
datafusion/physical-optimizer/src/sanity_checker.rs:
##########
@@ -162,11 +166,24 @@ pub fn check_plan_sanity(
             }
         }
 
-        if !child
+        let child_satisfies_distribution = child
             .output_partitioning()
             .satisfaction(&dist_req, child_eq_props, true)
-            .is_satisfied()
-        {
+            .is_satisfied();
+        let range_satisfies_aggregate_distribution =

Review Comment:
   Following on our discussion in 
https://github.com/apache/datafusion/issues/23236, I think we'll maintain the 
two `HashPartitioned`(deprecated) and `KeyPartitioned` and treat them as equal.
   
   Probably that can be done in a preliminary PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to