asubiotto commented on code in PR #16985:
URL: https://github.com/apache/datafusion/pull/16985#discussion_r2244811476
##########
datafusion/physical-plan/src/unnest.rs:
##########
@@ -101,8 +101,22 @@ impl UnnestExec {
         input: &Arc<dyn ExecutionPlan>,
         schema: SchemaRef,
     ) -> PlanProperties {
+        // Extract equivalence properties from input plan
+        let input_eq_properties = input.equivalence_properties();
+        let input_oeq_class = input_eq_properties.oeq_class();
+        let orderings = input_oeq_class.orderings().to_vec();
+        let eq_group = input_eq_properties.eq_group();
+        let constraints = input_eq_properties.constraints();
+

Review Comment:
   I think that rather than creating a new equivalence properties object and trying to set all known fields, I would follow an approach similar to what filter does: https://github.com/apache/datafusion/blob/4bc66c80d560581f527dee5774fb4f0479786d3e/datafusion/physical-plan/src/filter.rs#L267

   That code clones the full input equivalence properties and then modifies them. In this case I believe we should just remove the orderings on the columns we're unnesting. I'm not very familiar with this code, so I'm not sure how to express this exactly (a rough sketch of the idea is appended at the end of this message).


##########
datafusion/sqllogictest/test_files/unnest.slt:
##########
@@ -941,3 +941,60 @@ where min_height * width1 = (
 )
 ----
 4 7 4 28
+
+## Unnest with ordering on unrelated column is preserved
+query TT
+EXPLAIN WITH unnested AS (SELECT
+    ROW_NUMBER() OVER () AS generated_id,
+    unnest(array[value]) as ar
+    FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id;
+----
+logical_plan
+01)Projection: array_agg(unnested.ar)
+02)--Aggregate: groupBy=[[unnested.generated_id]], aggr=[[array_agg(unnested.ar)]]
+03)----SubqueryAlias: unnested
+04)------Projection: generated_id, __unnest_placeholder(make_array(range().value),depth=1) AS UNNEST(make_array(range().value)) AS ar
+05)--------Unnest: lists[__unnest_placeholder(make_array(range().value))|depth=1] structs[]
+06)----------Projection: row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS generated_id, make_array(range().value) AS __unnest_placeholder(make_array(range().value))
+07)------------WindowAggr: windowExpr=[[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
+08)--------------TableScan: range() projection=[value]
+physical_plan
+01)ProjectionExec: expr=[array_agg(unnested.ar)@1 as array_agg(unnested.ar)]
+02)--AggregateExec: mode=FinalPartitioned, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)], ordering_mode=Sorted
+03)----SortExec: expr=[generated_id@0 ASC NULLS LAST], preserve_partitioning=[true]
+04)------CoalesceBatchesExec: target_batch_size=8192
+05)--------RepartitionExec: partitioning=Hash([generated_id@0], 4), input_partitions=4
+06)----------AggregateExec: mode=Partial, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)], ordering_mode=Sorted
+07)------------ProjectionExec: expr=[generated_id@0 as generated_id, __unnest_placeholder(make_array(range().value),depth=1)@1 as ar]
+08)--------------UnnestExec
+09)----------------ProjectionExec: expr=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as generated_id, make_array(value@0) as __unnest_placeholder(make_array(range().value))]
+10)------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+11)--------------------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { name: "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted]
+12)----------------------LazyMemoryExec: partitions=1, batch_generators=[range: start=1, end=5, batch_size=8192]
+
+## Unnest without ordering
+query TT
+EXPLAIN WITH unnested AS (SELECT
+    random() AS generated_id,
+    unnest(array[value]) as ar
+    FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id;

Review Comment:
   I think it's fine to only have a single EXPLAIN test case for the ordering.
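
Appended sketch referenced in the first comment. This is a minimal Rust illustration of the suggested direction (clone the input's equivalence properties as FilterExec does, then drop only the orderings that touch an unnested column), not the PR's actual code. The helper name, the unnested_column_indices parameter, and the clear_orderings / add_new_orderings / collect_columns / oeq_class().iter() calls and import paths are assumptions about the EquivalenceProperties API, which has shifted between DataFusion versions, so treat this as a sketch of the idea rather than a drop-in implementation.

use std::sync::Arc;

use datafusion::physical_expr::utils::collect_columns;
use datafusion::physical_expr::{EquivalenceProperties, LexOrdering};
use datafusion::physical_plan::ExecutionPlan;

/// Illustrative only: derive UnnestExec's equivalence properties by cloning
/// the input's properties (as FilterExec does) and dropping the orderings
/// that mention a column being unnested. `unnested_column_indices` is a
/// hypothetical parameter standing in for however UnnestExec identifies the
/// columns it expands.
fn unnest_equivalence_properties(
    input: &Arc<dyn ExecutionPlan>,
    unnested_column_indices: &[usize],
) -> EquivalenceProperties {
    // Clone the full input properties so that equivalence groups and
    // constraints on untouched columns carry over unchanged.
    let mut eq_properties = input.equivalence_properties().clone();

    // Keep only the orderings whose sort expressions never reference an
    // unnested column; those remain valid because unnesting only repeats
    // rows, while anything involving an unnested column is invalidated.
    let retained: Vec<LexOrdering> = eq_properties
        .oeq_class()
        .iter()
        .filter(|ordering| {
            ordering.iter().all(|sort_expr| {
                collect_columns(&sort_expr.expr)
                    .iter()
                    .all(|col| !unnested_column_indices.contains(&col.index()))
            })
        })
        .cloned()
        .collect();

    // Assumed API: replace the ordering set wholesale. The method names
    // (`clear_orderings`, `add_new_orderings`) may differ by version.
    eq_properties.clear_orderings();
    eq_properties.add_new_orderings(retained);
    eq_properties
}

Two refinements are left out for brevity: keeping the longest prefix of each ordering that precedes the first unnested column instead of discarding the whole ordering, and mapping the cloned properties onto UnnestExec's output schema, since the unnested columns change type there.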