[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5074: [Enhancement] Don't repartition ProjectionExec when it does not compute anything

via GitHub Mon, 30 Jan 2023 09:34:43 -0800


ozankabak commented on code in PR #5074:
URL: https://github.com/apache/arrow-datafusion/pull/5074#discussion_r1090939100



##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -654,13 +654,13 @@ async fn test_physical_plan_display_indent_multi_children() {
         "    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: \"c1\", index: 0 }, Column { name: \"c2\", index: 0 })]",
         "      CoalesceBatchesExec: target_batch_size=4096",
         "        RepartitionExec: partitioning=Hash([Column { name: \"c1\", index: 0 }], 9000), input_partitions=9000",
-        "          ProjectionExec: expr=[c1@0 as c1]",
-        "            RepartitionExec: partitioning=RoundRobinBatch(9000), input_partitions=1",
+        "          RepartitionExec: partitioning=RoundRobinBatch(9000), input_partitions=1",

Review Comment:
   I think @andygrove came across this behavior recently and @Dandandan had a 
good explanation of why this happens. IIRC, this surprising-looking repartition 
is actually useful: hash repartitioning benefits from parallelization, which 
the round-robin (RR) repartition supplies by spreading the single input 
partition across many partitions.
   
   The net effect of this PR is simply to move the RR from below the projection 
to above it.
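
   To illustrate the point about RR feeding the hash repartition, here is a 
minimal standalone sketch. It is a toy model using only the Rust standard 
library, not DataFusion's `RepartitionExec`; the function names 
(`round_robin`, `hash_partition`, `repartition`) and the use of plain `u64` 
rows are assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Round-robin: deal whole batches out to `n` partitions in turn.
/// No row is inspected, so this step is cheap.
fn round_robin(batches: Vec<Vec<u64>>, n: usize) -> Vec<Vec<Vec<u64>>> {
    let mut parts: Vec<Vec<Vec<u64>>> = (0..n).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        parts[i % n].push(batch);
    }
    parts
}

/// Hash partitioning: route every row by the hash of its key.
/// This touches each row, so it is the expensive step.
fn hash_partition(rows: &[u64], n: usize) -> Vec<Vec<u64>> {
    let mut parts = vec![Vec::new(); n];
    for &row in rows {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        parts[(hasher.finish() % n as u64) as usize].push(row);
    }
    parts
}

/// RR first, then hash: each round-robin partition is hashed on its own
/// thread, and the per-thread results are merged per output partition.
fn repartition(batches: Vec<Vec<u64>>, n: usize) -> Vec<Vec<u64>> {
    let rr = round_robin(batches, n);
    let partials: Vec<Vec<Vec<u64>>> = thread::scope(|s| {
        let handles: Vec<_> = rr
            .iter()
            .map(|part| {
                s.spawn(move || {
                    let rows: Vec<u64> = part.iter().flatten().copied().collect();
                    hash_partition(&rows, n)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut out = vec![Vec::new(); n];
    for partial in partials {
        for (i, rows) in partial.into_iter().enumerate() {
            out[i].extend(rows);
        }
    }
    out
}
```

   Without the `round_robin` step, a single input partition would leave one 
thread doing all of the hashing; with it, the hashing work is split across 
`n` threads while each key still lands in the same output partition.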



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.