cloud-fan commented on a change in pull request #24049: [SPARK-27123][SQL]
Improve CollapseProject to handle projects cross limit/repartition/sample
URL: https://github.com/apache/spark/pull/24049#discussion_r264988470
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -699,6 +699,24 @@ object CollapseProject extends Rule[LogicalPlan] {
agg.copy(aggregateExpressions = buildCleanedProjectList(
p.projectList, agg.aggregateExpressions))
}
+ case p1 @ Project(_, g @ GlobalLimit(_, l @ LocalLimit(_, p2: Project))) =>
Review comment:
Sorry to be late for the review. I have two concerns about this optimization:
1. If `p2` outputs one column and `p1` outputs 1000 columns, pushing `p1` down
through the limit operator increases the amount of data to be shuffled.
2. If `p1` contains an expensive expression such as a UDF, pushing `p1` through
the limit operator means the expensive expression will be evaluated many more
times (see the sketch below for a query where both concerns apply).
Do we have a general rule to justify the benefit of pushing down the project
operator?
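
To make the two cases concrete, here is a minimal sketch of the kind of query
where both concerns apply. It assumes a local `SparkSession` and a hypothetical
`expensiveUdf`; it is not taken from this PR's tests, just an illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CollapseAcrossLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collapse-sketch").getOrCreate()

    // Hypothetical stand-in for "an expensive expression like UDF": costly per row.
    val expensiveUdf = udf((s: String) => { Thread.sleep(1); s.toUpperCase })

    // p2: the inner Project, producing a single narrow column.
    val p2 = spark.range(1000000L).selectExpr("CAST(id AS STRING) AS s")

    // GlobalLimit(LocalLimit(p2, 10), 10) in the logical plan.
    val limited = p2.limit(10)

    // p1: the outer Project applying the expensive expression
    // (or imagine it producing 1000 derived columns instead).
    val p1 = limited.select(expensiveUdf(col("s")).as("u"))

    // If CollapseProject merged p1 into p2 across the limit, the UDF would run on
    // up to 10 rows in every partition feeding the shuffle that GlobalLimit needs,
    // instead of only on the 10 rows that survive the limit; a wide p1 would
    // likewise widen every row shuffled between LocalLimit and GlobalLimit.
    p1.explain(true)

    spark.stop()
  }
}
```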