Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7192#discussion_r34112433
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
    @@ -36,11 +36,15 @@ import org.apache.spark.{HashPartitioner, SparkEnv}
     case class Project(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode {
       override def output: Seq[Attribute] = projectList.map(_.toAttribute)
     
    -  @transient lazy val buildProjection = newMutableProjection(projectList, child.output)
    +  private def buildProjection = newMutableProjection(projectList, child.output)
     
    -  protected override def doExecute(): RDD[InternalRow] = child.execute().mapPartitions { iter =>
    -    val reusableProjection = buildProjection()
    -    iter.map(reusableProjection)
    +  protected override def doExecute(): RDD[InternalRow] = {
    +    // Use local variable to avoid referencing to $out inside closure
    --- End diff ---
    
    yea, this is not a hot code path. What I'm trying to do is not to save time on closure cleaning, but to reduce the amount of data we serialize and transport to the executor side.
    Consider a complicated SparkPlan tree on which we call `doExecute` for one node. Inside `doExecute` we may only need some data related to that node, but with `$out` referenced, we have to serialize and transport that node together with all of its children to the executor side.


