Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7192#discussion_r34061952
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
    @@ -36,11 +36,15 @@ import org.apache.spark.{HashPartitioner, SparkEnv}
     case class Project(projectList: Seq[NamedExpression], child: SparkPlan) 
extends UnaryNode {
       override def output: Seq[Attribute] = projectList.map(_.toAttribute)
     
    -  @transient lazy val buildProjection = newMutableProjection(projectList, 
child.output)
    +  private def buildProjection = newMutableProjection(projectList, 
child.output)
     
    -  protected override def doExecute(): RDD[InternalRow] = 
child.execute().mapPartitions { iter =>
    -    val reusableProjection = buildProjection()
    -    iter.map(reusableProjection)
    +  protected override def doExecute(): RDD[InternalRow] = {
    +    // Use local variable to avoid referencing to $out inside closure
    --- End diff ---
    
    This doesn't actually speed it up that much, because even cleaning a 
serializable closure takes some time. If you want a real speed-up, I would 
recommend avoiding `mapPartitions` and constructing your own 
`MapPartitionsRDD` here, which skips closure cleaning altogether. This is 
safe because we provide the closure ourselves, so we already know it's 
serializable.
    
    An example where we already do this: 
https://github.com/apache/spark/blob/70beb808e13f6371968ac87f7cf625ed110375e6/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L221
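    The suggestion could be sketched roughly like this (a hypothetical sketch, not the actual patch: `MapPartitionsRDD` is `private[spark]`, so this only compiles from a package that can see it, and the three-argument constructor shape is assumed from the Spark 1.x internals):
    
    ```scala
    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.{MapPartitionsRDD, RDD}
    import org.apache.spark.sql.catalyst.InternalRow
    
    protected override def doExecute(): RDD[InternalRow] = {
      // Bind to a local val first so the closure captures only the projection
      // factory, not `this` (the whole SparkPlan).
      val makeProjection = buildProjection
      // Construct the RDD directly instead of calling mapPartitions, which
      // bypasses SparkContext.clean / ClosureCleaner entirely. Safe here
      // because we know the closure we hand in is serializable.
      new MapPartitionsRDD(
        child.execute(),
        (_: TaskContext, _: Int, iter: Iterator[InternalRow]) => {
          val projection = makeProjection()
          iter.map(projection)
        })
    }
    ```
    
    The trade-off is that you take responsibility for serializability yourself: `mapPartitions` would have caught an accidental capture of a non-serializable field at closure-cleaning time, whereas this fails only at task serialization.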

