Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/7192#discussion_r34112433
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
@@ -36,11 +36,15 @@ import org.apache.spark.{HashPartitioner, SparkEnv}
 case class Project(projectList: Seq[NamedExpression], child: SparkPlan)
   extends UnaryNode {
   override def output: Seq[Attribute] = projectList.map(_.toAttribute)
-  @transient lazy val buildProjection = newMutableProjection(projectList, child.output)
+  private def buildProjection = newMutableProjection(projectList, child.output)
-  protected override def doExecute(): RDD[InternalRow] = child.execute().mapPartitions { iter =>
-    val reusableProjection = buildProjection()
-    iter.map(reusableProjection)
+  protected override def doExecute(): RDD[InternalRow] = {
+    // Use local variable to avoid referencing to $out inside closure
--- End diff --
Yeah, this is not a hot code path. What I'm trying to do is not save time on
closure cleaning, but reduce the size of the data that we serialize and
transport to the executor side.
Consider a complicated SparkPlan tree on which we call `doExecute` for one
node. Inside `doExecute` we may only need some data related to that node, but
with `$out` referenced, we have to serialize and transport the node together
with all of its children to the executor side.
---