robreeves commented on code in PR #40812:
URL: https://github.com/apache/spark/pull/40812#discussion_r1180671671


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##########
@@ -389,7 +389,7 @@ case class InMemoryRelation(
 
   @volatile var statsOfPlanToCache: Statistics = null
 
-  override def innerChildren: Seq[SparkPlan] = Seq(cachedPlan)
+  override val innerChildren: Seq[SparkPlan] = Seq(cachedPlan.clone())

Review Comment:
   Since plans are considered immutable and usually cloned, thread safety is 
not an issue. `cachedPlan` was not cloned so it was easy to get in a situation 
where multiple threads access the same plan objects. This solves that narrow 
issue. After this the `cachedPlan` is no more susceptible to concurrency issues 
than any other object in the plan. 
   
   I still think there is a wider thread safety issue with the mutable state in 
`TreeNode.tags`, but I need to investigate it more to propose a holistic 
solution. Consider the case where in the application code:
   
   1. User thread 1 calls `d.explain("formatted")` where `d` is a `Dataset` 
object. This will set the `TreeNode.tag` keys `QueryPlan.OP_ID_TAG` and 
`QueryPlan.CODEGEN_ID_TAG`. 
   2. At the same time in user thread 2 they call `d.filter($"foo" > 1)`. This 
will create a new dataset that reads the `TreeNode.tags` during the clone. 
   
   At first I considered making `TreeNode.tags` a thread-safe map, but that 
only solves the concurrency issue. At least for this issue, there would still 
be a correctness issue with other threads setting different operator ids for 
the same node. `TreeNode.tags` is used in so many other places that I'm not 
sure if there are other correctness issues.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to