robreeves commented on PR #40812:
URL: https://github.com/apache/spark/pull/40812#issuecomment-1552124399
>
@cloud-fan Cloning the cachedPlan is also problematic because it contains
state (accumulators in private fields) when it includes a `CollectMetricsExec`
operator. `CollectMetricsExec.collect` specifically looks at the
`InMemoryRelation.cachedPlan` to get the stateful metrics.
I verified this is an issue by cloning `InMemoryRelation.cachedPlan`. Then I
modified a unit test that uses `Dataset.observe` to cache the dataset. This
breaks the unit test. When I revert cloning the cachedPlan it passes. Here is
how I modified the unit test to prove the issue.
```
diff --git
a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala
b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala
index f046daacb91..3de33e3e1b2 100644
---
a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala
+++
b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala
@@ -277,7 +277,7 @@ class DataFrameCallbackSuite extends QueryTest
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
- count(when($"id" % 2 === 0, 1)).as("num_even"))
+ count(when($"id" % 2 === 0, 1)).as("num_even")).cache()
.observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
```
So I think we should stick with only cloning it for the innerChildren since
the only usage is the ExplainUtils and statefulness doesn't matter, besides the
`TreeNode.tag` values.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]