robreeves commented on PR #40812: URL: https://github.com/apache/spark/pull/40812#issuecomment-1542680938
> I agree that we do need to copy the cached plan when getting it, but I feel the current change is hard to understand and reason about. > > I think a better place to fix is `CacheManager#useCachedDataInternal`. It does copy `InMemoryRelation` but does not copy the physical plan: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L298-L306 > > We can add a new method `InMemoryRelation#freshCopy` which takes new output attributes and copy the physical plan, then we change `cached.cachedRepresentation.withOutput(currentFragment.output)` to `cached.cachedRepresentation.freshCopy(currentFragment.output)` The challenge with this approach is that `InMemoryRelation` does not get the physical plan passed in directly. It is through `CachedRDDBuilder`. I originally tried making a new `CachedRDDBuilder` with a copy of the physical plan, but it broke other tests due to the private state it maintains. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
