Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1332#issuecomment-48375419
@aarondav, `newInstance()` is used before we perform resolution to ensure
that all expression ids in a plan are unique. Consider the case where you
self-join an `InMemoryRelation`: we need to know which side of the
join a given attribute is coming from, so we produce unique instances of the
relation before resolving attributes.
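
To make the self-join case concrete, here is a rough sketch of the idea. `ExprId`, `Attr` and `Relation` are hypothetical stand-ins made up for illustration, not the actual Catalyst classes, and the real `newInstance()` may differ in detail:

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for an expression id generator (an assumption, not Catalyst code).
object ExprId {
  private val counter = new AtomicLong(0L)
  def fresh(): Long = counter.getAndIncrement()
}

// Hypothetical attribute: a column name plus an id that identifies this instance.
final case class Attr(name: String, id: Long) {
  // Same name, fresh id.
  def newInstance(): Attr = copy(id = ExprId.fresh())
}

// Hypothetical relation: newInstance() keeps the columns but re-ids every attribute.
final case class Relation(output: Seq[Attr]) {
  def newInstance(): Relation = Relation(output.map(_.newInstance()))
}

object SelfJoinSketch {
  def main(args: Array[String]): Unit = {
    val cached = Relation(Seq(Attr("key", ExprId.fresh()), Attr("value", ExprId.fresh())))

    // Each side of the self-join gets its own instance of the relation.
    val left = cached.newInstance()
    val right = cached.newInstance()

    // Column names match, but ids do not, so "key" on the left can be
    // told apart from "key" on the right during resolution.
    println(left.output.zip(right.output))
  }
}
```

Without the two `newInstance()` calls, both sides would carry identical ids and an attribute reference could not be tied to one side of the join.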
I thought about the possible concurrency issues, but they will only arise
in edge cases (simultaneous self-join queries on a table that is cached but
not yet materialized?), and will only result in double caching, not correctness
issues, so I think this patch is strictly better than what we had before.
That said, I guess we could probably fix it with a `SyncVar`... I'll have to
think about it some more.
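
For reference, a minimal sketch of what "materialize at most once" could look like. This uses a plain `synchronized` block rather than a `SyncVar`, and `MaterializeOnce` is a hypothetical holder invented for illustration, not code from this patch:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical holder (an assumption for illustration): runs `build` at most once,
// even when several threads ask for the value at the same time.
final class MaterializeOnce[A](build: () => A) {
  private var cached: Option[A] = None

  def get(): A = synchronized {
    cached match {
      case Some(value) => value
      case None =>
        val value = build() // only the first caller gets here
        cached = Some(value)
        value
    }
  }
}

object MaterializeOnceSketch {
  // Starts `threads` concurrent readers and returns how many times the build ran.
  def countBuilds(threads: Int): Int = {
    val builds = new AtomicInteger(0)
    val holder = new MaterializeOnce[Seq[Int]](() => {
      builds.incrementAndGet()
      Seq(1, 2, 3) // stands in for the cached data
    })
    val workers = (1 to threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = { holder.get(); () }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    builds.get()
  }

  def main(args: Array[String]): Unit =
    println(s"builds: ${countBuilds(8)}")
}
```

The trade-off is that the lock is held while the build runs, so concurrent callers block until the first one finishes. Whether that is preferable to occasional double caching is a judgment call.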