Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
You are right, the code already had a `.cache()` call without a matching
`.unpersist()`. It eventually gets cleaned up, but that's not good form, and it
probably didn't accomplish much either, except perhaps truncate the lineage.
The problem with cache and unpersist is that if nothing materializes the
RDD between the two calls, the pair does nothing. Yes, I know that's why you
forced materialization with an extra call to `count`.
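(For context, `cache()` in Spark is lazy: it only marks an RDD for storage, and nothing is actually persisted until an action runs on it. A minimal sketch with generic names, not the PR's code, assuming an existing `SparkContext` named `sc`:

```scala
val rdd = sc.textFile("input.txt").map(_.length)

rdd.cache()      // lazy: only marks the RDD for storage; nothing is stored yet
rdd.count()      // an action; this is what actually populates the cache
// ... later actions on rdd now read from the cache instead of recomputing ...
rdd.unpersist()  // without an action in between, there was nothing to release
```

This is why a bare `cache()` / `unpersist()` pair with no intervening action is a no-op.)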
The thing is, caching by itself can't make the underlying computation
faster and none of the RDDs that are now cached are used a second time
(right?). I think something else is at work here.
One guess is that the current calls to `cache()`, which don't seem quite
right, are themselves the problem. Each one computes the whole lineage at once
and attempts to persist all the RDDs at once, competing with other cached RDDs,
evicting them, and generally wasting time, because the RDDs aren't reused at
all (I think). If that's true, then simply removing all the caching should also
help a lot. If not, then that's not the explanation.
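(The rule of thumb being appealed to here: cache an RDD only when more than one action consumes it, and unpersist it once the last consumer has run. A hedged sketch, again with generic names and a hypothetical `parse` function:

```scala
val parsed = sc.textFile("input.txt").map(parse)

parsed.cache()                  // worthwhile only because parsed is consumed twice below
val total  = parsed.count()     // first action: materializes and populates the cache
val sample = parsed.take(10)    // second action: served from the cache
parsed.unpersist()              // release storage once the last consumer has run

// By contrast, an RDD consumed by a single action gains nothing from cache():
// it is computed exactly once either way, and caching it just competes for
// storage with RDDs that are genuinely reused.
```

Caching RDDs that are never reused, as suspected above, pays the storage cost without ever collecting the recomputation savings.)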
(PS: `sccGraphCountVertices` here is superfluous; you can omit it.)