GitHub user ankurdave opened a pull request:
https://github.com/apache/spark/pull/972
[SPARK-2025] Unpersist edges of previous graph in Pregel
Due to a bug introduced by apache/spark#497, Pregel does not unpersist
replicated vertices from previous iterations. As a result, they stay cached
until memory fills up, which wastes time in garbage collection.
This PR corrects the problem by unpersisting both the edges and the
replicated vertices of previous iterations. This is safe because the edges and
replicated vertices of the current iteration are cached by the call to
`g.cache()` and then materialized by the call to `messages.count()`. Therefore
no unmaterialized RDDs depend on `prevG.edges`. I verified that no
recomputation occurs by running PageRank with a custom patch to Spark that
warns when a partition is recomputed.
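For readers who want to see the ordering argument concretely, here is a minimal sketch of the cache, materialize, then unpersist pattern described above. It is not the Pregel source changed by this PR: `propagateMin` and `sendLabels` are hypothetical names made up for illustration, and the sketch assumes a Spark version whose GraphX provides `aggregateMessages`.

```scala
import org.apache.spark.graphx._

// Hypothetical helper for illustration only; not the Pregel code in this PR.
// Each round, every vertex takes the minimum of its label and its neighbors' labels.
def propagateMin(graph: Graph[Long, Int], maxIterations: Int): Graph[Long, Int] = {
  // Send each endpoint's label to the other endpoint, keeping the minimum per vertex.
  def sendLabels(g: Graph[Long, Int]): VertexRDD[Long] =
    g.aggregateMessages[Long](
      ctx => { ctx.sendToDst(ctx.srcAttr); ctx.sendToSrc(ctx.dstAttr) },
      (a, b) => math.min(a, b))

  // Working copy of the input graph, cached for the first round.
  var g = graph.mapVertices((_, label) => label).cache()
  var messages = sendLabels(g).cache()
  var active = messages.count()
  var i = 0
  while (active > 0 && i < maxIterations) {
    val prevG = g
    // Mark the new graph for caching; nothing is computed yet.
    g = g.joinVertices(messages)((_, label, msg) => math.min(label, msg)).cache()

    val oldMessages = messages
    messages = sendLabels(g).cache()
    // count() forces evaluation, so everything the new messages depend on
    // is computed while prevG is still cached.
    active = messages.count()

    // Only after that evaluation is the previous iteration released.
    oldMessages.unpersist(blocking = false)
    prevG.unpersistVertices(blocking = false)
    prevG.edges.unpersist(blocking = false)
    i += 1
  }
  g
}
```

The point of the sketch is the order of the last few statements in the loop body: the unpersist calls come after `messages.count()`, so nothing still waiting to be computed should need the previous graph.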
Thanks to Tim Weninger for reporting this bug.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ankurdave/spark SPARK-2025
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/972.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #972
----
commit 13d5b07eb48999a967935d6349be556f21f8db2c
Author: Ankur Dave
Date: 2014-06-05T00:19:29Z
Unpersist edges of previous graph in Pregel
----