Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Let me introduce some data first:
1. An SCC run on a randomly generated graph, like the one in the Databricks
notebook I provided, takes about 120 s.
2. Calling sccGraph.vertices.count on the computed graph takes about 20
min.
With the caches I proposed, the SCC run is a few seconds longer, but count
takes only a few seconds.
In total, the original SCC plus count takes about 22 min (depending on the
generated graph), and with the caches it takes about 2 min 10 s. With the graph
built from the real-world data I work with, materializing the returned RDD took
so long that I never managed to finish it (I waited hours without result).
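For anyone who wants to reproduce this kind of measurement, a minimal sketch follows. It is not the notebook code: the `timed` helper, the object name, the graph size, and the iteration count are all assumptions chosen for illustration, and `GraphGenerators.logNormalGraph` is only a stand-in for the randomly generated graph.

```scala
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

object SccTimingSketch {
  // Hypothetical helper: evaluates `body` once and returns its result
  // together with the elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scc-timing-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Assumption: a small random graph standing in for the notebook's graph.
    val graph = GraphGenerators.logNormalGraph(sc, numVertices = 1000)

    // Time the SCC run itself.
    val (sccGraph, sccSeconds) =
      timed(graph.stronglyConnectedComponents(numIter = 10))

    // Time a follow-up action on the returned graph separately, since this
    // is the step reported as slow when its parent RDDs are no longer cached.
    val (vertexCount, countSeconds) = timed(sccGraph.vertices.count())

    println(f"scc run: $sccSeconds%.1f s, vertices.count: $countSeconds%.1f s")
    println(s"vertices: $vertexCount")
    spark.stop()
  }
}
```

Timing the two steps separately matters here because the reported slowdown is in the action after the run, not in the run itself.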
About introducing problems:
1. The original SCC doesn't unpersist workGraph, so the problem already exists,
but I agree that I added more RDDs that are not unpersisted.
2. I tried to unpersist RDDs in the original SCC, but that only added computing
time (I had to materialize every time a reference was about to be lost) and it
didn't solve the problem.
There is a lot of materialization going on in the original SCC implementation
(numVertices), and none of it (outside of Pregel) is unpersisted. When
operating on a large graph with limited memory and lots of iterations, this
causes the intermediate (workGraph) RDDs to be evicted in LRU fashion. At the
same time, the sccGraph that is returned is not persisted, so when I run an
action like count on it after the SCC run, none of the RDDs it depends on are
in the cache any more.
I tried a few different solutions, but only the one I'm proposing worked.
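To illustrate the general idea of keeping the latest result cached while releasing older intermediates, here is a minimal sketch. It is not the code from this PR and not the SCC algorithm: `iterate`, the object name, and the per-round `mapVertices` step are hypothetical placeholders for whatever an iterative job computes.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

object IterativeCachingSketch {
  // Hypothetical iterative job: each round derives a new graph from the
  // previous one. Note that this also unpersists the caller's input graph
  // once the first round has been materialized.
  def iterate(initial: Graph[Long, Int], rounds: Int): Graph[Long, Int] = {
    var current = initial.cache()
    current.vertices.count() // materialize so the cache is actually filled
    current.edges.count()

    for (_ <- 1 to rounds) {
      // Placeholder for the real per-iteration work.
      val next = current.mapVertices((_, attr) => attr + 1).cache()

      // Materialize the new graph while its parent is still cached ...
      next.vertices.count()
      next.edges.count()

      // ... and only then release the parent.
      current.unpersist(blocking = false)
      current = next
    }
    // The returned graph is cached and materialized, so a later action
    // such as vertices.count does not have to recompute the whole lineage.
    current
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("iterative-caching-sketch")
      .getOrCreate()
    val graph = GraphGenerators.logNormalGraph(spark.sparkContext, numVertices = 100)
    val result = iterate(graph, rounds = 5)
    println(s"vertices: ${result.vertices.count()}")
    spark.stop()
  }
}
```

The ordering is the point of the sketch: the new graph is materialized before its parent is unpersisted, so no action has to fall back on evicted or released intermediates.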
Check out the code I provided on Databricks. It should take more than 20 min to
compute. With my implementation it would take less than 3 min in total. I can
provide a runnable Databricks notebook with both solutions.