[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

wesolowskim Mon, 11 Jul 2016 05:44:26 -0700

Github user wesolowskim commented on the issue:

    https://github.com/apache/spark/pull/14137
  
    Let me introduce some data first:
    1. An SCC run on a randomly generated graph, like the one I provided in
    the Databricks notebook, takes about 120 s.
    2. Calling sccGraph.vertices.count on the computed graph takes about
    20 min.
    
    With the caches I proposed, the SCC run takes a few seconds longer, but
    the count takes only a few seconds.
    
    In total, the original SCC plus count takes about 22 min (depending on
    the generated graph), while with the caches it takes about 2 min 10 s.
    With the graph built from the real-world data I work with, materializing
    the returned RDD took so long that I never managed to finish it (I waited
    hours without result).
    
    About introducing problems:
    
    1. The original SCC doesn't unpersist workGraph, so the problem already
    exists, but I agree that I added more RDDs that are not unpersisted.
    2. I tried to unpersist RDDs in the original SCC, but it only added
    computing time (I had to materialize every time a reference was about to
    be lost) and it didn't solve the problem.
    
    
    There is a lot of materialization going on in the original SCC
    implementation (numVertices), and none of it (outside of Pregel) is
    unpersisted. When operating on a large graph with limited memory and lots
    of iterations, this causes the intermediate (workGraph) RDDs to be evicted
    in LRU fashion. At the same time, the sccGraph that is returned is not
    persisted, so when I run an action like count outside of SCC after the
    run, none of the RDDs it depends on are in the cache.
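
    The eviction argument above can be illustrated with a small toy model in
    plain Python. This is not Spark or GraphX code: LruCache, Dataset and
    simulate are hypothetical names made up for this sketch, the cache
    capacity and iteration count are arbitrary, and the lineage shape (each
    result depends on the previous result and the current work dataset) is
    an assumption about how the loop is structured. It only shows why an
    unpersisted result whose cached parents were evicted forces the whole
    lineage to be recomputed on the first action.

```python
from collections import OrderedDict


class LruCache:
    """Bounded store with least-recently-used eviction (toy stand-in for a cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, value):
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block


class Dataset:
    """Lazy dataset that remembers its parents (lineage), loosely modelled on an RDD."""

    def __init__(self, name, parents, cache, stats):
        self.name, self.parents = name, parents
        self.cache, self.stats = cache, stats
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def compute(self):
        if self.persisted:
            hit = self.cache.get(self.name)
            if hit is not None:
                return hit
        # Cache miss (or not persisted): walk the lineage and recompute.
        inputs = [parent.compute() for parent in self.parents]
        self.stats["computed"] += 1
        result = list(inputs[0]) if inputs else list(range(4))
        if self.persisted:
            self.cache.put(self.name, result)
        return result

    def count(self):
        return len(self.compute())


def simulate(iterations, capacity, persist_result):
    """Return (count, number of datasets computed by the caller's final count)."""
    cache, stats = LruCache(capacity), {"computed": 0}
    work = Dataset("work0", [], cache, stats).persist()
    scc = Dataset("scc0", [work], cache, stats)
    if persist_result:
        scc.persist()
    for i in range(1, iterations + 1):
        work = Dataset(f"work{i}", [work], cache, stats).persist()
        work.count()  # materialized inside the loop
        scc = Dataset(f"scc{i}", [scc, work], cache, stats)
        if persist_result:
            scc.persist()
            scc.count()  # materialize while its parents are still cached
    computed_during_run = stats["computed"]
    n = scc.count()  # the action the caller runs after the algorithm returns
    return n, stats["computed"] - computed_during_run


if __name__ == "__main__":
    for persist_result in (False, True):
        n, extra = simulate(iterations=5, capacity=3, persist_result=persist_result)
        print(f"persist_result={persist_result}: count={n}, computed by final count={extra}")
```

    In this model the persisted variant does a little extra work inside the
    loop, but the caller's final count is served from the cache, whereas the
    unpersisted variant has to rebuild the evicted intermediate datasets.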
    
    I tried a few different solutions, but only the one I'm proposing worked.
    Check out the code I provided on Databricks. It should take more than
    20 min to compute. If my implementation were used instead, it would take
    less than 3 min in total. I can provide a runnable Databricks notebook
    with both solutions.



