[
https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595
]
Michał Wesołowski edited comment on SPARK-16478 at 7/11/16 11:47 AM:
---------------------------------------------------------------------
If you run the code that I provided on Databricks, you can see that, without
materializing the returned graph, a simple count on its vertices takes about 20
minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because
of this. After materializing the graph on every iteration I can save the results
with no problem. Materializing only within the outer loop caused less severe
problems but wasn't sufficient.
The original implementation caches a lot of RDDs and immediately materializes
them. Some of them are evicted before scc returns, because Spark evicts cached
data in LRU fashion, but the returned RDDs are not materialized and depend on
the ones already removed from RAM. That is my current understanding of the
observed behavior.
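
The mechanism described above can be illustrated with a small toy model. This
is only a sketch in plain Python and not Spark code: LruStore, Lazy and run are
hypothetical names made up for the illustration, and the cache capacity and
iteration count are arbitrary. Each "work" dataset is cached and materialized
immediately, while the "result" dataset depends on the previous result plus
the current work dataset and is materialized only if materialize_result is set.

```python
from collections import OrderedDict


class LruStore:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def hit(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as recently used
            return True
        return False

    def put(self, name):
        self.entries[name] = True
        self.entries.move_to_end(name)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry


class Lazy:
    """Toy lazy dataset: recomputed from its parents unless it is cached."""

    def __init__(self, name, parents, store=None):
        self.name, self.parents, self.store = name, parents, store

    def compute(self, log):
        if self.store is not None and self.store.hit(self.name):
            return  # served from cache, no work done
        for parent in self.parents:
            parent.compute(log)
        log.append(self.name)  # one unit of real work
        if self.store is not None:
            self.store.put(self.name)


def run(iterations, capacity, materialize_result):
    """Return how many computations a final action on the result triggers."""
    store, log = LruStore(capacity), []
    work, result = None, None
    for k in range(iterations):
        # Intermediate dataset: cached and materialized immediately.
        work = Lazy(f"work{k}", [work] if work is not None else [], store)
        work.compute(log)
        # Result dataset: depends on the previous result and on this step.
        parents = ([result] if result is not None else []) + [work]
        result = Lazy(f"res{k}", parents, store if materialize_result else None)
        if materialize_result:
            result.compute(log)
    before = len(log)
    result.compute(log)  # stands in for the simple count on the returned data
    return len(log) - before


if __name__ == "__main__":
    print(run(4, 3, materialize_result=False))
    print(run(4, 3, materialize_result=True))
```

In this toy setup (4 iterations, room for 3 cached entries) the final action
triggers 8 computations when the result is never materialized, because the
early work datasets have been evicted and the whole lineage is rebuilt, and 0
computations when the result is materialized on every iteration. The numbers
only show the shape of the problem, not actual Spark timings.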
> strongly connected components doesn't cache returned RDD
> --------------------------------------------------------
>
> Key: SPARK-16478
> URL: https://issues.apache.org/jira/browse/SPARK-16478
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.6.2
> Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediate RDDs but
> doesn't cache the one that is going to be returned. When the graph is large
> relative to available memory, any action on the returned RDD forces the whole
> RDD to be computed from scratch, which takes much more time than
> StronglyConnectedComponents alone.
> I managed to replicate the issue on the Databricks platform.
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html]
> is the notebook.