[
https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595
]
Michał Wesołowski commented on SPARK-16478:
-------------------------------------------
If you run the code I provided on Databricks, you can see that, without
materializing the returned graph, a simple count on its vertices takes about 20
minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and wasn't able to save the result because
of this. After materializing the graph in every iteration, I can save the results
with no problem. Materializing only within the outer loop caused less severe
problems but wasn't sufficient.
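For illustration, a minimal caller-side sketch of caching and materializing the
returned graph. The object and method names (SccWorkaround, sccMaterialized) are
hypothetical. This is not the per-iteration change inside the algorithm described
above: it only keeps later actions from recomputing the result, and the first
materialization may still be slow.

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Graph, VertexId}

// Hypothetical helper, not part of GraphX.
object SccWorkaround {

  // Runs strongly connected components, then caches and materializes the
  // returned graph so that later actions reuse it instead of replaying
  // the full lineage.
  def sccMaterialized[VD: ClassTag, ED: ClassTag](
      graph: Graph[VD, ED],
      numIter: Int): Graph[VertexId, ED] = {
    // Mark the vertices and edges of the result for caching.
    val scc = graph.stronglyConnectedComponents(numIter).cache()
    // Force evaluation now; caching alone is lazy.
    scc.vertices.count()
    scc.edges.count()
    scc
  }
}
```

Usage would be along the lines of `val scc = SccWorkaround.sccMaterialized(graph, 10)`
followed by the save, where `graph` and the iteration count are placeholders.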
> strongly connected components doesn't cache returned RDD
> --------------------------------------------------------
>
> Key: SPARK-16478
> URL: https://issues.apache.org/jira/browse/SPARK-16478
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.6.2
> Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediate RDDs but
> doesn't cache the one that is going to be returned. When the graph is large
> relative to available memory and one runs an action on the returned RDD, the
> whole RDD has to be computed from scratch, which takes much more time than
> StronglyConnectedComponents alone.
> I managed to replicate the issue on the Databricks platform.
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html]
> is the notebook.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)