[ 
https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041343#comment-14041343
 ] 

Ankur Dave commented on SPARK-2245:
-----------------------------------

This problem occurs because VertexRDD stores the vertices in a blocked format 
for performance, in VertexRDD.partitionsRDD. The pair interface 
RDD[(VertexId, VD)] that VertexRDD presents is derived from the partitionsRDD. 
However, VertexRDD.count() is implemented by operating directly on the 
partitionsRDD rather than first constructing all the pairs and then counting 
them. Therefore calling count() materializes only the partitionsRDD, not the 
VertexRDD itself, so a checkpoint requested on the VertexRDD is not written by 
that call.

The workaround is to call vertexRDD.partitionsRDD.checkpoint() rather than 
vertexRDD.checkpoint(). In the future we should override the checkpoint() 
method on VertexRDD to delegate to the partitionsRDD in the same way that 
cache() does.
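
As a rough illustration of the workaround, here is a minimal sketch. It assumes 
a Spark 1.0.x build in which partitionsRDD is a public member of VertexRDD 
(later releases may hide it), a local master, and a writable checkpoint 
directory. The object name, the helper checkpointedCount, the sample vertices 
and the /tmp path are made up for the example and are not part of GraphX.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}

object CheckpointVertexRDD {

  // Hypothetical helper: builds a small VertexRDD, marks its partitionsRDD
  // for checkpointing, runs count(), and reports whether the checkpoint
  // was written.
  def checkpointedCount(sc: SparkContext): (Long, Boolean) = {
    val pairs = sc.parallelize(
      Seq[(VertexId, String)]((1L, "a"), (2L, "b"), (3L, "c")))
    val vertices: VertexRDD[String] = VertexRDD(pairs)

    // Workaround from the comment: checkpoint the underlying blocked RDD
    // instead of the VertexRDD itself. This must happen before any job
    // has been run on the RDD.
    vertices.partitionsRDD.checkpoint()

    // VertexRDD.count() runs a job over partitionsRDD, which is what
    // triggers the checkpoint to be written.
    val n = vertices.count()
    (n, vertices.partitionsRDD.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-vertexrdd")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed checkpoint location; any writable directory should do.
      sc.setCheckpointDir("/tmp/spark-checkpoints")
      val (n, done) = checkpointedCount(sc)
      println(s"vertices: $n, partitionsRDD checkpointed: $done")
    } finally {
      sc.stop()
    }
  }
}
```

Calling vertices.checkpoint() in place of vertices.partitionsRDD.checkpoint() 
in this sketch would be expected to reproduce the reported problem, since the 
job started by count() never computes the VertexRDD itself.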

> VertexRDD can not be materialized for checkpointing
> ---------------------------------------------------
>
>                 Key: SPARK-2245
>                 URL: https://issues.apache.org/jira/browse/SPARK-2245
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>            Reporter: Baoxu Shi
>
> It seems one cannot materialize a VertexRDD by simply calling the count 
> method, which is overridden by VertexRDD. Calling RDD's count, however, does 
> materialize it.
> Is this a feature designed to get the count without materializing the 
> VertexRDD? If so, do you think it is necessary to add a materialize 
> method to VertexRDD?
> By the way, is count() the cheapest way to materialize an RDD? Or does it 
> cost the same resources as other actions?
> The pull request is here:
> https://github.com/apache/spark/pull/1177
> Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)
