[
https://issues.apache.org/jira/browse/SPARK-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336217#comment-14336217
]
Sean Owen commented on SPARK-6003:
----------------------------------
This has been discussed a few times. foreachPartition(p => None) does it. I
don't know if it's even worth wrapping up in a helper method.
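In Spark's Scala API, that trick looks roughly like this (a sketch; `sc` is assumed to be an existing SparkContext):

```scala
val data = sc.parallelize(1 to 1000).map(_ * 2)
data.persist()                   // mark for caching; nothing is computed yet
data.foreachPartition(p => None) // no-op action: walks every partition,
                                 // forcing evaluation and populating the cache
```

Because `foreachPartition` is an action, it must compute every partition, and since the RDD is marked persistent, each partition is cached as a side effect.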
> Spark should offer a "sync" method that guarantees that RDDs are eagerly
> evaluated and persisted
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-6003
> URL: https://issues.apache.org/jira/browse/SPARK-6003
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.1
> Reporter: Derrick Burns
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> This may already exist, but I could not find it.
> One of the challenges in developing RELIABLE Spark applications is dealing
> with the elegant lazy evaluation semantics of RDD transformations. It would
> be useful to have an action with no output whose side effect is to ensure that
> the RDD is eagerly evaluated and persisted according to whatever persistence
> level is set for the RDD.
> Calling RDD.count() or any other action might do the trick -- and indeed I
> have tried this -- however, it can be the case that RDD.count() does NOT
> persist the data.
> For example, MappedRDD(x:RDD).count() === x.count(), so it is possible to
> implement count without persisting the result of MappedRDD(x). Without
> looking at the code, one cannot know whether an operation is eagerly
> evaluated AND persisted or not. Having a standard Spark primitive that both
> eagerly evaluates an RDD and persists it according to its current
> persistence level would be very useful.
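The primitive the reporter asks for could be sketched as a small helper on top of the no-op action idiom. This is not part of the Spark API; `eagerlyPersist` is a hypothetical name, and the code assumes the RDD's storage level has already been set by the caller:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (NOT a Spark primitive): force evaluation and
// materialize the RDD at whatever storage level is already set on it,
// returning the same RDD so calls can be chained.
def eagerlyPersist[T](rdd: RDD[T]): RDD[T] = {
  // foreachPartition is an action with no output; running it computes
  // every partition, which caches the data if a storage level is set.
  rdd.foreachPartition(_ => ())
  rdd
}
```

Unlike `count()`, which an implementation is free to answer without materializing intermediate results (e.g. a mapped RDD has the same count as its parent), this helper touches every partition of the RDD itself, so the cache is guaranteed to be populated.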
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)