[
https://issues.apache.org/jira/browse/SPARK-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336217#comment-14336217
]
Sean Owen commented on SPARK-6003:
----------------------------------
This has been discussed a few times. foreachPartition(p => None) does it. I
don't know if it's even worth wrapping up in a helper method.
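In Spark's Scala API, that trick looks roughly like this (a sketch; `sc` is assumed to be an existing SparkContext):

```scala
val data = sc.parallelize(1 to 1000).map(_ * 2)
data.persist()                   // mark for caching; nothing is computed yet
data.foreachPartition(p => None) // no-op action: walks every partition,
                                 // forcing evaluation and populating the cache
```

Because `foreachPartition` is an action, it must compute every partition, and since the RDD is marked persistent, each partition is cached as a side effect.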
> Spark should offer a "sync" method that guarantees that RDDs are eagerly
> evaluated and persisted
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-6003
> URL: https://issues.apache.org/jira/browse/SPARK-6003
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.1
> Reporter: Derrick Burns
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> This may already exist, but I could not find it.
> One of the challenges in developing RELIABLE Spark applications is dealing
> with the elegant lazy evaluation semantics of RDD transformations. It would
> be useful to have an action with no output whose side effect is to ensure that
> the RDD is eagerly evaluated and persisted according to whatever persistence
> level is set for the RDD.
> Calling RDD.count() or any other action might do the trick -- and indeed I
> have tried this -- however, it can be the case that RDD.count() does NOT
> persist the data.
> For example, MappedRDD(x:RDD).count() === x.count(), so it is possible to
> implement count without persisting the result of MappedRDD(x). Without
> looking at the code, one cannot know whether an operation is eagerly
> evaluated AND persisted or not. Having a standard Spark primitive that both
> eagerly evaluates an RDD and persists it according to its current
> persistence level would be very useful.
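The primitive the reporter asks for could be sketched as a small helper on top of the no-op action idiom. This is not part of the Spark API; `eagerlyPersist` is a hypothetical name, and the code assumes the RDD's storage level has already been set by the caller:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (NOT a Spark primitive): force evaluation and
// materialize the RDD at whatever storage level is already set on it,
// returning the same RDD so calls can be chained.
def eagerlyPersist[T](rdd: RDD[T]): RDD[T] = {
  // foreachPartition is an action with no output; running it computes
  // every partition, which caches the data if a storage level is set.
  rdd.foreachPartition(_ => ())
  rdd
}
```

Unlike `count()`, which an implementation is free to answer without materializing intermediate results (e.g. a mapped RDD has the same count as its parent), this helper touches every partition of the RDD itself, so the cache is guaranteed to be populated.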
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)