Derrick Burns created SPARK-6003:
------------------------------------

             Summary: Spark should offer a "sync" method that guarantees that 
RDDs are eagerly evaluated and persisted
                 Key: SPARK-6003
                 URL: https://issues.apache.org/jira/browse/SPARK-6003
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.2.1
            Reporter: Derrick Burns
            Priority: Minor


This may already exist, but I could not find it.

One of the challenges in developing RELIABLE Spark applications is dealing with 
the elegant lazy evaluation semantics of RDD transformations.  It would be 
useful to have an action with no output whose side effect is to ensure that the 
RDD is eagerly evaluated and persisted according to whatever persistence level 
is set for the RDD.

Calling RDD.count() or any other action might do the trick -- and indeed I have 
tried this -- however, it can be the case that RDD.count() does NOT persist the 
data.

For example, MappedRDD(x:RDD).count() === x.count(), so it is possible to 
implement count without computing or persisting the result of MappedRDD(x).  
Without looking at the code, one cannot know whether an operation is eagerly 
evaluated AND persisted or not.  Having a standard Spark primitive that both 
eagerly evaluates an RDD and persists it according to its current persistence 
level would be very useful.
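
To make the concern concrete, here is a minimal toy model in plain Python. It is 
not Spark code and does not describe how Spark actually implements count(); the 
class LazyDataset and its methods (persist, map, count, sync) are hypothetical 
names invented for this illustration. It only sketches how a count() that is 
answered from the parent could leave a mapped dataset unevaluated, and how a 
dedicated sync() could guarantee evaluation plus persistence.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self,
                 compute: Callable[[], List[int]],
                 size_hint: Optional[Callable[[], int]] = None):
        self._compute = compute          # produces the elements on demand
        self._size_hint = size_hint      # optional shortcut for count()
        self._persist_requested = False  # stands in for a persistence level
        self.cached: Optional[List[int]] = None
        self.evaluations = 0             # how often _compute actually ran

    def persist(self) -> "LazyDataset":
        """Only records the request; nothing is evaluated here."""
        self._persist_requested = True
        return self

    def _materialize(self) -> List[int]:
        if self.cached is not None:
            return self.cached
        self.evaluations += 1
        data = self._compute()
        if self._persist_requested:
            self.cached = data
        return data

    def map(self, f: Callable[[int], int]) -> "LazyDataset":
        parent = self
        # A map never changes the number of elements, so the child can
        # answer count() by asking the parent instead of running f.
        return LazyDataset(
            compute=lambda: [f(x) for x in parent._materialize()],
            size_hint=parent.count,
        )

    def count(self) -> int:
        if self.cached is not None:
            return len(self.cached)
        if self._size_hint is not None:
            return self._size_hint()     # shortcut: this dataset stays lazy
        return len(self._materialize())

    def sync(self) -> None:
        """Always evaluate; the result is cached if persist() was requested."""
        self._materialize()


base = LazyDataset(lambda: [1, 2, 3])
mapped = base.map(lambda x: x * 10).persist()

n = mapped.count()                     # answered via the parent's count
cached_after_count = mapped.cached     # still None: nothing was persisted

mapped.sync()                          # forces evaluation and caching
cached_after_sync = mapped.cached
```

In this model the caller cannot tell from the return value of count() whether 
the mapped data was ever computed, which is the ambiguity described above; 
sync() removes it by making evaluation its only purpose.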



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
