[ 
https://issues.apache.org/jira/browse/SPARK-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derrick Burns updated SPARK-6003:
---------------------------------
    Description: 
This may already exist, but I could not find it.

One of the challenges in developing RELIABLE Spark applications is dealing with 
the elegant lazy evaluation semantics of RDD transformations.  It would be 
useful to have an action with no output whose side effect is to ensure that the 
RDD is eagerly evaluated and persisted according to whatever persistence level 
is set for the RDD.  

Calling RDD.count() or any other action might do the trick -- and indeed I have 
tried this -- however, it may be the case that RDD.count() does NOT persist the 
data. 

For example, MappedRDD(x: RDD).count() === x.count(), so it is possible to 
implement count without persisting the result of MappedRDD(x).  Without looking 
at the code, one cannot know whether an operation is eagerly evaluated AND 
persisted or not.  Having a standard Spark primitive that both eagerly 
evaluates an RDD and persists it according to its current persistence level 
would be very useful.
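
The concern can be illustrated with a minimal, self-contained Python sketch (this is not Spark code; LazySeq, count, and materialize are hypothetical names chosen for the illustration): when count is implemented by delegating to the parent sequence, the mapping function is never invoked and nothing is cached, whereas an explicit "sync"-style primitive forces both evaluation and caching.

```python
class LazySeq:
    """Toy lazy sequence mimicking an RDD with a map transformation."""

    def __init__(self, source, fn=None):
        self.source = source  # a plain list, or a parent LazySeq
        self.fn = fn          # mapping function (None for a base sequence)
        self.cache = None     # populated only by materialize()

    def map(self, fn):
        # Lazy: records the transformation, evaluates nothing.
        return LazySeq(self, fn)

    def count(self):
        # Like MappedRDD(x).count() === x.count(): delegate to the parent,
        # so self.fn is never called and no results are cached.
        if isinstance(self.source, LazySeq):
            return self.source.count()
        return len(self.source)

    def materialize(self):
        # Sketch of the proposed "sync" primitive: eagerly evaluate
        # the whole lineage and cache the result.
        if self.cache is None:
            parent = (self.source.materialize()
                      if isinstance(self.source, LazySeq)
                      else self.source)
            self.cache = ([self.fn(v) for v in parent]
                          if self.fn else list(parent))
        return self.cache


calls = []
base = LazySeq([1, 2, 3])
mapped = base.map(lambda v: calls.append(v) or v * 10)

assert mapped.count() == 3 and calls == []   # counting evaluates nothing
assert mapped.materialize() == [10, 20, 30]  # explicit sync forces evaluation
assert calls == [1, 2, 3]
```

This is only a sketch of the semantics at issue, not a proposal for the Spark API itself; in real Spark, the closest workaround is calling persist() followed by an action, with exactly the caveat the description raises.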



> Spark should offer a "sync" method that guarantees that RDDs are eagerly 
> evaluated and persisted
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6003
>                 URL: https://issues.apache.org/jira/browse/SPARK-6003
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
