[
https://issues.apache.org/jira/browse/SPARK-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Derrick Burns updated SPARK-6003:
---------------------------------
Description:
This may already exist, but I could not find it.
One of the challenges in developing RELIABLE Spark applications is dealing with
the elegant lazy evaluation semantics of RDD transformations. It would be
useful to have an action with no output whose side effect is to ensure that the
RDD is eagerly evaluated and persisted according to whatever persistence level
is set for the RDD.
Calling RDD.count() or any other action might do the trick -- and indeed I have
tried this -- however, it may be the case that RDD.count() does NOT persist the
data.
For example, MappedRDD(x:RDD).count() === x.count(), so it is possible to
implement count without persisting the result of MappedRDD(x). Without looking
at the code, one cannot know whether an operation is eagerly evaluated AND
persisted or not. Having a standard Spark primitive that both eagerly
evaluates an RDD and persists it according to its current persistence level
would be very useful.
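A minimal sketch of the requested primitive, assuming a hypothetical helper name `sync` (it is not part of the Spark API): run a no-op action over the RDD so that every partition is computed, which materializes any cached/persisted blocks as a side effect.

```scala
import org.apache.spark.rdd.RDD

object SyncSketch {
  // Hypothetical "sync" primitive (not in the Spark API): force
  // evaluation of the RDD and materialize it at whatever
  // persistence level is already set, producing no output.
  def sync[T](rdd: RDD[T]): Unit = {
    // foreachPartition is an action, so it runs every task; the
    // no-op body discards the elements, but because the RDD is
    // persisted, computed partitions are stored by the block
    // manager as a side effect.
    rdd.foreachPartition(_ => ())
  }
}
```

For example, `val data = sc.parallelize(1 to 100).map(_ * 2).cache(); SyncSketch.sync(data)` would leave `data` computed and stored before any later stage reuses it.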
> Spark should offer a "sync" method that guarantees that RDDs are eagerly
> evaluated and persisted
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-6003
> URL: https://issues.apache.org/jira/browse/SPARK-6003
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.1
> Reporter: Derrick Burns
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]