Derrick Burns created SPARK-6003:
------------------------------------
Summary: Spark should offer a "sync" method that guarantees that
RDDs are eagerly evaluated and persisted
Key: SPARK-6003
URL: https://issues.apache.org/jira/browse/SPARK-6003
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.2.1
Reporter: Derrick Burns
Priority: Minor
This may already exist, but I could not find it.
One of the challenges in developing RELIABLE Spark applications is dealing with
the elegant lazy evaluation semantics of RDD transformations. It would be
useful to have an action with no output whose side effect is to ensure that the
RDD is eagerly evaluated and persisted according to whatever persistence level
is set for the RDD.
Calling RDD.count() or any other action might do the trick, and indeed I have
tried this. However, it can be the case that RDD.count() does NOT persist the
data.
For example, MappedRDD(x:RDD).count() === x.count(), so it is possible to
implement count without persisting the result of MappedRDD(x). Without looking
at the code, one cannot know whether an operation is eagerly evaluated AND
persisted or not. Having a standard Spark primitive that both eagerly
evaluates an RDD and persists it according to its current persistence level
would be very useful.
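
To make the request concrete, here is a rough sketch of what such a primitive
might look like if written as a user-side helper. The names RddSync and sync
are hypothetical and are not part of the Spark API. The sketch assumes the
caller has already set a persistence level with persist() or cache(), and it
forces evaluation by walking every element of every partition rather than
relying on count().

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper; Spark itself does not provide a method named `sync`.
object RddSync {
  def sync[T](rdd: RDD[T]): RDD[T] = {
    // Assumption: the caller chose a persistence level beforehand.
    // Without one there is nothing to persist, so fail loudly.
    require(rdd.getStorageLevel != StorageLevel.NONE,
      "call persist() or cache() on the RDD before sync()")

    // Consume each partition's iterator completely. Every element is
    // pulled through this RDD's own iterator, so the result does not
    // depend on how a particular action such as count() is implemented.
    rdd.foreachPartition(iter => iter.foreach(_ => ()))

    rdd
  }
}
```

Usage would be along the lines of
RddSync.sync(x.map(f).persist(StorageLevel.MEMORY_AND_DISK)). This is only an
approximation of the requested behavior: with a memory-only level, partitions
that do not fit in memory may still be dropped and recomputed later.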