[
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-5499.
------------------------------
Resolution: Not a Problem
To narrow this down, I tried:
{code:scala}
sc.setCheckpointDir("/tmp/checkpoint")
var pair = sc.parallelize(Array((1L,2L)))
for (i <- 1 to 1000) {
  pair.checkpoint()
  pair = pair.map(_.swap)
}
pair.count()
{code}
And it does overflow, but it's due to serializing the long lineage graph:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task
serialization failed: java.lang.StackOverflowError
java.io.ObjectStreamClass$FieldReflector.getPrimFieldValues(ObjectStreamClass.java:1930)
java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
{code}
However, this loop body works:
{code:scala}
pair.checkpoint()
pair.count()
pair = pair.map(_.swap)
{code}
(Of course, you can call that every 10 or 100 iterations instead.)
(Note that the call to {{checkpoint()}} should happen before other ops.)
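For completeness, here is one way to put those pieces together as a self-contained snippet (a sketch only: the checkpoint directory and the interval of 100 iterations are arbitrary choices, and a {{SparkContext}} {{sc}} is assumed to be in scope, e.g. in {{spark-shell}}):
{code:scala}
sc.setCheckpointDir("/tmp/checkpoint")

var pair = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  if (i % 100 == 0) {
    pair.checkpoint()  // only marks the RDD for checkpointing
    pair.count()       // a cheap action forces the checkpoint and truncates the lineage
  }
  pair = pair.map(_.swap)
}
println(pair.count())
{code}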
So, I think the issue is that RDDs are, of course, lazy, and {{checkpoint()}}
only *marks* an RDD for checkpointing. To actually trigger it, you have to invoke an
action on the RDD. {{count()}} is cheap; the cheapest thing is
{{foreachPartition(p => None)}}. (This is another argument for adding a
{{materialize()}} method, a la https://issues.apache.org/jira/browse/SPARK-6003.)
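To illustrate what such a method could look like (purely a sketch; {{materialize}} is not an existing Spark API, just a name borrowed from the SPARK-6003 discussion):
{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper: force evaluation of an RDD as cheaply as possible.
// Not part of Spark's API; shown only to illustrate the idea.
def materialize[T](rdd: RDD[T]): RDD[T] = {
  rdd.foreachPartition(p => None)  // touch every partition, produce nothing
  rdd
}
{code}
With a helper like that, the loop body above becomes {{pair.checkpoint(); materialize(pair)}}.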
So, I'm resolving just because I'm pretty certain the behavior is by design and
intended to be consistent with how {{persist()}} works. It does require this
formulation above with an explicit request to 'materialize', and that could be
easier.
If anything, the follow-on question is: should persistence methods be eager? But
that's a different question.
> iterative computing with 1000 iterations causes stage failure
> -------------------------------------------------------------
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.lang.StackOverflowError" when
> executing an action with 1000 transformations.
> Here is a code snippet to reproduce the error:
> {code}
> import org.apache.spark.rdd.RDD
> var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}