[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-5499.
------------------------------
    Resolution: Not a Problem

To narrow this down, I tried:

{code:scala}
sc.setCheckpointDir("/tmp/checkpoint")
var pair = sc.parallelize(Array((1L,2L)))
for (i <- 1 to 1000) {
  pair.checkpoint()
  pair = pair.map(_.swap)
}
pair.count()
{code}

And it does overflow, but it's because serializing the task means serializing 
the whole, deep lineage graph:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
serialization failed: java.lang.StackOverflowError
java.io.ObjectStreamClass$FieldReflector.getPrimFieldValues(ObjectStreamClass.java:1930)
java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
{code}
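
For what it's worth, the growing lineage is easy to see with {{toDebugString}}. 
This is just an illustration of why the serialized graph gets so deep, not part 
of the fix:

{code:scala}
// Illustration: each map() adds one level to the lineage, so after N
// iterations the object graph that task serialization has to walk is
// N levels deep.
var pair = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 100) {
  pair = pair.map(_.swap)
}
// Prints a lineage roughly 100 RDDs deep on top of the original parallelized RDD.
println(pair.toDebugString)
{code}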

However, this loop body works:

{code:scala}
  pair.checkpoint()
  pair.count()
  pair = pair.map(_.swap)
{code}

(Of course, you can checkpoint every 10 or 100 iterations instead of every 
iteration.)
(Note that the call to {{checkpoint()}} has to happen before any job has run 
on the RDD.)
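
For example, a sketch of the every-100-iterations variant (the interval of 100 
is arbitrary):

{code:scala}
sc.setCheckpointDir("/tmp/checkpoint")
var pair = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  if (i % 100 == 0) {
    pair.checkpoint()  // mark the current RDD before anything else runs on it
    pair.count()       // cheap action that writes the checkpoint and truncates the lineage
  }
  pair = pair.map(_.swap)
}
println(pair.count())
{code}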

So, I think the issue is that RDDs are of course lazy, and {{checkpoint()}} 
only *marks* an RDD for checkpointing. To actually materialize it, you have to 
invoke an action on it. {{count()}} is cheap; the cheapest thing is 
{{foreachPartition(p => None)}}. (This is another argument for adding a 
{{materialize()}} method, a la https://issues.apache.org/jira/browse/SPARK-6003.)
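
In other words, something like this hypothetical helper ({{materialize}} here 
is just a name for illustration, not an existing Spark API) is the minimal way 
to force it:

{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark's API: run the cheapest possible
// action so that a prior checkpoint() (or persist()) actually takes effect.
def materialize[T](rdd: RDD[T]): Unit =
  rdd.foreachPartition(_ => ())

// Usage in the loop body:
//   pair.checkpoint()
//   materialize(pair)
//   pair = pair.map(_.swap)
{code}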

So, I'm resolving this because I'm fairly certain the behavior is by design and 
intended to be consistent with how {{persist()}} works. It does require the 
formulation above, with an explicit request to 'materialize', and that could be 
easier.

If anything, the follow-on issue is: should persistence methods be eager? But 
that's a different question.

> iterative computing with 1000 iterations causes stage failure
> -------------------------------------------------------------
>
>                 Key: SPARK-5499
>                 URL: https://issues.apache.org/jira/browse/SPARK-5499
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action after 1000 transformations.
> Here is a code snippet to reproduce the error:
> {code}
> import org.apache.spark.rdd.RDD
>
> var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}


