Yes, your explanation makes sense.

I am trying to disable checkpointing and use that run as a control group to
measure the overhead of checkpointing. My current plan is to create a dummy
CheckpointRDD for this; is there a better way?
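
For context, this is the kind of comparison I have in mind, sketched in plain
Spark core rather than the streaming job itself (the timing helper, dataset
size, and checkpoint directory are just illustrative):

    // assuming sc is an existing SparkContext and HDFS is reachable
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(label + " took " + (System.nanoTime() - start) / 1e6 + " ms")
      result
    }

    // control group: materialize the RDD without checkpointing
    val plain = sc.parallelize(1 to 1000000).map(_ * 2)
    time("no checkpoint") { plain.count() }

    // treatment group: same computation, but checkpointed to HDFS
    val checked = sc.parallelize(1 to 1000000).map(_ * 2)
    checked.checkpoint()
    time("with checkpoint") { checked.count() }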

Besides, I am curious where exactly the StackOverflowError happens. Since
serialization usually happens under the hood, it is hard to pin down. I
suspected the following line, but it does not seem to be the root cause:

    serializedTask = Task.serializeWithDependencies(task, sched.sc.addedFiles,
      sched.sc.addedJars, ser)
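
One way to check whether the lineage itself is the problem might be to
serialize the final RDD directly with plain Java serialization, outside the
scheduler (a rough probe, not the actual failing code path; "rdd" stands for
whatever RDD the job is about to submit):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(rdd)  // walks the entire dependency chain
      println("lineage serialized without overflow")
    } catch {
      case _: StackOverflowError =>
        println("overflow while serializing the RDD lineage itself")
    } finally {
      out.close()
    }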

On Tue, Dec 24, 2013 at 2:33 PM, Tathagata Das
<tathagata.das1...@gmail.com> wrote:

> Hello Dachuan,
>
> RDDs generated by StateDStream are checkpointed because the tree of RDD
> dependencies (i.e. the RDD lineage) can grow indefinitely, as each state RDD
> depends on the state RDD from the previous batch of data. Checkpointing
> saves an RDD to HDFS and cuts off all ties to its parent RDDs (i.e. it
> truncates the lineage). If you do not periodically checkpoint the state
> RDDs, these very large lineages can lead to all sorts of problems. The
> "mustCheckpoint" field ensures that state RDDs are automatically
> checkpointed with some periodicity even if the user does not explicitly
> specify one. Setting mustCheckpoint to false disables this automatic
> checkpointing. I think that is leading to really large lineages, and
> serializing an RDD together with its lineage is causing the stack to
> overflow.
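>
> A minimal sketch of the effect in plain Spark core (the directory, loop
> count, and checkpoint interval are illustrative; assume sc is an existing
> SparkContext):
>
>     sc.setCheckpointDir("hdfs:///tmp/checkpoints")
>     var state = sc.parallelize(Seq(0))
>     for (batch <- 1 to 10000) {
>       state = state.map(_ + 1)  // lineage grows by one step per batch
>       if (batch % 100 == 0) {
>         state.checkpoint()      // save to HDFS and truncate the lineage
>         state.count()           // materialize so the checkpoint takes effect
>       }
>     }
>     // without the periodic checkpoint, serializing `state` walks a
>     // 10000-deep dependency chain and can overflow the stack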
>
> On that note, what are you trying to achieve by setting mustCheckpoint =
> false? Maybe there is another way to accomplish it.
>
> TD
>
>
> On Tue, Dec 24, 2013 at 9:05 AM, Dachuan Huang
> <huan...@cse.ohio-state.edu> wrote:
>
> > Hello, developers,
> >
> > Just out of curiosity, I changed "mustCheckpoint" in StateDStream.scala
> > to default to "false" and ran the StatefulNetworkWordCount.scala example.
> >
> > My input is a server socket feeding data at about 3 MB/s.
> >
> > It reports the following error after some time. The exception trace
> > doesn't mention any Spark code, so I don't know how to nail down the
> > root cause. Can anybody help me with this? Thanks.
> >
> > Exception in thread "DAGScheduler" java.lang.StackOverflowError
> >     at java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233)
> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1532)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> >     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> >     at scala.collection.immutable.$colon$colon.writeObject(List.scala:430)
> >     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> >     ...
