Yes, your explanation makes sense. I am trying to disable checkpointing and use that run as a control group to measure the overhead of checkpointing. I guess I can now create a dummy CheckpointRDD to do this (is there a better way?).
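For reference, here is roughly the kind of comparison I have in mind, as a sketch only; the host, port, checkpoint directory, and intervals below are placeholders rather than my actual setup. The idea would be to leave mustCheckpoint alone and simply stretch the checkpoint interval on the state stream for the control run:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext("local[2]", "CheckpointOverhead", Seconds(1))
ssc.checkpoint("/tmp/streaming-checkpoints")   // placeholder checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder socket source
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Same stateful computation as StatefulNetworkWordCount
val state = wordCounts.updateStateByKey[Int] { (values, prev) =>
  Some(values.sum + prev.getOrElse(0))
}

// Control run: checkpoint very rarely (e.g. every 100 batches);
// measured run: checkpoint frequently (e.g. every 10 batches).
// Compare the per-batch processing times between the two runs.
state.checkpoint(Seconds(100))

state.print()
ssc.start()
ssc.awaitTermination()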
Besides, I am curious about where exactly the StackOverflowError happens; since serialization usually happens under the hood, it is hard to pinpoint. I have tried this line of code, but it does not seem to be the root cause:

serializedTask = Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)

On Tue, Dec 24, 2013 at 2:33 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Hello Dachuan,
>
> RDDs generated by StateDStream are checkpointed because the tree of RDD
> dependencies (i.e. the RDD lineage) can grow indefinitely, as each state RDD
> depends on the state RDD from the previous batch of data. Checkpointing
> saves an RDD to HDFS and cuts off all ties to its parent RDDs (i.e. truncates
> the lineage). If you do not periodically checkpoint the state RDDs, these
> really large lineages can lead to all sorts of problems. The
> "mustCheckpoint" field ensures that state RDDs are automatically
> checkpointed with some periodicity even if the user does not explicitly
> specify one. Setting mustCheckpoint to false disables this automatic
> checkpointing. I think that is leading to really large lineages, and
> serializing the RDD with its lineage is causing the stack to overflow.
>
> On that note, what are you trying to achieve by setting mustCheckpoint =
> false? Maybe there is another way of achieving what you are trying to
> achieve.
>
> TD
>
>
> On Tue, Dec 24, 2013 at 9:05 AM, Dachuan Huang
> <huan...@cse.ohio-state.edu> wrote:
>
> > Hello, developers,
> >
> > Just out of curiosity, I have changed "mustCheckpoint" in
> > StateDStream.scala to "false" by default and run the
> > StatefulNetworkWordCount.scala example.
> >
> > My input is a 3 MB/s ServerSocket.
> >
> > It reports the following error after some time. The exception trace
> > doesn't say anything about the Spark code, so I don't know how to nail
> > down the root cause. Can anybody help me with this? Thanks.
> >
> > Exception in thread "DAGScheduler" java.lang.StackOverflowError
> > at java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233)
> > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1532)
> > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> > at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> > at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> > at scala.collection.immutable.$colon$colon.writeObject(List.scala:430)
> > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> > at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> > at
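P.S. For anyone who wants to see the lineage effect TD describes outside of streaming, here is a small non-streaming sketch; the checkpoint directory, data sizes, and iteration counts are illustrative assumptions only. Each loop iteration makes the new RDD depend on the previous one, so the lineage keeps growing unless checkpoint() is called periodically:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[2]", "lineage-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory

var state = sc.parallelize(1 to 1000).map(x => (x % 10, 1L))
for (i <- 1 to 300) {
  // Simulates one "batch": the new state RDD depends on the previous state RDD.
  state = state.reduceByKey(_ + _)
  if (i % 20 == 0) {
    state.checkpoint()   // truncates the lineage so task serialization stays shallow
    state.count()        // an action is needed to actually materialize the checkpoint
  }
}
println(state.toDebugString)   // without the periodic checkpoint, this lineage grows unbounded
sc.stop()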