[
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495914#comment-14495914
]
Jack Hu commented on SPARK-6847:
--------------------------------
I did a little more investigation into this issue. It appears to be a
problem with operations that must be check-pointed ({{updateStateByKey}},
{{reduceByKeyAndWindow}} with an inverse-reduce function) that are followed by
an operation with a checkpoint (either added manually, as in the code in this
JIRA description, or another operation that must be check-pointed), where the
checkpoint intervals of the two operations are the same (or the downstream
operation's checkpoint interval equals the batch interval).
The following code will have this issue (assume the batch interval is 2
seconds and the default checkpoint interval is 10 seconds):
# {{source.updateStateByKey(func).map(f).checkpoint(10 seconds)}}
# {{source.updateStateByKey(func).map(f).updateStateByKey(func2)}}
# {{source.updateStateByKey(func).map(f).checkpoint(2 seconds)}}
These DO NOT have this issue:
# {{source.updateStateByKey(func).map(f).checkpoint(4 seconds)}}
# {{source.updateStateByKey(func).map(f).updateStateByKey(func2).checkpoint(4 seconds)}}
Each of these samples generates an RDD graph containing two RDDs that need to
be check-pointed.
If the child RDD(s) always need to checkpoint at the same time as the parent,
then the parent never does its checkpoint, per {{rdd.doCheckpoint}} (the walk
stops at the first RDD marked for checkpointing and never reaches its
parents). In that case the RDD produced by {{updateStateByKey}} is never
check-pointed in the issued sample code, which leads to the stack overflow
({{updateStateByKey}} needs checkpointing to break the dependency chain this
operation keeps growing).
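The shadowing effect can be shown with a toy model (this is not Spark's actual code, just a sketch of the traversal rule described above: a node marked for checkpointing checkpoints itself and stops, so a marked child hides a marked parent):

```scala
// Toy model of the RDD.doCheckpoint walk: only unmarked nodes recurse
// into their parent; a marked node checkpoints itself and stops descending.
class Node(val name: String, val parent: Option[Node], val marked: Boolean) {
  var checkpointed = false
  def doCheckpoint(): Unit =
    if (marked) checkpointed = true
    else parent.foreach(_.doCheckpoint())
}

// stateRdd stands for the RDD produced by updateStateByKey (must checkpoint);
// mappedRdd stands for map(f) with checkpoint() at the same interval, so on
// every checkpointing batch both are marked at once.
val stateRdd  = new Node("updateStateByKey", None, marked = true)
val mappedRdd = new Node("map+checkpoint", Some(stateRdd), marked = true)

mappedRdd.doCheckpoint()
println(s"state checkpointed: ${stateRdd.checkpointed}") // the parent is never reached
```

Since both RDDs are due on the same batches, the walk always terminates at the child, and the parent's lineage is never truncated.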
If the child RDD(s) are not always check-pointed at the same time as the
parent, then the parent RDD (the one from {{updateStateByKey}}) gets a chance
to complete a checkpoint and break the dependency chain, though possibly with
some delay, so no stack overflow happens.
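A minimal arithmetic sketch (again not Spark code, just the schedule implied by the intervals above) shows why staggered intervals leave batches where only the parent is due, so it actually gets check-pointed:

```scala
// Count the batches where the parent is due for checkpoint but the child is
// not, i.e. the batches where the traversal reaches the parent.
// Defaults model the examples above: 2s batches, parent due every 10s.
def parentCheckpointCount(childInterval: Int, parentInterval: Int = 10,
                          batch: Int = 2, horizon: Int = 60): Int =
  (batch to horizon by batch).count { t =>
    val childDue  = t % childInterval == 0
    val parentDue = t % parentInterval == 0
    parentDue && !childDue // parent checkpoints only when the child does not shadow it
  }

println(parentCheckpointCount(childInterval = 10)) // 0: parent starved, stack overflow case
println(parentCheckpointCount(childInterval = 4))  // 3: lineage broken at t = 10, 30, 50
```

With equal intervals the child is due on every batch the parent is due, so the count is zero; with a 4-second child interval the parent slips through every other due batch.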
So, for now, we have a workaround for this issue: set different checkpoint
intervals on operations that must be check-pointed in a streaming project.
This may not be an easy fix in Spark itself; at a minimum we should add some
validation.
> Stack overflow on updateStateByKey which followed by a dstream with
> checkpoint set
> ----------------------------------------------------------------------------------
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.3.0
> Reporter: Jack Hu
> Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code: uses {{updateStateByKey}}
> followed by a {{map}} with checkpoint interval 10 seconds
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", 9999)
> val updatedResult = source
>   .map((1, _))
>   .updateStateByKey((newlist: Seq[String], oldstate: Option[String]) =>
>     newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
>   .checkpoint(Seconds(10))
>   .foreachRDD((rdd, t) => {
>     println("Deep: " + rdd.toDebugString.split("\n").length)
>     println(t.toString() + ": " + rdd.collect.length)
>   })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency chain keeps growing over
> time, the {{updateStateByKey}} RDD never gets check-pointed, and eventually
> the stack overflow happens.
> Note:
> * The RDD in {{updatedResult.map(_._2)}} does get check-pointed in this
> case, but the {{updateStateByKey}} RDD does not
> * If {{checkpoint(Seconds(10))}} is removed from the map result
> ({{updatedResult.map(_._2)}}), the stack overflow does not happen
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)