Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21698
Sorry for coming in late on this; the first I saw of it was the other day.
Could someone summarize the discussion here, exactly when this happens, and
why? Checkpointing was mentioned as a way to work around the issue; why does
that help? It would be good to add those details to the jira anyway.
My initial reaction is that this is very bad. Any correctness issue we cause
when handling failures is not something we should write off and expect the
user to handle.
repartition seems to be the most obvious case, and I know lots of people use
it (although hopefully many are using the dataframe api), and we see fetch
failures on large jobs all the time, so this seems really serious.
Trying a similar example to the one listed in jira SPARK-23207, but with an
RDD, doesn't reproduce this:
```
import scala.sys.process._
import org.apache.spark.TaskContext
val res = sc.parallelize(0 to (1000000-1), 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
```
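For my own understanding, here is a rough sketch of what I assume the
checkpoint workaround mentioned above looks like: materializing the RDD
before the second repartition so that a retry after a fetch failure
recomputes from the checkpointed data instead of re-running the
non-deterministic shuffle. The checkpoint directory is just a placeholder
path, and the failure injection is the same as in the repro above.
```
import scala.sys.process._
import org.apache.spark.TaskContext

// Placeholder checkpoint directory; on a real cluster this would be an HDFS path.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val upstream = sc.parallelize(0 to (1000000-1), 1).repartition(200).map { x =>
  x
}
upstream.checkpoint()  // mark for checkpointing before the next repartition
upstream.count()       // force the checkpoint to actually be written

// Same failure injection as above; with the checkpoint in place, retried tasks
// should read stable checkpointed data rather than regenerated shuffle output.
val res = upstream.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
```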