[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905250#comment-16905250
 ] 

Yuanjian Li edited comment on SPARK-28699 at 8/20/19 4:09 AM:
--------------------------------------------------------------

-The current [approach|https://github.com/apache/spark/pull/25420] is just a 
bandage fix for returning the wrong answer.-

After further investigation, we found that this bug has nothing to do with the 
cache operation. We therefore focused on the sort + shuffle itself and finally 
found that the root cause is an incorrect use of radix sort.

In the original logic, radix sort is enabled based solely on the config, and it 
is then used to compare the binary row data. That may be OK for a dataset that 
has only one numeric column, but in this case the binary format produced by the 
“map { x => (x % 1000, x) }” transformation can’t be sorted by radix sort.
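To illustrate the kind of failure described above, here is a minimal sketch in plain Scala, not Spark's sorter. It assumes the problem is that a sort which looks only at a fixed-width prefix cannot order rows that share that prefix. The `prefix` function and the `PrefixSortSketch` object are hypothetical and do not reflect how Spark actually computes its sort prefix.

```scala
// Illustrative only: plain Scala collections, not Spark's radix sort.
object PrefixSortSketch {
  // Rows shaped like the output of map { x => (x % 1000, x) }.
  type Record = (Long, Long)

  // Hypothetical fixed-width prefix: only the first field is visible to the sort.
  def prefix(r: Record): Long = r._1

  // Sort on the prefix alone. Rows with equal prefixes keep their input order,
  // so the result depends on the order in which rows arrived.
  def sortByPrefixOnly(rs: Seq[Record]): Seq[Record] = rs.sortBy(prefix)

  // Sort on the prefix and break ties with a full comparison of the row.
  // The result is the same for every input order.
  def sortWithTieBreak(rs: Seq[Record]): Seq[Record] = rs.sortBy(r => (r._1, r._2))

  def main(args: Array[String]): Unit = {
    val attempt1 = Seq((1L, 1001L), (1L, 1L), (0L, 1000L))
    // Same rows as attempt1, arriving in a different order (e.g. after a rerun).
    val attempt2 = Seq((1L, 1L), (1L, 1001L), (0L, 1000L))

    println(sortByPrefixOnly(attempt1) == sortByPrefixOnly(attempt2)) // prints false
    println(sortWithTieBreak(attempt1) == sortWithTieBreak(attempt2)) // prints true
  }
}
```

With a single numeric column the prefix can carry the whole value, which is why that case may be fine; once rows can differ while sharing a prefix, the sorted order is no longer determined by the data alone.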

After the fix in [https://github.com/apache/spark/pull/25491] all tests passed 
with the right answer.

Also, a corner case in DAGScheduler that we found during testing is fixed 
separately in [https://github.com/apache/spark/pull/25491].

Once the work on indeterminate stage rerunning (SPARK-25341) is finished, we can 
fix this by unpersisting the original RDD and rerunning the cached 
indeterminate stage. A preview codebase is available 
[here|https://github.com/xuanyuanking/spark/tree/SPARK-28699-RERUN].
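As a rough illustration of why an indeterminate stage has to be rerun as a whole, the sketch below uses plain Scala collections rather than Spark. It assumes a simplified round-robin assignment (row i goes to partition i % n); the `RoundRobinSketch` object and its `roundRobin` helper are hypothetical and are not Spark's implementation.

```scala
// Illustrative only: a simplified round-robin partitioner, not Spark's.
object RoundRobinSketch {
  // Row i goes to partition i % numPartitions, so placement depends on input order.
  def roundRobin[A](rows: Seq[A], numPartitions: Int): Map[Int, Seq[A]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (p, rs) => p -> rs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val firstAttempt = Seq("a", "b", "c", "d")
    // Same rows, but the upstream stage produced them in a different order on rerun.
    val rerun = Seq("b", "a", "c", "d")

    val p1 = roundRobin(firstAttempt, 2) // partition 0: a, c   partition 1: b, d
    val p2 = roundRobin(rerun, 2)        // partition 0: b, c   partition 1: a, d

    // Keep partition 0 from the first attempt, recompute only partition 1.
    val mixed = p1(0) ++ p2(1)
    println(mixed.sorted.mkString(",")) // prints a,a,c,d ("b" is lost, "a" is duplicated)
  }
}
```

Mixing output from two attempts is what produces the wrong answer; rerunning every partition of the indeterminate stage from the same input avoids it.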



> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28699
>                 URL: https://issues.apache.org/jira/browse/SPARK-28699
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Yuanjian Li
>            Priority: Major
>              Labels: correctness
>
> Related to SPARK-23207 and SPARK-23243.
> This is another case of an indeterminate stage/RDD producing incorrect 
> results when a stage rerun happens. In the CachedRDDBuilder, we fail to 
> propagate the `isOrderSensitive` characteristic to the newly created 
> MapPartitionsRDD.
> We can reproduce this with the following code (thanks to Tyson for reporting 
> this!):
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>   // only the first attempt of partition 0 in the first stage attempt fails
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 &&
>       TaskContext.get.stageAttemptNumber == 0) {
>     throw new Exception("pkill -f -n java".!!)
>   }
>   x
> }
> val r2 = df.distinct.count()
> {code}


