[ https://issues.apache.org/jira/browse/SPARK-42582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693579#comment-17693579 ]
Tengfei Huang edited comment on SPARK-42582 at 2/26/23 3:27 AM:
----------------------------------------------------------------

This is also discussed in PR: https://github.com/apache/spark/pull/39459

cc [~mridulm80] cc [~Ngone51]

Created this ticket to track the issue of inconsistent persisted RDD blocks.

was (Author: ivoson):
This is also discussed in PR: https://github.com/apache/spark/pull/39459

> Persisted RDD blocks can be inconsistent if the RDD computation is indeterminate
> --------------------------------------------------------------------------------
>
> Key: SPARK-42582
> URL: https://issues.apache.org/jira/browse/SPARK-42582
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.2
> Reporter: Tengfei Huang
> Priority: Major
>
> When an RDD includes indeterminate operations, its results can differ
> each time it is recomputed.
> When such an RDD is cached, multiple replicas of the same RDD block may
> hold different data. Here is an example:
> 1. Task A generates the RDD block rdd_1_1 on executor E1;
> 2. Task B on executor E2 tries to fetch rdd_1_1 remotely from E1 but fails,
> so it computes and caches another copy of the block on E2.
> If the results on E1 and E2 differ, there will be 2 blocks for the same
> RDD partition with different data.
> The behavior in such cases is unexpected.
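
The scenario in the issue description can be sketched without Spark. The following is a minimal simulation in plain Python, not Spark's BlockManager code: `Executor`, `compute_partition`, `get_or_compute`, and the `fetch_ok` flag are hypothetical names introduced only for illustration, and the random seeds stand in for whatever makes the real computation indeterminate.

```python
import random


class Executor:
    """Hypothetical stand-in for a Spark executor with a local block cache."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> cached partition data


def compute_partition(rng):
    """Indeterminate computation: the output depends on a random source."""
    return [rng.random() for _ in range(3)]


def get_or_compute(block_id, local, remotes, rng, fetch_ok):
    """Return the block from the local cache or a remote executor, else recompute it."""
    if block_id in local.blocks:
        return local.blocks[block_id]
    if fetch_ok:
        for remote in remotes:
            if block_id in remote.blocks:
                return remote.blocks[block_id]
    # Remote fetch failed or no copy exists: recompute and cache locally.
    data = compute_partition(rng)
    local.blocks[block_id] = data
    return data


e1, e2 = Executor("E1"), Executor("E2")

# Task A computes and caches rdd_1_1 on E1.
get_or_compute("rdd_1_1", e1, [e2], random.Random(1), fetch_ok=True)

# Task B on E2 fails to fetch from E1, so it recomputes and caches its own copy.
get_or_compute("rdd_1_1", e2, [e1], random.Random(2), fetch_ok=False)

# Both executors now hold a block with the same id; compare their contents.
print(e1.blocks["rdd_1_1"] == e2.blocks["rdd_1_1"])
```

In this sketch both executors end up caching a block named rdd_1_1, but the two copies hold different values, so a later reader sees different data depending on which replica it happens to hit.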