[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622 ]
Matei Zaharia commented on SPARK-3098:
--------------------------------------

It's true that the ordering of values after a shuffle is nondeterministic, so on failure you might, for example, get a different order of keys in a reduceByKey or distinct. However, I think that's the way it should be (and we can document it). RDDs are deterministic when viewed as a multiset, but not when viewed as an ordered collection, unless you do sortByKey. Operations like zipWithIndex are meant more as a convenience to get unique IDs or to act on something with a known ordering (such as a text file where you want to know the line numbers).

But the freedom to control fetch ordering is quite important for performance, especially if you want to have a push-based shuffle in the future. If we wanted to get the same result every time, we could design reduce tasks to tell the master the order they fetched blocks in after the first time they ran, but even then, notice that it might limit the kinds of shuffle mechanisms we allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd rather not make that guarantee now.
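The multiset-vs-ordered-collection distinction above can be sketched in plain Scala (no Spark; the two fetch orders below are hypothetical stand-ins for what two attempts of the same reduce task might see):

```scala
// Sketch: two runs of a shuffle may deliver the same values in different
// orders. As multisets the runs are equal, but zipWithIndex then assigns
// different indices to the same value; sorting first restores determinism.
object ZipOrderDemo {
  def main(args: Array[String]): Unit = {
    val run1 = Seq(3, 1, 2) // hypothetical fetch order, first attempt
    val run2 = Seq(1, 2, 3) // hypothetical fetch order, after a retry

    // Equal when viewed as multisets:
    assert(run1.groupBy(identity) == run2.groupBy(identity))

    // But zipWithIndex pairs the same values with different indices:
    val ids1 = run1.zipWithIndex.toMap
    val ids2 = run2.zipWithIndex.toMap
    assert(ids1 != ids2)

    // A total order (cf. sortByKey) makes the index assignment deterministic:
    assert(run1.sorted.zipWithIndex == run2.sorted.zipWithIndex)
    println("ok")
  }
}
```

This is why, in the reproduce case below, joining an RDD with itself after distinct().zipWithIndex() can pair a value with two different indices: each side of the join may recompute the shuffle with a different fetch order.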
> In some cases, operation zipWithIndex gets wrong results
> ----------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> Code to reproduce:
> {code}
> val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 10000).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex()
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
> =>
> {code}
> Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14)))
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)