[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622
 ] 

Matei Zaharia commented on SPARK-3098:
--------------------------------------

It's true that the ordering of values after a shuffle is nondeterministic, so 
that for example on failure you might get a different order of keys in a 
reduceByKey or distinct or operations like that. However, I think that's the 
way it should be (and we can document it). RDDs are deterministic when viewed 
as a multiset, but not when viewed as an ordered collection, unless you do 
sortByKey. Operations like zipWithIndex are meant to be more of a convenience 
to get unique IDs or act on something with a known ordering (such as a text 
file where you want to know the line numbers). But the freedom to control fetch 
ordering is quite important for performance, especially if you want to have a 
push-based shuffle in the future.

If we wanted to get the same result every time, we could design reduce tasks to 
tell the master the order they fetched stuff in after the first time they ran, 
but even then, notice that it might limit the kind of shuffle mechanisms we 
allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd 
rather not make that guarantee now.

>  In some cases, operation zipWithIndex get a wrong results
> ----------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> The reproduce code:
> {code}
>      val c = sc.parallelize(1 to 7899).flatMap { i =>
>       (1 to 10000).toSeq.map(p => i * 6000 + p)
>     }.distinct().zipWithIndex() 
>     c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to