[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737252#comment-14737252 ]

Sean Owen commented on SPARK-10493:
-----------------------------------

What do you mean that it's not collapsing key pairs? The output of temp5 shows 
the same keys and the same count in both cases, and the keys are distinct and 
in order after {{temp5.sortByKey(true).collect().foreach(println)}}.
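
One quick way to double-check that the keys really are unique after the reduceByKey (assuming temp5 here is the pair RDD produced by the reduceByKey in the attached reduceByKey_example_001.scala) is to compare the total count with the distinct-key count:

{code}
// Hypothetical check, assuming temp5 is the (key, value) RDD left by the
// reduceByKey in reduceByKey_example_001.scala. If reduceByKey collapsed each
// key to a single record, these three counts should all agree.
val total = temp5.count()
val distinctKeys = temp5.keys.distinct().count()
val distinctRows = temp5.distinct().count()
println(s"total=$total distinctKeys=$distinctKeys distinctRows=$distinctRows")
{code}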

Here's my simplistic test case, which gives a consistent count when I run the 
code from the issue description on it:

{code}
val bWords = sc.broadcast(sc.textFile("/usr/share/dict/words").collect())

val tempRDD1 = sc.parallelize(1 to 10000000, 10).mapPartitionsWithIndex { (i, ns) =>
  val words = bWords.value
  val random = new scala.util.Random(i)
  ns.map { n => 
    val a = words(random.nextInt(words.length))
    val b = words(random.nextInt(words.length))
    val c = words(random.nextInt(words.length))
    val d = random.nextInt(words.length)
    val e = random.nextInt(words.length)
    val f = random.nextInt(words.length)
    val g = random.nextInt(words.length)
    ((a, b), (c, d, e, f, g))
  }
}

val tempRDD2 = sc.parallelize(1 to 10000000, 10).mapPartitionsWithIndex { (i, ns) =>
  val words = bWords.value
  val random = new scala.util.Random(i)
  ns.map { n => 
    val a = words(random.nextInt(words.length))
    val b = words(random.nextInt(words.length))
    val c = words(random.nextInt(words.length))
    val d = random.nextInt(words.length)
    val e = random.nextInt(words.length)
    val f = random.nextInt(words.length)
    val g = random.nextInt(words.length)
    ((a, b), (c, d, e, f, g))
  }
}
{code}
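
For reference, here is roughly how these test RDDs plug into the code from the issue description (numPartitions is just an arbitrary value I picked for this test; the reduce function is the one quoted below):

{code}
import org.apache.spark.HashPartitioner

// Arbitrary partition count for this test run.
val numPartitions = 10

val rdd3 = tempRDD1.
  zipPartitions(tempRDD2, true)((iter, iter2) => iter ++ iter2).
  partitionBy(new HashPartitioner(numPartitions)).
  reduceByKey((a, b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
    math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))

// If reduceByKey leaves exactly one record per key, adding distinct should
// not change the count.
println(rdd3.count)
println(rdd3.distinct.count)
{code}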

> reduceByKey not returning distinct results
> ------------------------------------------
>
>                 Key: SPARK-10493
>                 URL: https://issues.apache.org/jira/browse/SPARK-10493
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Glenn Strycker
>         Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), 
> (K, V2), (K, V3), I expect the results after reduceByKey to be just 
> (K, f(V1,V2,V3)), where the function f is appropriately associative, 
> commutative, etc. Therefore, the results after reduceByKey ought to be 
> distinct, correct? I am running counts of my RDD and finding that adding an 
> additional .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> val rdd3 = tempRDD1.
>    zipPartitions(tempRDD2, true)((iter, iter2) => iter ++ iter2).
>    partitionBy(new HashPartitioner(numPartitions)).
>    reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
>      math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> val rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other operations in my actual code 
> that I did not paste here; I can paste my actual code if that would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results


