Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22010#discussion_r209450199
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -396,7 +396,16 @@ abstract class RDD[T: ClassTag](
        * Return a new RDD containing the distinct elements in this RDD.
        */
   def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    -    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
    +    // If the data is already appropriately partitioned with a known partitioner we can work locally.
    +    def removeDuplicatesInPartition(itr: Iterator[T]): Iterator[T] = {
    +      val set = new mutable.HashSet[T]()
    +      itr.filter(set.add(_))
    --- End diff --
    
    Yes, it is not a big deal, but if you check the implementation in the Scala standard library, you can see that the hash and the index for the key are computed even though they are not needed (since `addElem` is called anyway). It probably doesn't change much, but we could save that computation...
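
    For context, a minimal standalone sketch (outside Spark, with a hypothetical `DedupSketch` object) of the per-partition dedup idiom from the diff. `mutable.HashSet.add` returns `true` only on the first insertion of an element, so `filter` lazily keeps the first occurrence and drops later duplicates:

    ```scala
    import scala.collection.mutable

    object DedupSketch {
      // Mirrors the helper in the diff: dedup the elements seen by one
      // partition's iterator, keeping the first occurrence of each.
      def removeDuplicatesInPartition[T](itr: Iterator[T]): Iterator[T] = {
        val set = new mutable.HashSet[T]()
        // set.add returns false for an element already present,
        // so filter drops every repeat.
        itr.filter(set.add(_))
      }

      def main(args: Array[String]): Unit = {
        val out = removeDuplicatesInPartition(Iterator(1, 2, 2, 3, 1)).toList
        println(out) // List(1, 2, 3)
      }
    }
    ```

    The trade-off under discussion is inside `add` itself: it must compute the element's hash and bucket index on every call, even when the element is already present and nothing is inserted.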


---
