Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/22010
  
    @rxin So that RDD could not exist with a known partitioner. Whether it is
range-based or hash-based, the partitioner must be deterministic, so two elements
with the same key must go to the same partition, and two elements that do not
have the same key cannot be duplicates of each other. Distinct treats each input
k/v pair as one element, not just the value (e.g. an RDD of
`[(1, 2), (2, 2), (2, 2)].distinct()` should produce `[(1, 2), (2, 2)]`).
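
    To make the argument concrete, here is a rough pure-Python sketch, not Spark
code. `partition_by_key` and `distinct_per_partition` are hypothetical helpers,
and `hash(key) % num_partitions` is only a stand-in for a deterministic
partitioner. Assuming elements are placed by key, deduplicating inside each
partition gives the same result as a global distinct:

```python
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Toy stand-in for a deterministic, key-based partitioner (an assumption)."""
    partitions = defaultdict(list)
    for key, value in pairs:
        # Equal keys always map to the same partition index.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def distinct_per_partition(partitions):
    """Deduplicate whole (key, value) elements within each partition, no shuffle."""
    result = []
    for index in sorted(partitions):
        seen = set()
        for pair in partitions[index]:
            if pair not in seen:
                seen.add(pair)
                result.append(pair)
    return result


pairs = [(1, 2), (2, 2), (2, 2)]
local = distinct_per_partition(partition_by_key(pairs, num_partitions=2))

# Per-partition dedup matches a global distinct over the (key, value) pairs.
assert sorted(local) == sorted(set(pairs))
print(sorted(local))  # [(1, 2), (2, 2)]
```

Two equal pairs necessarily share a key, so they land in the same partition and the local `seen` set catches them; no comparison across partitions is needed.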



---
