GitHub user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
@rxin So that RDD could not exist with a known partitioner. Whether it is
range-based or hash-based, the partitioner must be deterministic, so two elements
with the same key must go to the same partition, and two elements that do not
have the same key cannot be duplicates of each other. Distinct treats the whole
input k/v pair as one element, not just v (e.g. an RDD of `[(1, 2), (2, 2), (2,
2)].distinct()` should produce `[(1, 2), (2, 2)]`).
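
To make the argument concrete, here is a minimal plain-Python sketch (not Spark code, and not the implementation in this PR). The helpers `partition_by_key` and `distinct_within_partitions` are hypothetical names, and `hash(k) % num_partitions` is only an assumed stand-in for a deterministic partitioner. It illustrates that when placement depends on the key alone, deduplicating whole `(k, v)` pairs inside each partition gives the same result as a global distinct:

```python
def partition_by_key(pairs, num_partitions):
    """Place each (k, v) pair using a deterministic function of the key only."""
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        # Assumed stand-in for a hash partitioner: same key, same partition.
        partitions[hash(k) % num_partitions].append((k, v))
    return partitions


def distinct_within_partitions(partitions):
    """Deduplicate whole (k, v) pairs inside each partition, without a shuffle."""
    result = []
    for part in partitions:
        seen = set()
        for pair in part:
            if pair not in seen:
                seen.add(pair)
                result.append(pair)
    return result


pairs = [(1, 2), (2, 2), (2, 2)]
partitions = partition_by_key(pairs, num_partitions=2)
local_distinct = distinct_within_partitions(partitions)

# Both copies of (2, 2) share a key, so they land in the same partition
# and the per-partition pass is enough to collapse them.
print(sorted(local_distinct))  # [(1, 2), (2, 2)]
```

The comparison is on the full pair, so `(1, 2)` and `(2, 2)` are kept as separate elements even though their values match.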