Most of the RDD methods that shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature.
Should I open a PR for that? The proposed overload would look like this:
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
}
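
For comparison, the same effect can already be had from user code through the public pair-RDD API. The following is only a minimal sketch: `DistinctSketch` and `distinctWith` are hypothetical names used for illustration and are not part of Spark.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DistinctSketch {
  // Hypothetical helper (not a Spark API): de-duplicates `rdd` and
  // shuffles with the caller-supplied partitioner.
  def distinctWith[T: ClassTag](rdd: RDD[T], partitioner: Partitioner): RDD[T] =
    rdd
      .map(x => (x, null))                 // pair every element with a dummy value
      .reduceByKey(partitioner, (x, _) => x) // keep one pair per key, shuffling via `partitioner`
      .map(_._1)                           // drop the dummy value again
}
```

Usage would be something like `DistinctSketch.distinctWith(rdd, new HashPartitioner(8))`. Note that the final `map` does not preserve partitioning, so the result has the partitioner's number of partitions but its `partitioner` field is not set.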
