Frankly speaking, I think reduceByKey with a Partitioner has the same problem and should not be exposed to public users either, because it is hard to fully understand how the partitioner behaves without reading the actual code.
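To make my concern concrete, here is a minimal sketch (RandomPartitioner is a made-up name, purely for illustration) of a partitioner that violates the usual contract, namely that getPartition must be a pure, deterministic function of the key so that equal keys land in the same partition:

import scala.util.Random
import org.apache.spark.Partitioner

// A partitioner that breaks the implicit contract: this one ignores the
// key entirely, so records with the same key can be routed to different
// partitions during the shuffle.
class RandomPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = Random.nextInt(parts)
}

// reduceByKey only merges values that land in the same partition, so:
//   sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3)))
//     .reduceByKey(new RandomPartitioner(4), _ + _)
//     .collect()
// may return something like Array(("a", 4), ("a", 2))
// instead of the expected Array(("a", 6)).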
And if there exists a basic contract for a Partitioner, maybe it should be stated explicitly in the documentation if it is not enforced by code. However, I don't feel strongly enough about this issue to argue beyond stating my concern. It will not cause too much trouble anyway once users learn the semantics; it is just a judgement call for the API designer.

> On Jun 9, 2016, at 12:51 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>
> reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.
>
> Why does reduceByKey with a Partitioner exist then?
>
> On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> Hi Alexander,
>
> I think it is not guaranteed to be correct if an arbitrary Partitioner is passed in.
>
> I have created a notebook, and you can check it out:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
>
> Best regards,
>
> Yang
>
>> On Jun 9, 2016, at 11:42 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>> Most of the RDD methods that shuffle data take a Partitioner as a parameter,
>> but rdd.distinct has no such signature.
>>
>> Should I open a PR for that?
>>
>> /**
>>  * Return a new RDD containing the distinct elements in this RDD.
>>  */
>> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
>> }
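P.S. For reference, if the overload quoted above were added, a call site could look like the following sketch. Here sc is assumed to be an existing SparkContext, and HashPartitioner with 8 partitions is an arbitrary choice that does satisfy the contract, since equal elements always hash to the same partition:

import org.apache.spark.HashPartitioner

// With the proposed overload, callers could control the shuffle
// partitioning of distinct directly.
val rdd = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val unique = rdd.distinct(new HashPartitioner(8))
unique.collect()  // Array(1, 2, 3), in some order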