Just to narrow it down: the issue seems to be in 'reduceByKey' and derivatives such as 'distinct'.
groupByKey() seems to work:

sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect
res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)), (abe,ArrayBuffer(1)), (bob,ArrayBuffer(1, 1)))

On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
> Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
>
> A minimal example:
>
> case class P(name: String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
> [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
>
> In contrast to the expected behavior, which should be equivalent to:
> sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
>
> Any ideas why this doesn't work?
>
> -kr, Gerard.
>
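For what it's worth, a common explanation for this in Spark 1.0 (I believe tracked as SPARK-2620) is that case classes defined inside the spark-shell REPL are compiled into different wrapper classes on the driver and the executors, so equals/hashCode don't agree across the shuffle and equal keys land in different buckets. One workaround, sketched below, is to shuffle on a plain key (the String field) and rebuild the case class afterwards; the names P, ps, and sc are taken from the example above, and this is an untested sketch for the shell, not a definitive fix:

```scala
// Workaround sketch: avoid a REPL-defined case class as the shuffle key.
// Key by a plain String (stable hashCode everywhere), reduce, then
// reconstruct the case class on the driver side of the shuffle.
case class P(name: String)

val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))

val counts = sc.parallelize(ps)
  .map(p => (p.name, 1))                   // String keys hash consistently
  .reduceByKey(_ + _)                      // now bob's two records merge
  .map { case (name, n) => (P(name), n) }  // rebuild P after the shuffle

counts.collect()
```

Defining P in a compiled jar on the classpath (rather than in the shell) should also avoid the mismatch, since then a single class definition is shared by driver and executors.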