Looking at PairRDDFunctions.scala:

  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    ...
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
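
To make that delegation concrete, here is a minimal sketch in plain Scala (no Spark involved) that models a per-key combine over a list of partitions. The names combineByKeyLocal and aggregateByKeyLocal are hypothetical helpers for illustration, not Spark API, and the split of kv into two partitions is an assumption. It only mirrors the shape of the call above: the zero value is folded into the first seqOp call, after which both paths run the same merge logic.

```scala
// Plain-Scala model of a per-key combine; NOT Spark code.
object CombineSketch {

  // Hypothetical stand-in for combineByKey: combine within each partition,
  // then merge the per-partition results.
  def combineByKeyLocal[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within a partition: the first value for a key creates the combiner,
    // later values are merged into it.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v)
          case None           => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }
    // Across partitions: merge the combiners that share a key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Hypothetical stand-in for aggregateByKey, written as a call to the
  // combine helper. The immutable zero value is reused directly here,
  // which is a simplification of createZero() in the quoted source.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] =
    combineByKeyLocal[K, V, U](partitions)(
      (v: V) => seqOp(zeroValue, v), seqOp, combOp)

  // The data from the question, split into two assumed partitions.
  val kv: Seq[Seq[(Int, String)]] = Seq(
    Seq((2, "Hi"), (1, "i"), (2, "am")),
    Seq((1, "a"), (4, "test"), (6, "string")))

  val viaCombine: Map[Int, List[String]] =
    combineByKeyLocal[Int, String, List[String]](kv)(
      List(_), (acc, s) => s :: acc, (a, b) => a ::: b)

  val viaAggregate: Map[Int, List[String]] =
    aggregateByKeyLocal[Int, String, List[String]](kv)(List.empty[String])(
      (acc, s) => s :: acc, (a, b) => a ::: b)

  def main(args: Array[String]): Unit = {
    println(viaCombine)
    println(viaAggregate)
    println(s"same result: ${viaCombine == viaAggregate}")
  }
}
```

In this model the only difference between the two calls is how the first value for a key is turned into an accumulator, so the amount of work per record is the same. The order of elements inside each list depends on how the data is partitioned, so it may differ from the REPL output in the original message.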

I think the two operations should have similar performance.

Cheers

On Tue, Jan 5, 2016 at 1:13 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hi all
> i have the following dataSet
> kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)]
>
> It's a simple list of tuples containing (word_length, word)
>
> What i wanted to do was to group the result by key in order to have a
> result in the form
>
> [(word_length_1, [word1, word2, word3]), (word_length_2, [word4, word5,
> word6])]
>
> so i browsed spark API and was able to get the result i wanted using two
> different functions.
>
> scala> kv.combineByKey(List(_), (x:List[String], y:String) => y :: x,
>          (x:List[String], y:List[String]) => x ::: y).collect()
>
> res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)),
> (4,List(test)), (6,List(string)))
>
> and
>
> scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
>      | (acc1, acc2) => acc1 ::: acc2).collect()
>
> res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)),
> (4,List(test)), (6,List(string)))
>
> Now, question is: any advantages of using one instead of the others?
> Am i somehow misusing the API for what i want to do?
>
> kind regards
> marco