Looking at PairRDDFunctions.scala:

  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)
      (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    ...
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }

Since aggregateByKey is implemented in terms of combineByKeyWithClassTag, I think the two operations should have similar performance.
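To make the equivalence concrete, here is a small local sketch (plain Scala, no SparkContext needed, using a Seq in place of the RDD) of the per-key semantics both calls share: aggregateByKey folds every value into a zero accumulator, while combineByKey seeds the combiner from the first value and then folds the rest. The names and the local simulation are illustrative only, not Spark's actual execution path.

```scala
object AggVsCombineSketch extends App {
  val kv = Seq((2, "Hi"), (1, "i"), (2, "am"), (1, "a"), (4, "test"), (6, "string"))

  // aggregateByKey-style: fold each value into the zero accumulator per key.
  val viaFold: Map[Int, List[String]] =
    kv.groupBy(_._1).map { case (k, pairs) =>
      k -> pairs.foldLeft(List.empty[String]) { case (acc, (_, w)) => w :: acc }
    }

  // combineByKey-style: create the combiner from the first value, then merge
  // the remaining values into it.
  val viaCombine: Map[Int, List[String]] =
    kv.groupBy(_._1).map { case (k, pairs) =>
      val ws = pairs.map(_._2)
      k -> ws.tail.foldLeft(List(ws.head)) { (acc, w) => w :: acc }
    }

  // Both strategies produce the same grouping.
  assert(viaFold == viaCombine)
  println(viaFold)
}
```

On an actual RDD, both operators also share the map-side combine machinery, which is why neither should have a performance edge here.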

Cheers

On Tue, Jan 5, 2016 at 1:13 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hi all
>  I have the following dataset
> kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)]
>
> It's a simple list of tuples containing (word_length, word)
>
> What I wanted to do was to group the result by key in order to have a
> result in the form
>
> [(word_length_1, [word1, word2, word3]), (word_length_2, [word4, word5,
> word6])]
>
> so I browsed the Spark API and was able to get the result I wanted using
> two different functions.
>
> scala> kv.combineByKey(List(_),
>      |   (x: List[String], y: String) => y :: x,
>      |   (x: List[String], y: List[String]) => x ::: y).collect()
>
> res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)),
> (4,List(test)), (6,List(string)))
>
> and
>
> scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
>      |   (acc1, acc2) => acc1 ::: acc2).collect()
>
> res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)),
> (4,List(test)), (6,List(string)))
>
> Now, the question is: is there any advantage to using one over the other?
> Am I somehow misusing the API for what I want to do?
>
> kind regards
>  marco
