This is implemented in MLlib:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41.
-Xiangrui
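For reference, a minimal usage sketch of that MLlib helper (not from the
thread; it assumes spark-mllib is on the classpath and `sc` is an existing
SparkContext):

    // topByKey keeps the `num` largest values per key, by the implicit ordering.
    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("a", 2.0), ("b", 5.0)))
    // RDD[(String, Array[Double])]: each key mapped to its 2 largest values
    val top2 = pairs.topByKey(2)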
On Wed, Jun 10, 2015 at 1:53 PM, erisa wrote:
> Hi,
>
> I am a Spark newbie trying to solve the same problem, and I have
> implemented exactly the solution sowen is suggesting: I am using
> priority queues to keep track of the top 25 sub_categories per
> category, using the combineByKey function to do that.
> However, I run into the following …
You probably want to use combineByKey and create an empty min-queue for
each key. Merge a value into the queue if the queue's size is < K. If the
size is >= K, only merge the value if it exceeds the queue's smallest
element; if so, add it and remove the smallest element.
This gives you an RDD of keys mapped to collections of the top K values
for each key.
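A rough sketch of that approach in Scala (not the exact code from the
thread; the name topKByKey and the choice of scala.collection.mutable.PriorityQueue
are illustrative):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.PriorityQueue

    def topKByKey(rdd: RDD[(String, Double)], k: Int): RDD[(String, List[Double])] = {
      // Reverse the natural ordering so the queue's head is its smallest element.
      val minOrd: Ordering[Double] = Ordering[Double].reverse

      def insert(q: PriorityQueue[Double], v: Double): PriorityQueue[Double] = {
        if (q.size < k) q += v
        else if (v > q.head) { q.dequeue(); q += v }  // evict the current minimum
        q
      }

      rdd.combineByKey(
        (v: Double) => PriorityQueue(v)(minOrd),                // createCombiner
        (q: PriorityQueue[Double], v: Double) => insert(q, v),  // mergeValue
        (q1: PriorityQueue[Double], q2: PriorityQueue[Double]) =>
          q2.foldLeft(q1)(insert)                               // mergeCombiners
      ).mapValues(_.toList.sorted(Ordering[Double].reverse))    // largest first
    }

Calling topKByKey(pairs, 25) would then give the 25 largest values per key
in descending order. Because the queue is bounded at K, each partition only
ever keeps K elements per key before the shuffle.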