Is the number of top K elements you want to keep small? That is, is K small? If so, you can either:

1. Do it in the driver on the array:

   DStream.foreachRDD { rdd =>
     val topK = rdd.top(K)
     // use topK
   }
2. Or use the top K elements to create another RDD with sc.makeRDD:

   DStream.transform { rdd =>
     val topK = rdd.top(K)
     rdd.context.makeRDD(topK, numPartitions)
   }

TD

On Fri, Mar 13, 2015 at 5:58 AM, Laeeq Ahmed <laeeqsp...@yahoo.com.invalid> wrote:
> Hi,
>
> Earlier my code was as follows, but it was slow due to the repartition. I want the top
> K of each window in a stream.
>
> val counts = keyAndValues.map(x =>
>   math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
> val topCounts = counts.repartition(1).map(_.swap).transform(rdd =>
>   rdd.sortByKey(false)).map(_.swap).mapPartitions(rdd => rdd.take(10))
>
> So I thought to use dstream.transform(rdd => rdd.top(10)), but this returns an
> Array rather than an RDD. I have to perform further steps on the topCounts dstream.
>
> [ERROR] found   : Array[(Long, Long)]
> [ERROR] required: org.apache.spark.rdd.RDD[?]
> [ERROR] val topCounts = counts.transform(rdd => rdd.top(10))
>
> Regards,
> Laeeq
>
> On Friday, March 13, 2015 1:47 PM, Sean Owen <so...@cloudera.com> wrote:
>
> Hm, aren't you able to use the SparkContext here? DStream operations
> happen on the driver. So you can parallelize() the result?
>
> take() won't work, as it's not the same as top().
>
> On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> > Like this?
> >
> > dstream.repartition(1).mapPartitions(it => it.take(5))
> >
> > Thanks
> > Best Regards
> >
> > On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <laeeqsp...@yahoo.com.invalid> wrote:
> >>
> >> Hi,
> >>
> >> I normally use dstream.transform whenever I need methods which are
> >> available in the RDD API but not in the streaming API, e.g.
> >> dstream.transform(x => x.sortByKey(true)).
> >>
> >> But there are other RDD methods which return types other than RDD, e.g.
> >> dstream.transform(x => x.top(5)); top here returns an Array.
> >>
> >> In the second scenario, how can I return an RDD rather than an array, so
> >> that I can perform further steps on the dstream.
> >>
> >> Regards,
> >> Laeeq
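Putting TD's second suggestion together with Laeeq's original pipeline, a minimal sketch of the pattern might look like the following. This assumes the thread's `counts` DStream[(Long, Long)] of (value, count) pairs and a surrounding StreamingContext; the name `topCounts` and K = 10 are taken from the thread, everything else is illustrative:

```scala
// Sketch only: assumes `counts: DStream[(Long, Long)]` of (value, count)
// pairs per window, as produced by countByValueAndWindow in the thread.
val K = 10

// transform must return an RDD, so collect the top K to the driver with
// top() and re-parallelize the small array with makeRDD.
val topCounts = counts.transform { rdd =>
  // top(K) uses the element ordering, so swap to put the count first,
  // take the K largest, then swap back to (value, count).
  val topK = rdd.map(_.swap).top(K).map(_.swap)
  rdd.context.makeRDD(topK, numPartitions = 1)
}

// topCounts is again a DStream, so further DStream operations apply.
topCounts.print()
```

Since the top-K array is small (that was TD's opening question), the driver round-trip per batch is cheap, and it avoids the full `repartition(1)` plus sort of the original pipeline.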
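The key point in both suggestions is that `top(K)` returns the K largest elements as an array, largest first, using the ordering of the element type; for (value, count) pairs that means swapping first so the count drives the comparison. The same logic in plain Scala collections, as a self-contained illustration (the helper name `topKByCount` and the sample data are made up for this sketch, not from the thread):

```scala
// Plain-Scala illustration of what rdd.map(_.swap).top(K).map(_.swap)
// computes: the K pairs with the largest counts, largest count first.
def topKByCount(pairs: Seq[(Long, Long)], k: Int): Seq[(Long, Long)] =
  pairs
    .map(_.swap)                            // (value, count) -> (count, value)
    .sorted(Ordering[(Long, Long)].reverse) // largest count (then value) first
    .take(k)
    .map(_.swap)                            // back to (value, count)

val sample = Seq((7L, 3L), (2L, 9L), (5L, 1L), (4L, 9L))
val top2 = topKByCount(sample, 2)
// top2 == Seq((4L, 9L), (2L, 9L)): ties on count fall back to the value component
```

This also shows why Akhil's `mapPartitions(it => it.take(5))` is not equivalent: `take` keeps the first elements in partition order, while `top` keeps the largest by ordering, which is what Sean pointed out.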