So you want to sort based on the total count of all the records received
through the receiver? In that case, you have to combine the counts across
batches using updateStateByKey (see
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
).
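Roughly, something like this (a minimal sketch modeled on that example; the
socket source here just stands in for your custom receiver, and the batch
interval and checkpoint directory are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    // Master is assumed to be set via spark-submit.
    val conf = new SparkConf().setAppName("StatefulCounts")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")  // updateStateByKey requires checkpointing

    // Add each batch's new counts for a key to its running total.
    val updateTotals: (Seq[Int], Option[Int]) => Option[Int] =
      (newCounts, total) => Some(newCounts.sum + total.getOrElse(0))

    // Stand-in source; replace with the DStream from your custom receiver.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // totals is a DStream of (key, running total) over all batches so far.
    val totals = counts.updateStateByKey[Int](updateTotals)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}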
But stepping back: if you only want the final results after all the data has
been received (as opposed to continuously), why are you even using
streaming? You could create a custom RDD that reads from Elasticsearch and
then use it in a regular Spark batch program. I think that is more natural,
since your application is batch-like rather than streaming-like: you are not
using the results in real time.
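For example, here is a minimal sketch of the batch approach, assuming the
elasticsearch-hadoop connector (org.elasticsearch.spark); the node address,
index name, query, and "name" field are all placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds esRDD to SparkContext

object TopNamesBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TopNamesBatch")
      .set("es.nodes", "localhost:9200")  // placeholder ES node
    val sc = new SparkContext(conf)

    // esRDD yields (docId, fieldMap) pairs for documents matching the query.
    val docs = sc.esRDD("myindex/mytype", "?q=status:active")

    // Count per name, sort descending, and pull the top 20 into the driver.
    val top20 = docs
      .flatMap { case (_, fields) => fields.get("name") }
      .map(name => (name.toString, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(20)  // a plain local Array in the driver program

    top20.foreach(println)
    sc.stop()
  }
}

This gives you the top 20 directly in the driver, once, with no streaming
context to shut down.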

TD

On Mon, Apr 6, 2015 at 12:31 PM, Hari Polisetty <hpoli...@icloud.com> wrote:

> I have created a custom receiver to fetch records pertaining to a specific
> query from Elasticsearch and have implemented streaming RDD
> transformations to process the data generated by the receiver.
>
> The final RDD is a sorted list of name-value pairs, and I want to read the
> top 20 results programmatically rather than write them to an external file.
> I use "foreach" on the RDD and take the top 20 values into a list. I see
> that foreach is processed every time there is a new micro-batch from the
> receiver.
>
> However, I want the foreach computation to be done only once, when the
> receiver has finished fetching all the records from Elasticsearch and
> before the streaming context is killed, so that I can populate the results
> into a list and process it in my driver program.
>
> Appreciate any guidance in this regard.
