Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Put the queue in a static variable that is first referenced on the workers (inside an RDD closure). That way it will be created on each of the workers, not on the driver. The easiest way to do that is with a lazy val in a companion object.
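The lazy-val pattern Cody describes can be sketched in plain Scala (the object and field names are illustrative, not from the thread). Because a `lazy val` is initialized on first reference, touching it only inside an RDD closure means each executor JVM builds its own copy:

```scala
import scala.collection.mutable

// Illustrative sketch of the pattern: a lazy val in a companion-style
// object is created once per JVM, on first access. If the first access
// happens inside an RDD closure, each executor gets its own queue and
// the driver never instantiates one.
object ReservoirHolder {
  lazy val reservoir: mutable.Queue[String] = mutable.Queue.empty[String]
}
```

On Spark, the first reference would then happen inside something like `rdd.foreachPartition { iter => ... ReservoirHolder.reservoir ... }`, keeping the state worker-local.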

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How to do that? If I put the queue inside a .transform operation, it doesn't work.

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Can you keep a queue per executor in memory?

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
Hi Cody and all, Thank you for your answer. I implemented simple random sampling (SRS) for DStream using the transform method, and it works fine. However, I have a problem implementing reservoir sampling (RS). In RS, I need to maintain a reservoir (a queue) to store selected data items (RDDs). If I
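For reference, classic reservoir sampling (Algorithm R) keeps a uniform k-item sample of a stream in one pass. A minimal plain-Scala sketch, independent of Spark and not the code under discussion in the thread:

```scala
import scala.util.Random

// Algorithm R: after n items have been seen, each item is in the
// reservoir with probability k / n. The class name is illustrative.
class Reservoir[T](k: Int, rng: Random = new Random()) {
  private val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  private var seen = 0L

  def offer(item: T): Unit = {
    seen += 1
    if (buf.size < k) {
      buf += item // fill the reservoir first
    } else {
      // replace a uniformly chosen slot with probability k / seen
      val j = (rng.nextDouble() * seen).toLong
      if (j < k) buf(j.toInt) = item
    }
  }

  def sample: Seq[T] = buf.toSeq
}
```

The open question in the thread is where this state lives: as Cody notes, the reservoir must be held per executor, not on the driver.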

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
With most stream systems, you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering most

sampling operation for DStream

2016-07-29 Thread Martin Le
Hi all, I have to handle a high-rate data stream. To reduce the heavy load, I want to apply sampling techniques to each stream window, i.e., process a subset of the data instead of the whole window's data. I saw that Spark supports sampling operations for RDDs, but for DStream, Spark supports
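For context, Spark's `RDD.sample(withReplacement = false, fraction)` performs per-element Bernoulli sampling. The core idea, sketched in plain Scala over one window's worth of data (the helper name is illustrative, not a Spark API):

```scala
import scala.util.Random

// Bernoulli sampling: keep each element independently with
// probability `fraction`, the same semantics as RDD.sample
// without replacement.
def bernoulliSample[T](window: Seq[T], fraction: Double,
                       rng: Random = new Random()): Seq[T] =
  window.filter(_ => rng.nextDouble() < fraction)
```

On a DStream this would typically be applied per batch via `stream.transform(rdd => rdd.sample(withReplacement = false, fraction))`, which is the transform-based SRS approach Martin describes later in the thread.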