Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,
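For reference, a minimal sketch of that lazily instantiated singleton pattern, adapted from the linked guide to hold a cached RDD rather than a broadcast variable; the lookup path and parsing below are hypothetical placeholders:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CachedRDDHolder {
  @volatile private var instance: RDD[(String, String)] = null

  def getInstance(sc: SparkContext): RDD[(String, String)] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Hypothetical loader; build the lookup RDD however you normally would.
          instance = sc.textFile("/path/to/lookup")
            .map { line => val parts = line.split(","); (parts(0), parts(1)) }
            .cache()
        }
      }
    }
    instance
  }
}

Calling CachedRDDHolder.getInstance(rdd.sparkContext) inside foreachRDD re-creates the RDD against the live SparkContext after recovery from a checkpoint, instead of capturing a stale reference in the closure.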

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
and my cached RDD is not small. If it were, maybe I could materialize and broadcast it. Thanks On Tue, Feb 7, 2017 at 10:28 AM, shyla deshpande wrote: > I have a situation similar to the following and I get SPARK-13758 >

Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
I have a situation similar to the following and I get SPARK-13758. I understand why I get this error, but I want to know what the approach should be for dealing with these situations. Thanks > var cached =
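For context, a hedged sketch of the kind of code that hits SPARK-13758 - an externally created RDD captured in a DStream closure, which carries a stale or null SparkContext after recovery from a checkpoint; all names here are illustrative, not from the original post:

import org.apache.spark.rdd.RDD

// Illustrative setup: 'sc' is the driver's SparkContext and 'dstream' a
// DStream[(String, Int)] defined elsewhere.
var cached: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1))).cache()

dstream.foreachRDD { rdd =>
  // After a restart from a checkpoint, 'cached' is deserialized with a
  // null SparkContext, and this join fails with the misleading error
  // tracked in SPARK-13758.
  rdd.join(cached).count()
}

The lazily instantiated singleton suggested in the reply above avoids this by rebuilding the RDD from the live context on first use.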

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Manish Malhotra
Thanks! On Tue, Nov 15, 2016 at 1:07 AM Takeshi Yamamuro wrote: > - dev > > Hi, > > AFAIK, if you use RDDs only, you can control the partition mapping to some > extent by using a partition key, i.e. RDD[(key, data)]. > A defined partitioner distributes data into partitions

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Takeshi Yamamuro
- dev Hi, AFAIK, if you use RDDs only, you can control the partition mapping to some extent by using a partition key, i.e. RDD[(key, data)]. A defined partitioner distributes data into partitions depending on the key. For a good example of controlling partitions, see the GraphX code;
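To make that concrete, a minimal sketch of pinning keys to partitions with a HashPartitioner; the key type and partition count are illustrative:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical keyed batch: RDD[(sessionKey, event)].
def pinToPartitions(batch: RDD[(String, String)],
                    numPartitions: Int): RDD[(String, String)] = {
  // HashPartitioner routes every record with a given key to the same
  // partition index; reusing the same partitioner across batches keeps
  // that mapping stable, although Spark does not guarantee which
  // executor hosts a given partition from one batch to the next.
  batch.partitionBy(new HashPartitioner(numPartitions))
}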

Re: Spark Streaming: question on sticky session across batches ?

2016-11-14 Thread Manish Malhotra
Sending again. Any help is appreciated! Thanks in advance. On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote: > Hello Spark Devs/Users, > > I'm trying to solve a use case with Spark Streaming 1.6.2 where for every > batch (say 2 mins) data needs to go

Spark Streaming: question on sticky session across batches ?

2016-11-10 Thread Manish Malhotra
Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1.6.2 where for every batch (say 2 mins) data needs to go to the same reducer node after grouping by key. The underlying storage is Cassandra, not HDFS. This is a map-reduce job, where I'm also trying to use the

Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi, The window size in Spark Streaming is time-based, which means we have a different number of elements in each window. For example, suppose you have two streams (there might be more) which are related to each other and you want to compare them over a specific time interval. I am not clear how that will work.
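As a sketch of what aligned time-based windows over two streams could look like - both streams share the window and slide durations, so each pair of windows covers the same wall-clock interval even though the element counts may differ; stream contents and durations are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Hypothetical streams of (timestamp, value) pairs.
def compare(stream1: DStream[(Long, Double)],
            stream2: DStream[(Long, Double)]): DStream[(Long, (Double, Double))] = {
  val w1 = stream1.window(Seconds(30), Seconds(10))
  val w2 = stream2.window(Seconds(30), Seconds(10))
  // transformWith pairs up the two windowed RDDs from the same batch,
  // so the join always compares the same time interval.
  w1.transformWith(w2, (a: RDD[(Long, Double)], b: RDD[(Long, Double)]) => a.join(b))
}

Note that window and slide durations must be multiples of the batch interval.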

Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have the same batch interval, but the batch size ends up different? I.e., two datapoints both arriving at X seconds after streaming starts end up in two different batches? How do you define real-time values for both streams? I am

Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi Yana, Yes, that is what I am saying. I need both streams to be at the same pace. I do have timestamps for each datapoint. There is a way suggested by Tathagata Das in an earlier post where you have a bigger window than required and you fetch your required data from that window based on
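A hedged sketch of that oversized-window-then-filter idea, assuming each datapoint carries its own timestamp in milliseconds; durations and bounds are illustrative:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def sliceByTimestamp(stream: DStream[(Long, Double)],
                     startMs: Long, endMs: Long): DStream[(Long, Double)] = {
  stream
    .window(Seconds(60), Seconds(10))  // deliberately bigger than the interval we need
    .filter { case (ts, _) => ts >= startMs && ts < endMs }  // keep only the required range
}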

Re: spark streaming question

2014-05-05 Thread Tathagata Das
One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates in coarser-grained batches - second-scale massive batches - which reduce per-tuple overheads in shuffles and other kinds of data movement, etc. Note that it is also true that

spark streaming question

2014-05-04 Thread Weide Zhang
Hi, It might be a very general question to ask here, but I'm curious to know why Spark Streaming can achieve better throughput than Storm, as claimed in the Spark Streaming paper. Does it depend on certain use cases and/or data sources? What drives better performance in the Spark Streaming case or in

Re: spark streaming question

2014-05-04 Thread Chris Fregly
great questions, weide. in addition, i'd also like to hear more about how to horizontally scale a spark-streaming cluster. i've gone through the samples (standalone mode) and read the documentation, but it's still not clear to me how to scale this puppy out under high load. i assume i add more
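For what it's worth, the usual receiver-side scale-out pattern is to create several input streams (one receiver each, so they can land on different workers) and union them before processing; a minimal sketch with a placeholder socket source:

import org.apache.spark.streaming.StreamingContext

def scaledInput(ssc: StreamingContext, numReceivers: Int) = {
  val streams = (1 to numReceivers).map { _ =>
    ssc.socketTextStream("stream-host", 9999)  // hypothetical source
  }
  // One unified DStream; downstream parallelism can be raised further
  // with repartition().
  ssc.union(streams).repartition(numReceivers * 4)
}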