Spark Streaming questions, just 2

2017-03-21 Thread shyla deshpande
Hello all,
I have a couple of spark streaming questions. Thanks.

1.  In the case of stateful operations, the data is, by default,
persisted in memory.
Does "in memory" mean MEMORY_ONLY? When is it removed from memory?

2.   I do not see any documentation for spark.cleaner.ttl. Is this no
longer necessary? (SPARK-7689)
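
For reference, a minimal sketch (Spark 2.x Java API; the checkpoint path and socket
source are hypothetical) of a stateful stream where the storage level of the state
DStream is set explicitly instead of relying on the default, which, as far as I can
tell, is serialized in-memory storage rather than plain MEMORY_ONLY:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.List;

public class StatefulStorageSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stateful-storage-sketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Checkpointing is mandatory for stateful operations; it also truncates the
        // lineage of the state RDDs so old per-batch data can be cleaned up automatically.
        jssc.checkpoint("/tmp/streaming-checkpoint");   // hypothetical path

        JavaPairDStream<String, Integer> ones = jssc
                .socketTextStream("localhost", 9999)    // hypothetical source
                .mapToPair(word -> new Tuple2<>(word, 1));

        // Running count per key. The resulting state DStream is kept in memory
        // between batches (serialized by default, as far as I can tell).
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
                (values, current) -> {
                    int sum = current.orElse(0);
                    for (Integer v : values) sum += v;
                    return Optional.of(sum);
                };
        JavaPairDStream<String, Integer> counts = ones.updateStateByKey(updateFunc);

        // If the default level does not suit the job, it can be overridden explicitly.
        counts.persist(StorageLevel.MEMORY_AND_DISK_SER());

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

As I understand it, state RDDs from old batches are unpersisted automatically once
they are no longer needed, and SPARK-7689 removed spark.cleaner.ttl precisely
because that cleanup now happens on its own.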


Re: spark streaming questions

2016-06-22 Thread pandees waran
Regarding my question (2): from my understanding, checkpointing ensures recovery
from failures.

> On Jun 22, 2016, at 10:27 AM, pandees waran <pande...@gmail.com> wrote:
> 
> In general, if you have multiple steps in a workflow, for every batch:
> 1. stream data from S3
> 2. write it to HBase
> 3. execute a Hive step using the data in S3
> 
> In this case, all three steps are part of the workflow. That's the reason I 
> mentioned workflow orchestration.
> 
> The other question (2) is about how to manage the clusters without any 
> downtime / data loss (especially when you want to bring down the cluster and 
> create a new one for running Spark Streaming).
> 
> 
> 
>> On Jun 22, 2016, at 10:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
>> wrote:
>> 
>> Hi Pandees,
>> 
>> Can you kindly explain what you are trying to achieve by incorporating Spark 
>> Streaming with workflow orchestration? Is this some form of back-to-back 
>> seamless integration?
>> 
>> I have not used it myself but would be interested in knowing more about your 
>> use case.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 22 June 2016 at 15:54, pandees waran <pande...@gmail.com> wrote:
>>> Hi Mich, please let me know if you have any thoughts on the below. 
>>> 
>>> -- Forwarded message --
>>> From: pandees waran <pande...@gmail.com>
>>> Date: Wed, Jun 22, 2016 at 7:53 AM
>>> Subject: spark streaming questions
>>> To: user@spark.apache.org
>>> 
>>> 
>>> Hello all,
>>> 
>>> I have a few questions regarding Spark Streaming:
>>> 
>>> * I am wondering whether anyone uses Spark Streaming with workflow orchestrators 
>>> such as Data Pipeline/SWF/any other framework. Are there any advantages or 
>>> drawbacks to using a workflow orchestrator for Spark Streaming?
>>> 
>>> * How do you manage the cluster (bringing down / creating a new cluster) 
>>> without any data loss in streaming? 
>>> 
>>> I would like to hear your thoughts on this.
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Thanks,
>>> Pandeeswaran
>> 


Re: spark streaming questions

2016-06-22 Thread pandees waran
In general, if you have multiple steps in a workflow, for every batch:
1. stream data from S3
2. write it to HBase
3. execute a Hive step using the data in S3

In this case, all three steps are part of the workflow. That's the reason I
mentioned workflow orchestration.

The other question (2) is about how to manage the clusters without any downtime
/ data loss (especially when you want to bring down the cluster and create a
new one for running Spark Streaming).



> On Jun 22, 2016, at 10:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi Pandees,
> 
> Can you kindly explain what you are trying to achieve by incorporating Spark 
> Streaming with workflow orchestration? Is this some form of back-to-back 
> seamless integration?
> 
> I have not used it myself but would be interested in knowing more about your 
> use case.
> 
> Cheers,
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 22 June 2016 at 15:54, pandees waran <pande...@gmail.com> wrote:
>> Hi Mich, please let me know if you have any thoughts on the below. 
>> 
>> -- Forwarded message --
>> From: pandees waran <pande...@gmail.com>
>> Date: Wed, Jun 22, 2016 at 7:53 AM
>> Subject: spark streaming questions
>> To: user@spark.apache.org
>> 
>> 
>> Hello all,
>> 
>> I have a few questions regarding Spark Streaming:
>> 
>> * I am wondering whether anyone uses Spark Streaming with workflow orchestrators 
>> such as Data Pipeline/SWF/any other framework. Are there any advantages or 
>> drawbacks to using a workflow orchestrator for Spark Streaming?
>> 
>> * How do you manage the cluster (bringing down / creating a new cluster) 
>> without any data loss in streaming? 
>> 
>> I would like to hear your thoughts on this.
>> 
>> 
>> 
>> 
>> -- 
>> Thanks,
>> Pandeeswaran
> 


Re: spark streaming questions

2016-06-22 Thread Mich Talebzadeh
Hi Pandees,

Can you kindly explain what you are trying to achieve by incorporating
Spark Streaming with workflow orchestration? Is this some form of
back-to-back seamless integration?

I have not used it myself but would be interested in knowing more about
your use case.

Cheers,




Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 15:54, pandees waran <pande...@gmail.com> wrote:

> Hi Mich, please let me know if you have any thoughts on the below.
>
> -- Forwarded message --
> From: pandees waran <pande...@gmail.com>
> Date: Wed, Jun 22, 2016 at 7:53 AM
> Subject: spark streaming questions
> To: user@spark.apache.org
>
>
> Hello all,
>
> I have a few questions regarding Spark Streaming:
>
> * I am wondering whether anyone uses Spark Streaming with workflow orchestrators
> such as Data Pipeline/SWF/any other framework. Are there any advantages or
> drawbacks to using a workflow orchestrator for Spark Streaming?
>
> * How do you manage the cluster (bringing down / creating a new cluster)
> without any data loss in streaming?
>
> I would like to hear your thoughts on this.
>
>
>
>
> --
> Thanks,
> Pandeeswaran
>


spark streaming questions

2016-06-22 Thread pandees waran
Hello all,

I have a few questions regarding Spark Streaming:

* I am wondering whether anyone uses Spark Streaming with workflow orchestrators
such as Data Pipeline/SWF/any other framework. Are there any advantages or
drawbacks to using a workflow orchestrator for Spark Streaming?

* How do you manage the cluster (bringing down / creating a new cluster)
without any data loss in streaming?

I would like to hear your thoughts on this.
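
On the second question, a minimal sketch (hypothetical checkpoint path and source)
of the checkpoint-based restart pattern usually suggested for replacing a cluster
without losing track of where the stream left off: checkpoint to durable storage,
enable the receiver write-ahead log, and recreate the context with getOrCreate on
the new cluster.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class GracefulRestartSketch {
    // Checkpoint directory on durable storage (hypothetical bucket) so that a
    // replacement cluster can pick up where the old one left off.
    private static final String CHECKPOINT_DIR = "s3a://my-bucket/streaming-checkpoint";

    public static void main(String[] args) throws InterruptedException {
        // Either recover the context (DStream graph, offsets, pending batches) from
        // the checkpoint, or build a fresh one via the factory method.
        JavaStreamingContext jssc =
                JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, GracefulRestartSketch::createContext);
        jssc.start();
        jssc.awaitTermination();
    }

    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf()
                .setAppName("graceful-restart-sketch")
                // Finish in-flight batches before the old cluster is torn down.
                .set("spark.streaming.stopGracefullyOnShutdown", "true")
                // For receiver-based sources, the write-ahead log avoids losing
                // received-but-unprocessed data when the cluster goes away.
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        jssc.checkpoint(CHECKPOINT_DIR);
        // Hypothetical source and output; stands in for the real pipeline.
        jssc.socketTextStream("localhost", 9999).print();
        return jssc;
    }
}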


Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-17 Thread Nipun Arora
Thanks Cody, that's what I thought.
Currently, in the cases where I want global ordering, I am doing a collect()
call and going through everything on the client.
I wonder if there is a better way to do globally ordered execution across
micro-batches?


I am having some trouble with acquiring resources and releasing them after the
iterator is consumed in Java.
It might have to do with my resource allocator itself. I will investigate
further and get back to you.

Thanks
Nipun


On Mon, Nov 16, 2015 at 5:11 PM Cody Koeninger  wrote:

> Ordering would be on a per-partition basis, not global ordering.
>
> You typically want to acquire resources inside the foreachpartition
> closure, just before handling the iterator.
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
>
> On Mon, Nov 16, 2015 at 4:02 PM, Nipun Arora 
> wrote:
>
>> Hi,
>> I wanted to understand the foreachPartition logic. In the code below, I am
>> assuming the iterator executes in a distributed fashion.
>>
>> 1. Assuming I have a stream of timestamp data that is sorted, will the
>> stringIterator in foreachPartition process each line in order?
>>
>> 2. Assuming I have a static pool of Kafka connections, where should I get
>> a connection from the pool to be used to send data to Kafka?
>>
>> addMTSUnmatched.foreachRDD(
>>         new Function<JavaRDD<String>, Void>() {
>>             @Override
>>             public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>                 stringJavaRDD.foreachPartition(
>>                         new VoidFunction<Iterator<String>>() {
>>                             @Override
>>                             public void call(Iterator<String> stringIterator) throws Exception {
>>                                 while (stringIterator.hasNext()) {
>>                                     String str = stringIterator.next();
>>                                     if (OnlineUtils.ESFlag) {
>>                                         OnlineUtils.printToFile(str, 1, type1_outputFile, OnlineUtils.client);
>>                                     } else {
>>                                         OnlineUtils.printToFile(str, 1, type1_outputFile);
>>                                     }
>>                                 }
>>                             }
>>                         }
>>                 );
>>                 return null;
>>             }
>>         }
>> );
>>
>>
>>
>> Thanks
>>
>> Nipun
>>
>>
>


[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi,
I wanted to understand the foreachPartition logic. In the code below, I am
assuming the iterator executes in a distributed fashion.

1. Assuming I have a stream of timestamp data that is sorted, will the
stringIterator in foreachPartition process each line in order?

2. Assuming I have a static pool of Kafka connections, where should I get a
connection from the pool to be used to send data to Kafka?

addMTSUnmatched.foreachRDD(
        new Function<JavaRDD<String>, Void>() {
            @Override
            public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
                stringJavaRDD.foreachPartition(
                        new VoidFunction<Iterator<String>>() {
                            @Override
                            public void call(Iterator<String> stringIterator) throws Exception {
                                while (stringIterator.hasNext()) {
                                    String str = stringIterator.next();
                                    if (OnlineUtils.ESFlag) {
                                        OnlineUtils.printToFile(str, 1, type1_outputFile, OnlineUtils.client);
                                    } else {
                                        OnlineUtils.printToFile(str, 1, type1_outputFile);
                                    }
                                }
                            }
                        }
                );
                return null;
            }
        }
);



Thanks

Nipun


Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Cody Koeninger
Ordering would be on a per-partition basis, not global ordering.

You typically want to acquire resources inside the foreachpartition
closure, just before handling the iterator.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
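
To make that concrete, a rough sketch (Spark 2.x Java API; the broker address and
the tiny pool class are hypothetical) of borrowing a Kafka producer from a static,
per-executor pool inside foreachPartition and returning it once the iterator has
been drained:

import java.util.Iterator;
import java.util.Properties;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public class KafkaSinkSketch {

    // A very small, hypothetical producer pool. Because it is static, it is created
    // lazily in each executor JVM rather than being shipped with the closure.
    static class ProducerPool {
        private static final ConcurrentLinkedQueue<KafkaProducer<String, String>> pool =
                new ConcurrentLinkedQueue<>();

        static KafkaProducer<String, String> borrow() {
            KafkaProducer<String, String> p = pool.poll();
            if (p != null) return p;
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical brokers
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            return new KafkaProducer<>(props);
        }

        static void release(KafkaProducer<String, String> p) {
            pool.offer(p);
        }
    }

    static void sendToKafka(JavaDStream<String> stream, String topic) {
        stream.foreachRDD((JavaRDD<String> rdd) ->
            rdd.foreachPartition((Iterator<String> it) -> {
                // Borrow the connection inside the partition closure, just before
                // iterating, and return it when the partition has been processed.
                KafkaProducer<String, String> producer = ProducerPool.borrow();
                try {
                    while (it.hasNext()) {
                        producer.send(new ProducerRecord<>(topic, it.next()));
                    }
                } finally {
                    ProducerPool.release(producer);
                }
            })
        );
    }
}

The point is that the producer is created (or borrowed) on the executor, inside the
partition closure, rather than on the driver, where it would otherwise have to be
serialized along with the closure.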

On Mon, Nov 16, 2015 at 4:02 PM, Nipun Arora 
wrote:

> Hi,
> I wanted to understand the foreachPartition logic. In the code below, I am
> assuming the iterator executes in a distributed fashion.
>
> 1. Assuming I have a stream of timestamp data that is sorted, will the
> stringIterator in foreachPartition process each line in order?
>
> 2. Assuming I have a static pool of Kafka connections, where should I get
> a connection from the pool to be used to send data to Kafka?
>
> addMTSUnmatched.foreachRDD(
>         new Function<JavaRDD<String>, Void>() {
>             @Override
>             public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>                 stringJavaRDD.foreachPartition(
>                         new VoidFunction<Iterator<String>>() {
>                             @Override
>                             public void call(Iterator<String> stringIterator) throws Exception {
>                                 while (stringIterator.hasNext()) {
>                                     String str = stringIterator.next();
>                                     if (OnlineUtils.ESFlag) {
>                                         OnlineUtils.printToFile(str, 1, type1_outputFile, OnlineUtils.client);
>                                     } else {
>                                         OnlineUtils.printToFile(str, 1, type1_outputFile);
>                                     }
>                                 }
>                             }
>                         }
>                 );
>                 return null;
>             }
>         }
> );
>
>
>
> Thanks
>
> Nipun
>
>


2 spark streaming questions

2014-11-23 Thread tian zhang

Hi, Dear Spark Streaming Developers and Users,
We are prototyping with Spark Streaming and hit the following 2 issues that I
would like to seek your expertise on.

1) We have a Spark Streaming application in Scala that reads data from Kafka into
a DStream, does some processing, and outputs a transformed DStream. If for some
reason the Kafka connection is not available or times out, the Spark Streaming job
will start to emit empty RDDs afterwards. The log is clean, without any ERROR
indicator. I googled around and this seems to be a known issue. We believe that
the Spark Streaming infrastructure should either retry or return an
error/exception. Can you share how you handle this case?

2) We would like to implement a Spark Streaming job that joins a 1-minute duration
DStream of real-time events with a metadata RDD that was read from a database. The
metadata only changes slightly each day in the database. So what is the best
practice for refreshing the RDD daily while keeping the streaming join job
running? Is this do-able as of Spark 1.1.0?
Thanks.
Tian
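
On question (2), a sketch of one common pattern (the loader and names here are
hypothetical): hold the metadata in a driver-side reference that a scheduled job
refreshes, and do the join inside transform(), which is re-evaluated for every
micro-batch, so each batch sees whichever RDD is current at that moment.

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class MetadataJoinSketch {

    // Holds the current metadata RDD; a driver-side timer (not shown) would call
    // refresh() once a day to re-read the table and swap the reference.
    private static final AtomicReference<JavaPairRDD<String, String>> metadata =
            new AtomicReference<>();

    static void refresh(JavaSparkContext jsc) {
        JavaPairRDD<String, String> fresh = loadMetadataFromDatabase(jsc); // hypothetical loader
        fresh.cache().count();          // materialize before swapping in
        JavaPairRDD<String, String> old = metadata.getAndSet(fresh);
        if (old != null) old.unpersist();
    }

    static JavaPairDStream<String, Tuple2<String, String>> joinWithMetadata(
            JavaPairDStream<String, String> events) {
        // transform() runs on the driver for every batch, so each micro-batch
        // picks up whatever RDD is currently in the reference.
        return events.transformToPair((JavaPairRDD<String, String> rdd) -> rdd.join(metadata.get()));
    }

    private static JavaPairRDD<String, String> loadMetadataFromDatabase(JavaSparkContext jsc) {
        // Placeholder for the real JDBC / table read.
        return jsc.parallelizePairs(Arrays.asList(new Tuple2<>("key", "value")));
    }
}

Because the swap happens on the driver, the streaming job keeps running across the
daily refresh; when the metadata is small, broadcasting it instead of joining RDDs
is the usual refinement to avoid reshuffling it every batch.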



Re: spark streaming questions

2014-06-25 Thread Chen Song
Thanks Anwar.


On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal anriza...@gmail.com wrote:


 On Tue, Jun 17, 2014 at 5:39 PM, Chen Song chen.song...@gmail.com wrote:

 Hey

 I am new to spark streaming and apologize if these questions have been
 asked.

 * In StreamingContext, reduceByKey() seems to only work on the RDDs of
 the current batch interval, not including RDDs of previous batches. Is my
 understanding correct?


 It's correct.



 * If the above statement is correct, what functions should one use to do
 processing across the continuous stream of batches of data? I see 2
 functions, reduceByKeyAndWindow and updateStateByKey, which serve this
 purpose.


 I presume that you need to keep state that goes beyond one batch, i.e. across
 multiple batches. In this case, yes, updateStateByKey is the one you will
 use. Basically, updateStateByKey wraps the state into an RDD.





 My use case is an aggregation and doesn't fit a windowing scenario.

 * As for updateStateByKey, I have a few questions.
 ** Over time, will Spark stage the original data somewhere to replay in case
 of failures? Say the Spark job runs for weeks; I am wondering how that is
 sustained.
 ** Say my reduce key space is partitioned by some date field and I would
 like to stop processing old dates after a period of time (this is not a simple
 windowing scenario, as the date the data belongs to is not the same as the
 date it arrives). How can I handle this to tell Spark to discard data
 for old dates?


 You will need to call checkpoint (see
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#rdd-checkpointing),
 which persists the RDD and truncates its lineage; otherwise the lineage metadata
 keeps growing and consumes memory (and stack during execution). You can set the
 checkpointing interval that suits your need.

 Now, if you also want to reset your state after some time, there is no
 immediate way I can think of, but you can do it through updateStateByKey,
 maybe by book-keeping the timestamp.




 Thank you,

 Best
 Chen






-- 
Chen Song
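
A minimal sketch (Spark 2.x Java API; the class name and the one-week retention are
made up) of the timestamp book-keeping Anwar suggests: store the last-seen time
alongside the running sum, and return an empty Optional to drop keys that have gone
quiet, since updateStateByKey visits every key in the state on every batch.

import java.io.Serializable;
import java.util.List;

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class ExpiringStateSketch {

    // State value: running sum plus the time the key was last seen.
    public static class SumWithTimestamp implements Serializable {
        public final long sum;
        public final long lastSeenMs;
        public SumWithTimestamp(long sum, long lastSeenMs) {
            this.sum = sum;
            this.lastSeenMs = lastSeenMs;
        }
    }

    private static final long MAX_IDLE_MS = 7L * 24 * 60 * 60 * 1000; // hypothetical retention: one week

    // Requires checkpointing to be enabled on the StreamingContext, as Anwar notes above.
    static JavaPairDStream<String, SumWithTimestamp> runningSums(JavaPairDStream<String, Long> perBatchCounts) {
        Function2<List<Long>, Optional<SumWithTimestamp>, Optional<SumWithTimestamp>> update =
                (values, state) -> {
                    long now = System.currentTimeMillis();
                    long sum = state.isPresent() ? state.get().sum : 0L;
                    long lastSeen = state.isPresent() ? state.get().lastSeenMs : now;
                    if (!values.isEmpty()) {
                        for (Long v : values) sum += v;
                        lastSeen = now;
                    } else if (now - lastSeen > MAX_IDLE_MS) {
                        // Returning an empty Optional removes the key from the state,
                        // which is the "book-keeping the timestamp" idea.
                        return Optional.empty();
                    }
                    return Optional.of(new SumWithTimestamp(sum, lastSeen));
                };
        return perBatchCounts.updateStateByKey(update);
    }
}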


spark streaming questions

2014-06-17 Thread Chen Song
Hey

I am new to spark streaming and apologize if these questions have been
asked.

* In StreamingContext, reduceByKey() seems to only work on the RDDs of the
current batch interval, not including RDDs of previous batches. Is my
understanding correct?

* If the above statement is correct, what functions should one use to do
processing across the continuous stream of batches of data? I see 2 functions,
reduceByKeyAndWindow and updateStateByKey, which serve this purpose.

My use case is an aggregation and doesn't fit a windowing scenario.

* As for updateStateByKey, I have a few questions.
** Over time, will Spark stage the original data somewhere to replay in case of
failures? Say the Spark job runs for weeks; I am wondering how that is sustained.
** Say my reduce key space is partitioned by some date field and I would
like to stop processing old dates after a period of time (this is not a simple
windowing scenario, as the date the data belongs to is not the same as the date
it arrives). How can I handle this to tell Spark to discard data
for old dates?

Thank you,

Best
Chen


Fwd: spark streaming questions

2014-06-16 Thread Chen Song
Hey

I am new to spark streaming and apologize if these questions have been
asked.

* In StreamingContext, reduceByKey() seems to only work on the RDDs of the
current batch interval, not including RDDs of previous batches. Is my
understanding correct?

* If the above statement is correct, what functions should one use to do
processing across the continuous stream of batches of data? I see 2 functions,
reduceByKeyAndWindow and updateStateByKey, which serve this purpose.

My use case is an aggregation and doesn't fit a windowing scenario.

* As for updateStateByKey, I have a few questions.
** Over time, will Spark stage the original data somewhere to replay in case of
failures? Say the Spark job runs for weeks; I am wondering how that is sustained.
** Say my reduce key space is partitioned by some date field and I would
like to stop processing old dates after a period of time (this is not a simple
windowing scenario, as the date the data belongs to is not the same as the date
it arrives). How can I handle this to tell Spark to discard data
for old dates?

Thank you,

Best
Chen




-- 
Chen Song