Spark Streaming questions, just 2
Hello all,

I have a couple of Spark Streaming questions. Thanks.

1. In the case of stateful operations, the data is, by default, persisted in memory. Does "in memory" here mean MEMORY_ONLY? When is it removed from memory?

2. I do not see any documentation for spark.cleaner.ttl. Is this no longer necessary? (SPARK-7689)
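For context, a minimal sketch of the kind of stateful job question (1) refers to, with checkpointing enabled (required for stateful operations). The checkpoint path, socket source, and update function below are illustrative, not taken from the original mail:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("stateful-sketch"), Seconds(10))

        // Stateful operations require a checkpoint directory; the framework
        // persists the generated state DStream and cleans up old state and
        // checkpoints itself, without relying on spark.cleaner.ttl.
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // hypothetical path

        val words = ssc.socketTextStream("localhost", 9999) // hypothetical source
        val runningCounts = words.map(w => (w, 1L)).updateStateByKey[Long] {
          (newValues: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + newValues.sum)
        }
        runningCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }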
Re: spark streaming questions
For my question (2): from my understanding, checkpointing ensures recovery from failures.

Sent from my iPhone

> On Jun 22, 2016, at 10:27 AM, pandees waran <pande...@gmail.com> wrote:
> [...]
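As a sketch of what that recovery path looks like in practice (the checkpoint directory, app name, and batch interval are illustrative): the driver is restarted with StreamingContext.getOrCreate, which rebuilds the context from the checkpoint directory when one exists and otherwise creates a fresh one.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointRecoverySketch {
      val checkpointDir = "hdfs:///tmp/streaming-checkpoints" // hypothetical path

      // Builds the full streaming computation; only called when no checkpoint exists.
      def createContext(): StreamingContext = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("recovery-sketch"), Seconds(60))
        ssc.checkpoint(checkpointDir)
        // ... define sources and transformations here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recover from the checkpoint after a failure, or start fresh.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }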
Re: spark streaming questions
In general, if you have multiple steps in a workflow, then for every batch:
1. stream data from S3
2. write it to HBase
3. execute a Hive step using the data in S3

In this case all three steps are part of the workflow; that's the reason I mentioned workflow orchestration.

The other question (2) is about how to manage the clusters without any downtime or data loss, especially when you want to bring down the cluster and create a new one for running Spark Streaming.

Sent from my iPhone

> On Jun 22, 2016, at 10:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> [...]
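To make those three steps concrete, here is a rough sketch of the per-batch workflow driven from inside the streaming job itself. The bucket path is made up, and writeToHBase/runHiveStep are placeholders for whatever HBase client code or external orchestrator (Data Pipeline, SWF, ...) actually performs those steps:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PerBatchWorkflowSketch {
      // Placeholder for the real HBase write (e.g. via the HBase client API).
      def writeToHBase(records: Iterator[String]): Unit =
        records.foreach(r => println(s"would write to HBase: $r"))

      // Placeholder for triggering the Hive step, possibly via an external orchestrator.
      def runHiveStep(batchTimeMs: Long): Unit =
        println(s"would run Hive step for batch at $batchTimeMs")

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("workflow-sketch"), Seconds(60))

        // 1. Stream data from S3 (picks up new files landing under the prefix).
        val s3Stream = ssc.textFileStream("s3n://my-bucket/incoming/")

        s3Stream.foreachRDD { (rdd, time) =>
          // 2. Write the batch to HBase, one connection per partition.
          rdd.foreachPartition(partition => writeToHBase(partition))
          // 3. Kick off the Hive step for this batch from the driver.
          runHiveStep(time.milliseconds)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }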
Re: spark streaming questions
Hi Pandees,

Can you kindly explain what you are trying to achieve by incorporating Spark Streaming with workflow orchestration? Is this some form of back-to-back seamless integration?

I have not used it myself but would be interested in knowing more about your use case.

Cheers,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 22 June 2016 at 15:54, pandees waran <pande...@gmail.com> wrote:
> [...]
spark streaming questions
Hello all,

I have a few questions regarding Spark Streaming:

* I am wondering whether anyone uses Spark Streaming with workflow orchestrators such as Data Pipeline, SWF, or any other framework. Are there any advantages or drawbacks to using a workflow orchestrator for Spark Streaming?

* How do you manage the cluster (bringing down / creating a new cluster) without any data loss in streaming?

I would like to hear your thoughts on this.
Re: [SPARK STREAMING] Questions regarding foreachPartition
Thanks Cody, that's what I thought. Currently, in the cases where I want global ordering, I am doing a collect() call and going through everything in the client. I wonder if there is a better way to do globally ordered execution across micro-batches?

I am having some trouble with acquiring resources and letting them go after the iterator in Java. It might have to do with my resource allocator itself. I will investigate further and get back to you.

Thanks
Nipun

On Mon, Nov 16, 2015 at 5:11 PM Cody Koeninger wrote:
> [...]
[SPARK STREAMING] Questions regarding foreachPartition
Hi,
I wanted to understand the foreachPartition logic. In the code below, I am assuming the iterator is executing in a distributed fashion.

1. Assuming I have a stream which has timestamp data which is sorted, will the string iterator in foreachPartition process each line in order?

2. Assuming I have a static pool of Kafka connections, where should I get a connection from the pool to be used to send data to Kafka?

addMTSUnmatched.foreachRDD(
    new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
            stringJavaRDD.foreachPartition(
                new VoidFunction<Iterator<String>>() {
                    @Override
                    public void call(Iterator<String> stringIterator) throws Exception {
                        // Runs on the executor for each partition of the batch RDD.
                        while (stringIterator.hasNext()) {
                            String str = stringIterator.next();
                            if (OnlineUtils.ESFlag) {
                                OnlineUtils.printToFile(str, 1, type1_outputFile, OnlineUtils.client);
                            } else {
                                OnlineUtils.printToFile(str, 1, type1_outputFile);
                            }
                        }
                    }
                }
            );
            return null;
        }
    }
);

Thanks
Nipun
Re: [SPARK STREAMING] Questions regarding foreachPartition
Ordering would be on a per-partition basis, not global ordering.

You typically want to acquire resources inside the foreachPartition closure, just before handling the iterator.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

On Mon, Nov 16, 2015 at 4:02 PM, Nipun Arora wrote:
> [...]
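A sketch of the pattern that design-patterns section describes. ConnectionPool below is a stand-in for a real pooling library (for example pooled Kafka producers), not an actual API; the point is that the connection is obtained inside foreachPartition, so it is created on the executor and reused for every record of the partition instead of being serialized from the driver:

    import org.apache.spark.streaming.dstream.DStream

    object ForeachPartitionSketch {
      // Stand-in for a real, lazily initialized pool living on each executor.
      object ConnectionPool {
        def borrow(): java.io.PrintWriter = new java.io.PrintWriter(System.out)
        def give(conn: java.io.PrintWriter): Unit = conn.flush()
      }

      def sendPartitions(stream: DStream[String]): Unit = {
        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // Runs on the executor: borrow once per partition, reuse for all records.
            val conn = ConnectionPool.borrow()
            try records.foreach(r => conn.println(r))
            finally ConnectionPool.give(conn)
          }
        }
      }
    }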
2 spark streaming questions
Hi, Dear Spark Streaming Developers and Users,

We are prototyping using Spark Streaming and hit the following 2 issues on which I would like to seek your expertise.

1) We have a Spark Streaming application in Scala that reads data from Kafka into a DStream, does some processing, and outputs a transformed DStream. If for some reason the Kafka connection is not available or times out, the Spark Streaming job will start to emit empty RDDs afterwards. The log is clean without any ERROR indicator. I googled around and this seems to be a known issue. We believe that the Spark Streaming infrastructure should either retry or return an error/exception. Can you share how you handle this case?

2) We would like to implement a Spark Streaming job that joins a 1-minute-duration DStream of real-time events with a metadata RDD that was read from a database. The metadata only changes slightly each day in the database. So what is the best practice for refreshing the RDD daily while keeping the streaming join job running? Is this do-able as of Spark 1.1.0?

Thanks.
Tian
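On question (2), one common approach is to rebuild the reference RDD on the driver once it has grown stale and join inside transform(), whose function runs on the driver for every batch. Below is a sketch under those assumptions; loadMetadata stands in for the real database read, and the once-a-day refresh is an illustrative choice, not the only way to do this:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    object SlowChangingJoinSketch {
      // Hypothetical loader for (key -> metadata) pairs from the database.
      def loadMetadata(sc: SparkContext): RDD[(String, String)] =
        sc.parallelize(Seq(("k1", "meta1"), ("k2", "meta2")))

      @volatile private var metadata: RDD[(String, String)] = _
      @volatile private var loadedAt: Long = 0L
      private val refreshMs = 24L * 60 * 60 * 1000  // reload at most once a day

      def joined(events: DStream[(String, String)]): DStream[(String, (String, String))] =
        events.transform { rdd =>
          // transform's function runs on the driver for each batch, so the
          // reference RDD can be swapped out here without stopping the job.
          val now = System.currentTimeMillis()
          if (metadata == null || now - loadedAt > refreshMs) {
            metadata = loadMetadata(rdd.sparkContext).cache()
            loadedAt = now
          }
          rdd.join(metadata)  // per-batch join against the current metadata
        }
    }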
Re: spark streaming questions
Thanks Anwar.

On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal <anriza...@gmail.com> wrote:

On Tue, Jun 17, 2014 at 5:39 PM, Chen Song <chen.song...@gmail.com> wrote:

> Hey, I am new to Spark Streaming and apologize if these questions have been asked.
>
> * In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct?

It's correct.

> * If the above statement is correct, what functions should one use to do processing on the continuous stream of batches of data? I see 2 functions, reduceByKeyAndWindow and updateStateByKey, which serve this purpose.

I presume that you need to keep state that goes beyond one batch, so multiple batches. In this case, yes, updateStateByKey is the one you will use. Basically, updateStateByKey wraps a state into an RDD.

> My use case is an aggregation and doesn't fit a windowing scenario.
>
> * As for updateStateByKey, I have a few questions.
>
> ** Over time, will Spark stage the original data somewhere to replay in case of failures? Say the Spark job runs for weeks; I am wondering how that sustains.
>
> ** Say my reduce key space is partitioned by some date field and I would like to stop processing old dates after a period of time (this is not a simple windowing scenario, as which date the data belongs to is not the same thing as when the data arrives). How can I tell Spark to discard data for old dates?

You will need to call checkpoint (see http://spark.apache.org/docs/latest/streaming-programming-guide.html#rdd-checkpointing), which persists the RDD metadata that would otherwise consume memory (and grow the execution stack). You can set the checkpointing interval to suit your needs.

Now, if you also want to reset your state after some time, there is no immediate way I can think of, but you can do it through updateStateByKey, maybe by book-keeping the timestamp.

> Thank you,
> Best
> Chen

--
Chen Song
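A sketch of the timestamp book-keeping idea mentioned at the end; the (count, lastSeen) state shape and the 7-day cutoff are illustrative choices, not from this thread. The update function keeps the last time a key saw data and returns None once the key has gone stale, which drops it from the state (checkpointing must be enabled, as noted above):

    import org.apache.spark.streaming.dstream.DStream

    object ExpiringStateSketch {
      private val maxAgeMs = 7L * 24 * 60 * 60 * 1000  // illustrative 7-day cutoff

      // State per key: (running sum, last time data arrived for this key).
      def countsWithExpiry(events: DStream[(String, Long)]): DStream[(String, (Long, Long))] =
        events.updateStateByKey[(Long, Long)] {
          (newValues: Seq[Long], state: Option[(Long, Long)]) =>
            val now = System.currentTimeMillis()
            val (sum, lastSeen) = state.getOrElse((0L, now))
            if (newValues.nonEmpty)
              Some((sum + newValues.sum, now))   // fresh data: update and touch
            else if (now - lastSeen > maxAgeMs)
              None                               // stale key: discard its state
            else
              Some((sum, lastSeen))              // quiet but not yet expired
        }
    }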
spark streaming questions
Hey, I am new to Spark Streaming and apologize if these questions have been asked.

* In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct?

* If the above statement is correct, what functions should one use to do processing on the continuous stream of batches of data? I see 2 functions, reduceByKeyAndWindow and updateStateByKey, which serve this purpose. My use case is an aggregation and doesn't fit a windowing scenario.

* As for updateStateByKey, I have a few questions.

** Over time, will Spark stage the original data somewhere to replay in case of failures? Say the Spark job runs for weeks; I am wondering how that sustains.

** Say my reduce key space is partitioned by some date field and I would like to stop processing old dates after a period of time (this is not a simple windowing scenario, as which date the data belongs to is not the same thing as when the data arrives). How can I tell Spark to discard data for old dates?

Thank you,
Best
Chen