Semantics of Manual Offset Commit for Kafka Spark Streaming

2019-10-14 Thread Andre Piwoni
When using manual Kafka offset commits in a Spark Streaming job, and the application fails to process the current batch without committing its offsets in the executor, is it expected behavior that the next batch will still be processed and the offsets moved forward regardless of the failure to commit? It
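
For reference, a minimal sketch of the manual-commit pattern this thread is about (spark-streaming-kafka-0-10, Scala; broker, topic and group id are illustrative). The running stream keeps scheduling the next batch from its own tracked offsets; commitAsync only records where a restarted application should resume.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object ManualCommitSketch extends App {
      val ssc = new StreamingContext(new SparkConf().setAppName("manual-commit"), Seconds(10))
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",             // illustrative broker
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "manual-commit-example",
        "enable.auto.commit" -> (false: java.lang.Boolean))
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

      stream.foreachRDD { rdd =>
        val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        try {
          rdd.foreach(record => ())                                     // stand-in for real processing
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)    // commit only after success
        } catch {
          case e: Exception =>
            // nothing committed; the running job still schedules the next batch,
            // the committed offsets matter only when the application restarts
            println(s"batch failed, offsets not committed: $e")
        }
      }
      ssc.start(); ssc.awaitTermination()
    }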

Re: Kafka Spark Streaming integration : Relationship between DStreams and Tasks

2019-05-12 Thread Sheel Pancholi
ed to tasks *(T1, T2, T3, T4, T5).* Will >*ubatch2* of partitions *(P1',P2',P3',P4',P5')* at time *T5* be also >assigned to the same set of tasks *(T1, T2, T3, T4, T5)* or will new >tasks *(T6, T7, T8, T9, T10)* be created for *ubatch2*? > > > I have put up this question on SO @ > https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams > . > > Regards > Sheel >

Kafka Spark Streaming integration : Relationship between DStreams and Tasks

2019-05-12 Thread Sheel Pancholi
his question on SO @ https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams . Regards Sheel

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
Hello there, I just upgraded to spark 2.3.1 from spark 2.2.1, ran my streaming workload and got an error (java.lang.AbstractMethodError) I had never seen before; see the error stack attached in (a) below. Does anyone know if spark 2.3.1 works well with kafka spark-streaming-kafka-0-10? this link
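
For what it's worth, java.lang.AbstractMethodError after an upgrade usually means some artifact on the classpath was compiled against a different Spark version. A hedged build.sbt sketch (assuming an sbt build) that pins the Kafka integration to the upgraded Spark version:

    val sparkVersion = "2.3.1"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                  % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming"             % sparkVersion % "provided",
      // must match the Spark version exactly; a leftover 2.2.1 jar is a common trigger
      "org.apache.spark" %% "spark-streaming-kafka-0-10"  % sparkVersion
    )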

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-21 Thread Aakash Basu
> and compact data format if CSV isn't required. > > -- > *From:* Aakash Basu <aakash.spark@gmail.com> > *Sent:* Friday, March 16, 2018 9:12:39 AM > *To:* sagar grover > *Cc:* Bowden, Chris; Tathagata Das; Dylan Guedes; Georg Heiler; user; > jagrati.go...@myntra.com

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
;>>> >>>> Cool! Shall try it and revert back tomm. >>>> >>>> Thanks a ton! >>>> >>>> On 15-Mar-2018 11:50 PM, "Bowden, Chris" <chris.bow...@microfocus.com> >>>> wrote: >>>> >>>>>

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
nd >>>> deserialization is often an orthogonal and implicit transform. However, in >>>> Spark, serialization and deserialization is an explicit transform (e.g., >>>> you define it in your query plan). >>>> >>>> >>>> To make this mo

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
>>> offers from_csv out of the box as an expression (although CSV is well >>> supported as a data source). You could implement an expression by reusing a >>> lot of the supporting CSV classes which may result in a better user >>> experience vs. explicitly using split
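
A sketch of the split-and-index approach under discussion, given that from_csv is not available as an expression (Scala; assumes a SparkSession named spark, a local broker, and a topic whose values look like "id,name,amount"):

    import org.apache.spark.sql.functions.{col, split}

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "csv_in")
      .load()

    // the Kafka value column is binary; casting to string works because both sides
    // assume UTF-8, then split + element access stands in for a from_csv expression
    val parsed = raw.selectExpr("CAST(value AS STRING) AS line")
      .withColumn("cols", split(col("line"), ","))
      .selectExpr("cols[0] AS id", "cols[1] AS name", "CAST(cols[2] AS DOUBLE) AS amount")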

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread sagar grover
may result in a better user >> experience vs. explicitly using split and array indices, etc. In this >> simple example, casting the binary to a string just works because there is >> a common understanding of strings encoded as bytes between Spark and Kafka >> by default. >>

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
--- > *From:* Aakash Basu <aakash.spark@gmail.com> > *Sent:* Thursday, March 15, 2018 10:48:45 AM > *To:* Bowden, Chris > *Cc:* Tathagata Das; Dylan Guedes; Georg Heiler; user > *Subject:* Re: Multiple Kafka Spark Streaming Dataframe Join query > > Hey Chris, > &g

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
com> Sent: Thursday, March 15, 2018 7:52:28 AM To: Tathagata Das Cc: Dylan Guedes; Georg Heiler; user Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query Hi, And if I run this below piece of code - from pyspark.sql import SparkSession import time class test: spark

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Tathagata Das
> From: Aakash Basu <aakash.spark@gmail.com> > Sent: Thursday, March 15, 2018 7:52:28 AM > To: Tathagata Das > Cc: Dylan Guedes; Georg Heiler; user > Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query > > Hi, > > And if I run this below piece of code - > >

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Hi, And if I run this below piece of code - from pyspark.sql import SparkSession import time class test: spark = SparkSession.builder \ .appName("DirectKafka_Spark_Stream_Stream_Join") \ .getOrCreate() # ssc = StreamingContext(spark, 20) table1_stream =

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Any help on the above? On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu wrote: > Hi, > > I progressed a bit in the above mentioned topic - > > 1) I am feeding a CSV file into the Kafka topic. > 2) Feeding the Kafka topic as readStream as TD's article suggests. > 3) Then,

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Hi, I progressed a bit in the above mentioned topic - 1) I am feeding a CSV file into the Kafka topic. 2) Feeding the Kafka topic as readStream as TD's article suggests. 3) Then, simply trying to do a show on the streaming dataframe, using queryName('XYZ') in the writeStream and writing a sql

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Thanks to TD, the savior! Shall look into it. On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das wrote: > Relevant: https://databricks.com/blog/2018/03/13/ > introducing-stream-stream-joins-in-apache-spark-2-3.html > > This is true stream-stream join which will

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html This is true stream-stream join which will automatically buffer delayed data and appropriately join stuff with SQL join semantics. Please check it out :) TD On Wed, Mar 14, 2018 at 12:07
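
A condensed sketch of the stream-stream join described in that post (Structured Streaming, Scala; assumes a SparkSession named spark, and the topics, columns and time bounds are illustrative):

    import org.apache.spark.sql.functions.expr

    def readTopic(topic: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", topic)
      .load()

    val impressions = readTopic("impressions")
      .selectExpr("CAST(value AS STRING) AS adId", "timestamp AS impressionTime")
    val clicks = readTopic("clicks")
      .selectExpr("CAST(value AS STRING) AS clickAdId", "timestamp AS clickTime")

    // watermarks bound how much late data Spark buffers on each side of the join
    val joined = impressions.withWatermark("impressionTime", "10 minutes").join(
      clicks.withWatermark("clickTime", "20 minutes"),
      expr("clickAdId = adId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))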

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
I misread it, and thought that you question was if pyspark supports kafka lol. Sorry! On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu wrote: > Hey Dylan, > > Great! > > Can you revert back to my initial and also the latest mail? > > Thanks, > Aakash. > > On 15-Mar-2018

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hey Dylan, Great! Can you revert back to my initial and also the latest mail? Thanks, Aakash. On 15-Mar-2018 12:27 AM, "Dylan Guedes" wrote: > Hi, > > I've been using the Kafka with pyspark since 2.1. > > On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
Hi, I've been using the Kafka with pyspark since 2.1. On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu wrote: > Hi, > > I'm yet to. > > Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package > allows Python? I read somewhere, as of now Scala and Java are

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, I'm yet to. Just want to know, when will Spark 2.3 with the 0.10 Kafka Spark package allow Python? I read somewhere that, as of now, Scala and Java are the languages to be used. Please correct me if I am wrong. Thanks, Aakash. On 14-Mar-2018 8:24 PM, "Georg Heiler" wrote:

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Georg Heiler
Did you try spark 2.3 with structured streaming? There, watermarking and plain SQL might be really interesting for you. Aakash Basu wrote on Wed, 14 March 2018 at 14:57: > Hi, > > > > *Info (Using):Spark Streaming Kafka 0.8 package* > > *Spark 2.2.1* > *Kafka 1.0.1* >

Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, *Info (Using):Spark Streaming Kafka 0.8 package* *Spark 2.2.1* *Kafka 1.0.1* As of now, I am feeding paragraphs in Kafka console producer and my Spark, which is acting as a receiver is printing the flattened words, which is a complete RDD operation. *My motive is to read two tables

Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have the following Kafka Consumer. The consumer is reading from a topic, and I have a mechanism which blocks the consumer from time to time. The producer is a separate thread which is continuously sending data. I want to

Re: Message getting lost in Kafka + Spark Streaming

2017-06-01 Thread Vikash Pareek
other and that way > you'll have less final events. > > -Original Message- > From: Vikash Pareek [mailto:vikash.par...@infoobjects.com] > Sent: Tuesday, May 30, 2017 4:00 PM > To: user@spark.apache.org > Subject: Message getting lost in Kafka + Spark Streaming >

RE: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Sidney Feiner
. -Original Message- From: Vikash Pareek [mailto:vikash.par...@infoobjects.com] Sent: Tuesday, May 30, 2017 4:00 PM To: user@spark.apache.org Subject: Message getting lost in Kafka + Spark Streaming I am facing an issue related to spark streaming with kafka, my use case is as follow: 1. Spark

Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
val keyedMessage = new KeyedMessage[String, > String](props.getProperty("outTopicHarmonized"), > null, row.toString()) > producer.send(keyedMessage) > > } > //hack, should be done with the flush >

Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Vikash Pareek
ucer.send(keyedMessage) } //hack, should be done with the flush Thread.sleep(1000) producer.close() } * We explicitely added sleep(1000) for testing purpose. But this is also not solving the problem :( Any suggestion would be appreciat
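
One way to drop the sleep(1000) hack, sketched with the newer producer API (org.apache.kafka.clients.producer) rather than the KeyedMessage-based code above; assumes the surrounding foreachRDD supplies rdd, and the broker address is illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        partition.foreach(row =>
          producer.send(new ProducerRecord[String, String]("outTopicHarmonized", row.toString)))
        producer.flush()   // block until buffered records are actually sent
      } finally {
        producer.close()   // close also flushes, but an explicit flush keeps the intent clear
      }
    }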

Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question regarding Kafka streaming, it sounds like confusion about the scope of variables in spark generally. Is that right? If so, I'd suggest reading the documentation, starting with a simple rdd (e.g. using sparkContext.parallelize), and experimenting to confirm your
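
In the spirit of that suggestion, a small scope experiment to run in the shell (assumes a SparkContext named sc):

    val data = sc.parallelize(1 to 100)

    var sum = 0
    data.foreach(x => sum += x)   // each executor mutates its own serialized copy of sum
    println(sum)                  // still 0 on the driver

    println(data.reduce(_ + _))   // 5050: aggregate on the cluster and collect the result instead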

Can't access the data in Kafka Spark Streaming globally

2016-12-22 Thread Sree Eedupuganti
I am trying to stream the data from Kafka to Spark. JavaPairInputDStream directKafkaStream = KafkaUtils.createDirectStream(ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Chawla,Sumit
How are you producing data? I just tested your code and i can receive the messages from Kafka. Regards Sumit Chawla On Sun, Sep 18, 2016 at 7:56 PM, Sateesh Karuturi < sateesh.karutu...@gmail.com> wrote: > i am very new to *Spark streaming* and i am implementing small exercise > like sending

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread ayan guha
Empty RDD generally means Kafka is not producing msgs in those intervals. For example, if I have batch duration of 10secs and there is no msgs within any 10 secs, RDD corresponding to that 10 secs will be empty. On Mon, Sep 19, 2016 at 12:56 PM, Sateesh Karuturi < sateesh.karutu...@gmail.com>

Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Sateesh Karuturi
I am very new to *Spark streaming* and I am implementing a small exercise like sending *XML* data from *kafka* and need to receive that *streaming* data through *spark streaming.* I tried in all possible ways... but every time I am getting *empty values.* *There is no problem in Kafka side, only

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
Unless I'm misreading the image you posted, it does show event counts for the single batch that is still running, with 1.7 billion events in it. The recent batches do show 0 events, but I'm guessing that's because they're actually empty. When you said you had a kafka topic with 1.7 billion

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
Streaming UI tab showing empty events and very different metrics than on 1.5.2 On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams wrote: > After a bit of effort I moved from a Spark cluster running 1.5.2, to a > Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP.

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
After a bit of effort I moved from a Spark cluster running 1.5.2, to a Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The completed batches are no longer showing the number of events processed in the Streaming UI tab . I'm getting around 4k inserts per second in hbase, but I

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
Thanks @Cody, I will try that out. In the interm, I tried to validate my Hbase cluster by running a random write test and see 30-40K writes per second. This suggests there is noticeable room for improvement. On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger wrote: > Take HBase

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
Take HBase out of the equation and just measure what your read performance is by doing something like createDirectStream(...).foreach(_.println) not take() or print() On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams wrote: > @Cody I was able to bring my processing time
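
A sketch of that measurement, using count() so every partition is actually read (print()/take(10) only touch a few records); assumes the direct stream is already created as stream:

    stream.foreachRDD { rdd =>
      // count() scans every partition, so this reflects the raw Kafka read rate
      // with HBase out of the picture
      val n = rdd.count()
      println(s"read $n records in this batch")
    }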

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
@Cody I was able to bring my processing time down to a second by setting maxRatePerPartition as discussed. My bad that I didn't recognize it as the cause of my scheduling delay. Since then I've tried experimenting with a larger Spark Context duration. I've been trying to get some noticeable

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Colin Kincaid Williams
I'll try dropping the maxRatePerPartition=400, or maybe even lower. However even at application starts up I have this large scheduling delay. I will report my progress later on. On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger wrote: > If your batch time is 1 second and your
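
For reference, the knobs being tuned here, as a sketch with illustrative values. With a 1-second batch, the cap bounds each batch at roughly partitions x maxRatePerPartition records, so lowering it is one way to keep processing time under the batch interval:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("kafka-hbase-stream")
      .set("spark.streaming.kafka.maxRatePerPartition", "200")   // records/sec per Kafka partition
      .set("spark.streaming.backpressure.enabled", "true")       // let Spark adapt the rate (1.5+)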

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
If your batch time is 1 second and your average processing time is 1.16 seconds, you're always going to be falling behind. That would explain why you've built up an hour of scheduling delay after eight hours of running. On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
Hi Mich again, Regarding batch window, etc. I have provided the sources, but I'm not currently calling the window function. Did you see the program source? It's only 100 lines. https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 Then I would expect I'm using defaults, other than

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
Ok What is the set up for these please? batch window window length sliding interval And also in each batch window how much data do you get in (no of messages in the topic whatever)? Dr Mich Talebzadeh LinkedIn *

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
I believe you have an issue with performance? have you checked spark GUI (default 4040) for details including shuffles etc? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI. On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams wrote: > There are 25 nodes in the spark cluster. > > On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh > wrote: >> how many nodes are in your

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
There are 25 nodes in the spark cluster. On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh wrote: > how many nodes are in your cluster? > > --num-executors 6 \ > --driver-memory 4G \ > --executor-memory 2G \ > --total-executor-cores 12 \ > > > Dr Mich Talebzadeh > >

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
how many nodes are in your cluster? --num-executors 6 \ --driver-memory 4G \ --executor-memory 2G \ --total-executor-cores 12 \ Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I updated my app to Spark 1.5.2 streaming so that it consumes from Kafka using the direct api and inserts content into an hbase cluster, as described in this thread. I was away from this project for awhile due to events in my family. Currently my scheduling delay is high, but the processing time

Re: spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
I also tried jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2") But, still its not working. Any ideas why its not working ? Abhi On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand wrote: > My spark streaming checkpoint directory is being

spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
My spark streaming checkpoint directory is being written to HDFS with default replication factor of 3. In my streaming application where I am listening from kafka and setting the dfs.replication = 2 as below the files are still being written with replication factor=3 SparkConf sparkConfig = new
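
For reference, the usual way to route a Hadoop property through Spark configuration is the spark.hadoop. prefix, sketched below; whether the streaming checkpoint writer honours it is exactly what this thread is asking:

    // properties prefixed with spark.hadoop. are copied into the Hadoop Configuration
    // Spark builds, so dfs.replication=2 should apply to files written through it
    val sparkConfig = new org.apache.spark.SparkConf()
      .setAppName("kafka-stream")
      .set("spark.hadoop.dfs.replication", "2")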

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks Cody, I can see that the partitions are well distributed... Then I'm in the process of using the direct api. On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger wrote: > 60 partitions in and of itself shouldn't be a big performance issue > (as long as producers are

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
60 partitions in and of itself shouldn't be a big performance issue (as long as producers are distributing across partitions evenly). On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote: > Thanks again Cody. Regarding the details 66 kafka partitions on 3 > kafka servers,

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks again Cody. Regarding the details 66 kafka partitions on 3 kafka servers, likely 8 core systems with 10 disks each. Maybe the issue with the receiver was the large number of partitions. I had miscounted the disks and so 11*3*2 is how I decided to partition my topic on insertion, ( by my

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
print() isn't really the best way to benchmark things, since it calls take(10) under the covers, but 380 records / second for a single receiver doesn't sound right in any case. Am I understanding correctly that you're trying to process a large number of already-existing kafka messages, not keep

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hello again. I searched for "backport kafka" in the list archives but couldn't find anything but a post from Spark 0.7.2 . I was going to use accumulators to make a counter, but then saw on the Streaming tab the Receiver Statistics. Then I removed all other "functionality" except:

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi Cody, I'm going to use an accumulator right now to get an idea of the throughput. Thanks for mentioning the back ported module. Also it looks like I missed this section: https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch from the
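
A throughput counter along those lines with the Spark 1.x accumulator API (a sketch; assumes sc and a DStream named stream):

    val processed = sc.accumulator(0L, "records processed")

    stream.foreachRDD { rdd =>
      rdd.foreach(_ => processed += 1L)                   // incremented on executors
      println(s"processed so far: ${processed.value}")    // read back on the driver
    }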

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
016 10:55 AM > To: user@spark.apache.org > Subject: Improving performance of a kafka spark streaming app > > I've written an application to get content from a kafka topic with 1.7 billion > entries, get the protobuf serialized entries, and insert into hbase. > Currently the enviro

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just deserialize)? Are you limited to using spark 1.2, or is upgrading possible? The kafka direct stream is available starting with 1.3. If you're stuck on 1.2, I believe there have been some attempts to backport it, search the

Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7 billion entries, get the protobuf serialized entries, and insert into hbase. Currently the environment that I'm running in is Spark 1.2. With 8 executors and 2 cores, and 2 jobs, I'm only getting between 0-2500 writes /

Re: Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-19 Thread Cody Koeninger
There's 1 topic per partition, so you're probably better off dealing with topics that way rather than at the individual message level. http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers Look at the discussion of "HasOffsetRanges" If you
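
A sketch of that suggestion (0.8 direct stream, Scala; assumes the stream is named stream): recover the topic once per partition from its offset range instead of pairing it with every message.

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val withTopic = stream.transform { rdd =>
      // partition i of the batch RDD lines up with offsetRanges(i)
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, messages) =>
        val topic = offsetRanges(i).topic
        messages.map(message => (topic, message))
      }
    }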

Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-16 Thread Imre Nagi
Hi, I'm just trying to process the data that come from the kafka source in my spark streaming application. What I want to do is get the pair of topic and message in a tuple from the message stream. Here is my streams: val streams = KafkaUtils.createDirectStream[String, Array[Byte], >

Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-15 Thread Imre Nagi
Hi, I'm just trying to process the data that come from the kafka source in my spark streaming application. What I want to do is get the pair of topic and message in a tuple from the message stream. Here is my streams: val streams = KafkaUtils.createDirectStream[String, Array[Byte], >

RE: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Mukul Gupta
Koeninger [mailto:c...@koeninger.org] Sent: Monday, March 14, 2016 9:39 PM To: Mukul Gupta <mukul.gu...@aricent.com> Cc: user@spark.apache.org Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel So what's happening here is that print() uses take(). Take() will try to

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Cody Koeninger
s://github.com/guptamukul/sparktest.git > > > From: Cody Koeninger <c...@koeninger.org> > Sent: 11 March 2016 23:04 > To: Mukul Gupta > Cc: user@spark.apache.org > Subject: Re: Kafka + Spark streaming, RDD partitions not processed in

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-13 Thread Mukul Gupta
. Following is the link to repository: https://github.com/guptamukul/sparktest.git From: Cody Koeninger <c...@koeninger.org> Sent: 11 March 2016 23:04 To: Mukul Gupta Cc: user@spark.apache.org Subject: Re: Kafka + Spark streaming, RDD partitions not pro

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
the test after >> increasing the partitions of kafka topic to 5. This time also RDD partition >> corresponding to partition 1 of kafka was processed on one of the spark >> executor. Once processing is finished for this RDD partition, then RDD >> partitions corresponding to

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Mukul Gupta
ssed.print(90); try { jssc.start(); jssc.awaitTermination(); } catch (Exception e) { } finally { jssc.close(); } } } From: Cody Koeninger <c...@koeninger.org> Sent: 11 March 2016 20:42 To: Mukul Gupta Cc: user@spark.apache.org Subject: Re: Kafka + Spark streaming, RDD partition

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
a partitions were processed > in parallel on different spark executors. I am not clear about why spark is > waiting for operations on first RDD partition to finish, while it could > process remaining partitions in parallel? Am I missing any configuration? > Any help is appreciated. Thanks

Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-10 Thread Mukul Gupta
Am I missing any configuration? Any help is appreciated. Thanks, Mukul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-RDD-partitions-not-processed-in-parallel-tp26457.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Yes, the partition IDs are the same. As far as the failure / subclassing goes, you may want to keep an eye on https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the suggestions in there will end up going anywhere. On Fri, Sep 25, 2015 at 3:01 PM, Neelesh wrote:

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks. Ill keep an eye on this. Our implementation of the DStream basically accepts a function to compute current offsets. The implementation of the function fetches list of topics from zookeeper once in while. It then adds consumer offsets for newly added topics with the currentOffsets thats in

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks Petr, Cody. This is a reasonable place to start for me. What I'm trying to achieve stream.foreachRDD {rdd=> rdd.foreachPartition { p=> Try(myFunc(...)) match { case Sucess(s) => updatewatermark for this partition //of course, expectation is that it will work only if

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
For the 1-1 mapping case, can I use TaskContext.get().partitionId as an index in to the offset ranges? For the failure case, yes, I'm subclassing of DirectKafkaInputDStream. As for failures, different partitions in the same batch may be talking to different RDBMS servers due to multitenancy - a

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Your success case will work fine, it is a 1-1 mapping as you said. To handle failures in exactly the way you describe, you'd need to subclass or modify DirectKafkaInputDStream and change the way compute() works. Unless you really are going to have very fine-grained failures (why would only a

Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Hi, We are using DirectKafkaInputDStream and store completed consumer offsets in Kafka (0.8.2). However, some of our use case require that offsets be not written if processing of a partition fails with certain exceptions. This allows us to build various backoff strategies for that partition,

Re: Kafka & Spark Streaming

2015-09-25 Thread Petr Novak
You can have offsetRanges on workers f.e. object Something { var offsetRanges = Array[OffsetRange]() def create[F : ClassTag](stream: InputDStream[Array[Byte]]) (implicit codec: Codec[F]): DStream[F] = { stream transform { rdd => offsetRanges =

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers also has an example of how to close over the offset ranges so they are available on executors. On Fri, Sep 25, 2015 at 12:50 PM, Neelesh wrote: > Hi, >We are
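
The linked section also shows how to read the ranges back on executors; a sketch adapted to this thread's retry-per-partition use case (0.8 direct stream; the per-record work and the external offset store are stand-ins):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
    import scala.util.{Failure, Success, Try}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges   // driver side
      rdd.foreachPartition { records =>
        // executor side: the closure carries offsetRanges, and partition ids line up 1-1
        val range: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        Try(records.foreach(record => ())) match {                        // stand-in for myFunc
          case Success(_) =>
            // persist range.topic / range.partition / range.untilOffset to the offset store here
          case Failure(_) =>
            // skip the update so this partition's watermark stays put and can be retried later
        }
      }
    }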

Re: Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
er ssc.start() ssc.awaitTermination() } } Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-see-my-kafka-spark-streaming-output-tp24750p24751.html Sent from the Apache Spark User List mailing list archive at Nabble.

Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
awaitTermination() } } Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-see-my-kafka-spark-streaming-output-tp24750.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
, Bartek Radziszewski bar...@scalaric.com wrote: Hey, I’m trying to run kafka spark streaming using mesos with following example: *sc.stop* *import org.apache.spark.SparkConf* *import org.apache.spark.SparkContext._* *import kafka.serializer.StringDecoder* *import org.apache.spark.streaming

kafka spark streaming with mesos

2015-06-23 Thread Bartek Radziszewski
Hey, I’m trying to run kafka spark streaming using mesos with following example: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.StringDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster(local) set it to local[2] or local[*] Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com wrote: hi, I'm trying to run simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import
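
The reason behind that advice: if the example uses the receiver-based createStream, the receiver itself occupies one core, so master "local" leaves nothing free to process the batches. A one-line sketch:

    val conf = new org.apache.spark.SparkConf().setAppName("kafka-shell-test").setMaster("local[2]")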

kafka spark streaming working example

2015-06-18 Thread Bartek Radziszewski
hi, I'm trying to run simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.DefaultDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-10 Thread karma243
. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23240.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-10 Thread Dibyendu Bhattacharya
on different machine. All the three cases were working properly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23240.html Sent from the Apache Spark User List mailing list archive

Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread karma243
at [akka.tcp://sparkMaster@localhost:7077] inbound addresses are [akka.tcp://sparkMaster@karma-HP-Pavilion-g6-Notebook-PC:7077] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228.html Sent from

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread nsalian
1) Could you share your command? 2) Are the kafka brokers on the same host? 3) Could you run a --describe on the topic to see if the topic is setup correctly (just to be sure)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR

Re: kafka + Spark Streaming with checkPointing fails to start with

2015-05-15 Thread Alexander Krasheninnikov
I had same problem. The solution, I've found was to use: JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate('checkpoint_dir', contextFactory); ALL configuration should be performed inside contextFactory. If you try to configure streamContext after ::getOrCreate, you
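
The same pattern sketched in Scala (checkpoint path is illustrative): everything, including the Kafka stream and output operations, is defined inside the factory, because on restart the context is rebuilt from checkpoint data and the factory is not invoked.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/checkpoint_dir"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-app"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define the Kafka input stream and all transformations/output ops here
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()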

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread NB
but everything is recreated from the checkpointed data. Hope this helps, NB -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864p22878.html Sent from the Apache Spark User List mailing list archive

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread Ankur Chauhan
Thanks everyone, that was the problem. the create new streaming context function was supposed to setup the stream processing as well as the checkpoint directory. I had missed the whole process of checkpoint setup. With that done, everything works as

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread main org.apache.spark.SparkException:

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread ankurcha
.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread main org.apache.spark.SparkException:

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Cody Koeninger
be happening here. --- Ankur Chauhan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Kafka + Spark streaming

2014-12-31 Thread Samya Maiti
the Receiver for Kafka, Will the receiver be restarted on some other worker? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
block interval will go in the same block. 2. If a worker goes down which runs the Receiver for Kafka, Will the receiver be restarted on some other worker? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html Sent

Re: Kafka + Spark streaming

2014-12-30 Thread Tathagata Das
in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr

Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi, I am integrating Kafka and Spark, using spark-streaming. I have created a topic as a kafka producer: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test I am publishing messages in kafka and trying to read them using

Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
It says: 14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory A quick guess would be, you are giving the wrong master url. ( spark:// 192.168.88.130:7077 ) Open the

RE: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
...@sigmoidanalytics.com] Sent: Monday, December 01, 2014 3:56 PM To: Sarosh, M. Cc: user@spark.apache.org Subject: Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks It says: 14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster
