Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
;dataFrame.foreach(println) >} >else >{ > println("Empty DStream ") >}*/ > }) > > On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Take out the conditional and the sqlcontext and just do >>

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
Take out the conditional and the sqlcontext and just do rdd => { rdd.foreach(println) } as a baseline to see if you're reading the data you expect. On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi wrote: > Hi, > > I am reading json messages from kafka . Topics
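A minimal sketch of that baseline (assuming messages is the direct stream of (key, value) pairs; names are illustrative):

    // Baseline sanity check: no SQL, no conditionals, just print every record.
    messages.foreachRDD { rdd =>
      rdd.foreach(println)
    }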

Re: How to get recommand result for users in a kafka SparkStreaming Application

2016-08-03 Thread Cody Koeninger
MatrixFactorizationModel is serializable. Instantiate it on the driver, not on the executors. On Wed, Aug 3, 2016 at 2:01 AM, wrote: > hello guys: > I have an app which consumes json messages from kafka and recommend > movies for the users in those messages ,the
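A rough sketch of that split of responsibilities: the model is loaded once on the driver, and since scoring it triggers RDD operations, the user ids from each batch are collected back to the driver (parseUserId and the model path are hypothetical):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Load the trained model once, on the driver.
    val model = MatrixFactorizationModel.load(sc, "/path/to/model")

    messages.foreachRDD { rdd =>
      // Bring the batch's user ids back to the driver and score there.
      val userIds = rdd.map(record => parseUserId(record._2)).distinct().collect()
      userIds.foreach { uid =>
        model.recommendProducts(uid, 10).foreach(println)
      }
    }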

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
leq...@gmail.com> wrote: > How to do that? if I put the queue inside .transform operation, it doesn't > work. > > On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Can you keep a queue per executor in memory? >> >> On Mon

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
roblem in detail here: > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok > > Could you please give me some suggestions or advice to fix this problem? > > Thanks > > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote: >&

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
With most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering

Re: read only specific jsons

2016-07-27 Thread Cody Koeninger
> > clickDF = cDF.filter(cDF['request.clientIP'].isNotNull()) > > It fails for some cases and errors our with below message > > AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1, > nscClientIP2, uAgent;' > > > On Tue, Jul 26, 2016 at 12:05 PM,

Re: read only specific jsons

2016-07-26 Thread Cody Koeninger
Have you tried filtering out corrupt records with something along the lines of df.filter(df("_corrupt_record").isNull) On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote: > i am reading data from kafka using spark streaming. > > I am reading json and creating dataframe. > I
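As a sketch, assuming the default permissive JSON parsing so bad lines land in a _corrupt_record column (the column only shows up when corrupt records are actually present):

    messages.foreachRDD { rdd =>
      val df = sqlContext.read.json(rdd.map(_._2))
      // Keep only rows that parsed cleanly.
      val clean =
        if (df.columns.contains("_corrupt_record")) df.filter(df("_corrupt_record").isNull)
        else df
      clean.show()
    }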

Re: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Cody Koeninger
Can you go ahead and open a Jira ticket with that explanation? Is there a reason you need to use receivers instead of the direct stream? On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote: > Hi guys, > > I wrote a spark streaming program which consume 1000 messages from

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Cody Koeninger
This seems really low risk to me. In order to be impacted, it'd have to be someone who was using the kafka integration in spark 2.0, which isn't even officially released yet. On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian wrote: > Sorry, meant to ask if any Apache

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Cody Koeninger
For 2.0, the kafka dstream support is in two separate subprojects depending on which version of Kafka you are using: spark-streaming-kafka-0-10 or spark-streaming-kafka-0-8, corresponding to brokers that are version 0.10+ or 0.8+. On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
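In build terms that is just a choice of artifact; an sbt sketch (version numbers illustrative):

    // Brokers on 0.10+ (new consumer API):
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"

    // Or, for 0.8+ brokers:
    // libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"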

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
s example. > Looks simple but its not very obvious how it works :-) > I'll watch out for the docs and ScalaDoc. > > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> No, restarting from a checkpoint won't do it, you

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
-0.10 On Fri, Jul 22, 2016 at 1:05 PM, Srikanth <srikanth...@gmail.com> wrote: > In Spark 1.x, if we restart from a checkpoint, will it read from new > partitions? > > If you can, pls point us to some doc/link that talks about Kafka 0.10 integ > in Spark 2.0. > > On Fri, Ju

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
For the integration for kafka 0.8, you are literally starting a streaming job against a fixed set of topicpartitions. That set will not change throughout the job, so you'll need to restart the spark job if you change kafka partitions. For the integration for kafka 0.10 / spark 2.0, if you use

Re: Latest 200 messages per topic

2016-07-20 Thread Cody Koeninger
the data per key sorted with timestamp , I will always > get the latest 4 ts data on get(key). Spark streaming will get the ID from > Kafka, then read the data from HBASE using get(ID). This will eliminate > usage of Windowing from Spark-Streaming . Is it good to use ? > > Regards,

Re: Latest 200 messages per topic

2016-07-19 Thread Cody Koeninger
Unless you're using only 1 partition per topic, there's no reasonable way of doing this. Offsets for one topicpartition do not necessarily have anything to do with offsets for another topicpartition. You could do the last (200 / number of partitions) messages per topicpartition, but you have no
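A sketch of that per-topicpartition arithmetic against the 0.8 batch API, assuming you already know the latest offset for each partition from your own bookkeeping (broker address, topic names and offsets are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // Latest available offset per (topic, partition), obtained elsewhere.
    val latestOffsets = Map(("mytopic", 0) -> 5000L, ("mytopic", 1) -> 4800L)
    val perPartition = 200L / latestOffsets.size  // last 200 messages spread across partitions

    val offsetRanges = latestOffsets.map { case ((topic, partition), until) =>
      OffsetRange(topic, partition, math.max(0L, until - perPartition), until)
    }.toArray

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)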

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Cody Koeninger
Yes, if you need more parallelism, you need to either add more kafka partitions or shuffle in spark. Do you actually need the dataframe api, or are you just using it as a way to infer the json schema? Inferring the schema is going to require reading through the RDD once before doing any other

Re: Complications with saving Kafka offsets?

2016-07-15 Thread Cody Koeninger
The bottom line short answer for this is that if you actually care about data integrity, you need to store your offsets transactionally alongside your results in the same data store. If you're ok with double-counting in the event of failures, saving offsets _after_ saving your results, using
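A rough sketch of the transactional variant, committing results and offsets in the same JDBC transaction (table and column names are hypothetical; on failure nothing is committed, so the batch just gets reprocessed from the previously stored offsets):

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts = rdd.map(_._2).countByValue()  // whatever your per-batch result is

      val conn = DriverManager.getConnection("jdbc:postgresql://db/mydb", "user", "pass")
      try {
        conn.setAutoCommit(false)
        val insertResult = conn.prepareStatement("insert into word_counts (word, cnt) values (?, ?)")
        counts.foreach { case (word, cnt) =>
          insertResult.setString(1, word); insertResult.setLong(2, cnt); insertResult.executeUpdate()
        }
        val saveOffset = conn.prepareStatement(
          "update stream_offsets set until_offset = ? where topic = ? and kafka_partition = ?")
        offsetRanges.foreach { o =>
          saveOffset.setLong(1, o.untilOffset); saveOffset.setString(2, o.topic); saveOffset.setInt(3, o.partition)
          saveOffset.executeUpdate()
        }
        conn.commit()  // results and offsets land together or not at all
      } finally {
        conn.close()
      }
    }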

Re: Spark Streaming - Direct Approach

2016-07-15 Thread Cody Koeninger
We've been running direct stream jobs in production for over a year, with uptimes in the range of months. I'm pretty slammed with work right now, but when I get time to submit a PR for the 0.10 docs i'll remove the experimental note from 0.8 On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das

Re: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Cody Koeninger
Maybe obvious, but what happens when you change the s3 write to a println of all the data? That should identify whether the write itself is the issue. count() and read.json() will involve additional tasks (run through the items in the rdd to count them, likewise to infer the schema) but for 300 records that

Re: Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Cody Koeninger
Yeah, it's a reasonable lowest common denominator between java and scala, and what's passed to that convenience constructor is actually what's used to construct the class. FWIW, in the 0.10 direct stream api when there's unavoidable wrapping / conversion anyway (since the underlying class takes a

Re: Memory grows exponentially

2016-07-08 Thread Cody Koeninger
Just as an offhand guess, are you doing something like updateStateByKey without expiring old keys? On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote: > Memory fragmentation? Quiet common with in-memory systems. > >> On 08 Jul 2016, at 08:56, aasish.kumar
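If that is the case, one hedged sketch of expiring idle keys: the state here is a (count, lastSeenMillis) pair on a DStream of (String, Long), and the idle threshold is illustrative:

    val maxIdleMs = 30 * 60 * 1000L

    val counts = pairs.updateStateByKey[(Long, Long)] { (newValues: Seq[Long], state: Option[(Long, Long)]) =>
      val now = System.currentTimeMillis
      if (newValues.nonEmpty) {
        Some((state.map(_._1).getOrElse(0L) + newValues.sum, now))
      } else if (state.exists(now - _._2 > maxIdleMs)) {
        None  // returning None drops the key, so state doesn't grow without bound
      } else {
        state
      }
    }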

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
In any case thanks, now I understand how to use Spark. > > PS: I will continue work with Spark but to minimize emails stream I plan to > unsubscribe from this mail list > > 2016-07-06 18:55 GMT+02:00 Cody Koeninger <c...@koeninger.org>: >> >> If you aren't proce

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
s ok. If there are peaks of loading more than possibility of > computational system or data dependent time of calculation, Spark is not > able to provide a periodically stable results output. Sometimes this is > appropriate but sometime this is not appropriate. > > 2016-0

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
, it works but throughput is much less than without limitations > because of this is an absolute upper limit. And time of processing is half > of available. > > Regarding Spark 2.0 structured streaming I will look it some later. Now I > don't know how to strictly measure th

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
t; that Flink will strictly terminate processing of messages by time. Deviation > of the time window from 10 seconds to several minutes is impossible. > > PS: I prepared this example to make possible easy observe the problem and > fix it if it is a bug. For me it is obvious. May I ask y

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
g speed in Spark's app is near to speed of data generation all > is ok. > I added delayFactor in > https://github.com/rssdev10/spark-kafka-streaming/blob/master/src/main/java/SparkStreamingConsumer.java > to emulate slow processing. And streaming process is in degradation. When &g

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
. But in any case I need first response after 10 second. Not minutes > or hours after. > > Thanks. > > > > 2016-07-05 17:12 GMT+02:00 Cody Koeninger <c...@koeninger.org>: >> >> If you're talking about limiting the number of messages per batch to &g

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to try and keep from exceeding batch time, see http://spark.apache.org/docs/latest/configuration.html and look for backpressure and maxRatePerPartition. But if you're only seeing zeros after your job runs for a minute, it sounds like
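Those settings go on the SparkConf (or as --conf arguments to spark-submit); the values here are only illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")       // adapt ingest rate to processing speed
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // hard cap, messages per second per partition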

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Cody Koeninger
If it's a batch job, don't use a stream. You have to store the offsets reliably somewhere regardless. So it sounds like your only issue is with identifying offsets per partition? Look at KafkaCluster.scala, methods getEarliestLeaderOffsets / getLatestLeaderOffsets. On Tue, Jul 5, 2016 at 7:40

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
ile an issue? >> >> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams <disc...@uw.edu> >> wrote: >>> Thanks @Cody, I will try that out. In the interm, I tried to validate >>> my Hbase cluster by running a random write test and see 30-40K writes >>&g

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem. You should not have to include the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10 already has a transitive dependency on it. That being said, 0.8.2.1 is the correct version, so that's a little strange. How are you building and submitting your

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part. To the original poster, have you read / watched the materials linked

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
ve written. > > On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams <disc...@uw.edu> > wrote: >> I'll try dropping the maxRatePerPartition=400, or maybe even lower. >> However even at application starts up I have this large scheduling >> delay. I will report my p

Re: Number of consumers in Kafka with Spark Streaming

2016-06-21 Thread Cody Koeninger
If you're using the direct stream, and don't have speculative execution turned on, there is one executor consumer created per partition, plus a driver consumer for getting the latest offsets. If you have fewer executors than partitions, not all of those consumers will be running at the same time.

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
>>>> >> >>>> >> --conf spark.eventLog.overwrite=true \ >>>> >> >>>> >> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ >>>> >> >>>> >> --conf spark.streaming.backpres

Re: choice of RDD function

2016-06-15 Thread Cody Koeninger
Doesn't that result in consuming each RDD twice, in order to infer the json schema? On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S wrote: > Of course :) > > object sparkStreaming { > def main(args: Array[String]) { > StreamingExamples.setStreamingLogLevels() //Set

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Cody Koeninger
I haven't done any significant work on using structured streaming with kafka, there's a jira ticket for tracking purposes https://issues.apache.org/jira/browse/SPARK-15406 On Tue, Jun 14, 2016 at 9:21 AM, andy petrella wrote: > Heya folks, > > Just wondering if there

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
e - when leader is > shifted, for example, it does not appear that direct stream reader correctly > handles this. We're running 1.6.1. > > Bryan Jeffrey > > On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> htt

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html spark.streaming.kafka.maxRetries spark.task.maxFailures On Mon, Jun 13, 2016 at 8:25 AM, Bryan Jeffrey wrote: > All, > > We're running a Spark job that is consuming data from a large Kafka cluster > using the

Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Cody Koeninger
Why are you wanting to expose spark over jdbc as opposed to just inserting the records from kafka into a jdbc compatible data store? On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind wrote: > Hi Experts, > > We are trying to get a kafka stream ingested in Spark and expose the

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
> LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 31 May 2016 at 15:34, Cody Koeninger <c...@koeninger.org> wrote: >> >> There isn't a magic spark co

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
There isn't a magic spark configuration setting that would account for multiple-second-long fixed overheads, you should be looking at maybe 200ms minimum for a streaming batch. 1024 kafka topicpartitions is not reasonable for the volume you're talking about. Unless you have really extreme

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
t; > >> On May 26, 2016, at 11:04 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> I wouldn't expect kerberos to work with anything earlier than the beta >> consumer for kafka 0.10 >> >>> On Wed, May 25, 2016 at 9:41 PM, Mail.com <pradeep.mi...@m

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Cody Koeninger
Honestly given this thread, and the stack overflow thread, I'd say you need to back up, start very simply, and learn spark. If for some reason the official docs aren't doing it for you, Learning Spark from O'Reilly is a good book. Given your specific question, why not just messages.foreachRDD {
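That is, a single, flat foreachRDD with the dataframe work inside it; a sketch assuming the stream's values are JSON strings:

    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd.map(_._2))
        df.show()
      }
    }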

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
I wouldn't expect kerberos to work with anything earlier than the beta consumer for kafka 0.10 On Wed, May 25, 2016 at 9:41 PM, Mail.com wrote: > Hi All, > > I am connecting Spark 1.6 streaming to Kafka 0.8.2 with Kerberos. I ran > spark streaming in debug mode, but do

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
ver how can i pass this > offset to Spark Streaming job using that offset? ( using Direct Approach) > > On May 25, 2016 9:42 AM, "Cody Koeninger" <c...@koeninger.org> wrote: >> >> Kafka does not yet have meaningful time indexing, there's a kafka >> improvem

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing, there's a kafka improvement proposal for it but it has gotten pushed back to at least 0.10.1 If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien

Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be 0.8.2.1) before trying anything else, even if it may be unrelated to the actual error. Definitely don't upgrade your brokers to 0.9 On Wed, May 25, 2016 at 2:30 AM, Scott W wrote: > I'm running into below error

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Cody Koeninger
Am I reading this correctly that you're calling messages.foreachRDD inside of the messages.foreachRDD block? Don't do that. On Wed, May 25, 2016 at 8:59 AM, Alonso wrote: > Hi, i am receiving this exception when direct spark streaming process > tries to pull data from kafka

Re: Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread Cody Koeninger
Have you looked at everything linked from https://github.com/koeninger/kafka-exactly-once On Tue, May 24, 2016 at 2:07 PM, sagarcasual . wrote: > In spark streaming consuming kafka using KafkaUtils.createDirectStream, > there are examples of the kafka offset level

Re: Couldn't find leader offsets

2016-05-19 Thread Cody Koeninger
Looks like a networking issue to me. Make sure you can connect to the broker on the specified host and port from the spark driver (and the executors too, for that matter) On Wed, May 18, 2016 at 4:04 PM, samsayiam wrote: > I have seen questions posted about this on SO and

Re: Does Structured Streaming support Kafka as data source?

2016-05-19 Thread Cody Koeninger
I went ahead and created https://issues.apache.org/jira/browse/SPARK-15406 to track this On Wed, May 18, 2016 at 9:55 PM, Todd wrote: > Hi, > I brief the spark code, and it looks that structured streaming doesn't > support kafka as data source yet?

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Cody Koeninger
Have you checked to make sure you can receive messages just using a byte array for value? On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman wrote: > I am trying to consume AVRO formatted message through > KafkaUtils.createDirectStream. I followed the listed

Re: Kafka partition increased while Spark Streaming is running

2016-05-13 Thread Cody Koeninger
? > > Thanks, > Chandan > > > On Wed, Feb 24, 2016 at 9:30 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> That's correct, when you create a direct stream, you specify the >> topicpartitions you want to be a part of the stream (the other method for >&

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
control over kafka topics and partitions, they are a central > system used by many other systems as well. > > Regards, > Chandan > > On Tue, May 10, 2016 at 8:01 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> maxRate is not used by the direct stream. >

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
maxRate is not used by the direct stream. Significant skew in rate across different partitions for the same topic is going to cause you all kinds of problems, not just with spark streaming. You can turn on backpressure, but you're better off addressing the underlying issue if you can. On Tue,

Re: createDirectStream with offsets

2016-05-06 Thread Cody Koeninger
Look carefully at the error message, the types you're passing in don't match. For instance, you're passing in a message handler that returns a tuple, but the rdd return type you're specifying (the 5th type argument) is just String. On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
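For instance, if the messageHandler returns a (key, value) tuple, the 5th type argument has to be that same tuple type; a sketch against the 0.8 integration (topic, offsets and params are illustrative):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 1234L)

    // Type args: K, V, key decoder, value decoder, R -- where R matches the handler's return type.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))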

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Cody Koeninger
Yeah, so that means the driver talked to kafka and kafka told it the highest available offset was 2723431. Then when the executor tried to consume messages, it stopped getting messages before reaching that offset. That almost certainly means something's wrong with Kafka, have you looked at your

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
aveAsTextFile(hdfs_path+"/eventlogs/"+getTimeFormatToFile()) > } > }) > ssc.start() > ssc.awaitTermination() >} >def getTimeFormatToFile(): String = { > val dateFormat =new SimpleDateFormat("_MM_dd_HH_mm_ss") >val dt = new Date() > val cg= new G

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
That's not much information to go on. Any relevant code sample or log messages? On Thu, May 5, 2016 at 11:18 AM, Jerry wrote: > Hi, > > Does anybody give me an idea why the data is lost at the Kafka Consumer > side? I use Kafka 0.8.2 and Spark (streaming) version is

Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Cody Koeninger
Kafka 0.8.2 should be fine. If it works on your laptop but not on CDH, as Sean said you'll probably get better help on CDH forums. On Wed, May 4, 2016 at 4:19 AM, Michel Hubert wrote: > We're running Kafka 0.8.2.2 > Is that the problem, why? > > -Original

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
he number of partitions in Kafka > if it causes big performance issues. > > On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger <c...@koeninger.org> wrote: >> print() isn't really the best way to benchmark things, since it calls >> take(10) under the covers, but 380 records

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
issues with the 1.2 receivers, is this the expected way > to use the Kafka streaming API, or am I doing something terribly > wrong? > > My application looks like > https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 > > On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger <

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just deserialize)? Are you limited to using spark 1.2, or is upgrading possible? The kafka direct stream is available starting with 1.3. If you're stuck on 1.2, I believe there have been some attempts to backport it, search the

Re: kafka direct streaming python API fromOffsets

2016-05-02 Thread Cody Koeninger
If you're confused about the type of an argument, you're probably better off looking at documentation that includes static types: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$ createDirectStream's fromOffsets parameter takes a map from

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread Cody Koeninger
et Leader Lost Errors. >> >> Is refresh.leader.backoff.ms the right setting in the app for it to wait >> till the leader election and rebalance is done from the Kafka side assuming >> that Kafka has rebalance.backoff.ms of 2000 ? >> >> On Wed, Apr 27, 2016 at 11:0

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-27 Thread Cody Koeninger
Seems like it'd be better to look into the Kafka side of things to determine why you're losing leaders frequently, as opposed to trying to put a bandaid on it. On Wed, Apr 27, 2016 at 11:49 AM, SRK wrote: > Hi, > > We seem to be getting a lot of LeaderLostExceptions

Re: Kafka exception in Apache Spark

2016-04-26 Thread Cody Koeninger
That error indicates a message bigger than the buffer's capacity https://issues.apache.org/jira/browse/KAFKA-1196 On Tue, Apr 26, 2016 at 3:07 AM, Michel Hubert wrote: > Hi, > > > > > > I use a Kafka direct stream approach. > > My Spark application was running ok. > > This

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-25 Thread Cody Koeninger
, > [error] ^ > [error] one error found > [error] (compile:compileIncremental) Compilation failed > > Any ideas will be appreciated > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > >

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-25 Thread Cody Koeninger
I would suggest reading the documentation first. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.OffsetRange$ The OffsetRange class is not private. The instance constructor is private. You obtain instances by using the apply method on the companion
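So instances are built via the companion's apply, e.g. (values illustrative):

    import org.apache.spark.streaming.kafka.OffsetRange

    // topic, partition, fromOffset (inclusive), untilOffset (exclusive)
    val ranges = Array(
      OffsetRange("mytopic", 0, 100L, 200L),
      OffsetRange("mytopic", 1, 150L, 250L))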

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 22 April 2016 at 21:56, Cody Koeninger <c...@koeninger.org> wrote: >> >> You can sti

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
batching > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 22 April 2016 at 21:51, Cody Koeninger <c...@koeninger.org&g

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Why are you wanting to convert? As far as doing the conversion, createStream doesn't take the same arguments, look at the docs. On Fri, Apr 22, 2016 at 3:44 PM, Mich Talebzadeh wrote: > Hi, > > What is the best way of converting this program of that uses >

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
gt; > On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin <jasonnerot...@gmail.com> > wrote: > >> It the main concern uptime or disaster recovery? >> >> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> I think the bigger questi

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
rker 1.2 my_group P2*, P4* > DC 2 Master 2.1 > > Worker 2.1 my_group P3 > Worker 2.2 my_group P4 > > I would like to know if it's possible: > - using consumer group ? > - using direct approach ? I prefer this one as I don't want to activate > WAL. > > Hope the explana

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the kafka 0.10 consumer, which would allow for dynamic topicpartitions

Re: Spark replacing Hadoop

2016-04-14 Thread Cody Koeninger
I've been using spark for years and have (thankfully) been able to avoid needing HDFS, aside from one contract where it was already in use. At this point, many of the people I know would consider Kafka to be more important than HDFS. On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke

Re: how to deploy new code with checkpointing

2016-04-12 Thread Cody Koeninger
- Checkpointing alone isn't enough to get exactly-once semantics. Events will be replayed in case of failure. You must have idempotent output operations. - Another way to handle upgrades is to just start a second app with the new code, then stop the old one once everything's caught up. On Tue,
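A minimal sketch of what an idempotent output operation can look like: key each write on something unique and deterministic per record, so a replayed batch simply rewrites the same rows (Postgres-style upsert; table, column and connection details are hypothetical):

    import java.sql.DriverManager

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val conn = DriverManager.getConnection("jdbc:postgresql://db/mydb", "user", "pass")
        val stmt = conn.prepareStatement(
          "insert into events (event_id, payload) values (?, ?) on conflict (event_id) do nothing")
        partition.foreach { case (key, value) =>
          stmt.setString(1, key)    // unique, deterministic id per record
          stmt.setString(2, value)
          stmt.executeUpdate()
        }
        conn.close()
      }
    }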

Re: Spark error with checkpointing

2016-04-05 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables On Tue, Apr 5, 2016 at 3:51 PM, Akhilesh Pathodia wrote: > Hi, > > I am running spark jobs on yarn in cluster mode. The job reads the messages > from kafka

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
Stream(streamingContext, > rhes564:2181, rhes564:9092, newtopic 1) > > > > > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > >

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
It looks like you're using a plain socket stream to connect to a zookeeper port, which won't work. Look at spark.apache.org/docs/latest/streaming-kafka-integration.html On Fri, Apr 1, 2016 at 3:03 PM, Mich Talebzadeh wrote: > > Hi, > > I am just testing Spark

Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Cody Koeninger
Long story short, no. Don't rely on checkpoints if you can't handle reprocessing some of your data. On Thu, Mar 31, 2016 at 3:02 AM, Imre Nagi wrote: > I'm dont know how to read the data from the checkpoint. But AFAIK and based > on my experience, I think the best thing

Re: Direct Kafka input stream and window(…) function

2016-03-24 Thread Cody Koeninger
If this is related to https://issues.apache.org/jira/browse/SPARK-14105 , are you windowing before doing any transformations at all? Try using map to extract the data you care about before windowing. On Tue, Mar 22, 2016 at 12:24 PM, Cody Koeninger <c...@koeninger.org> wrote: > I defini
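Something along these lines, assuming the payload you care about is the message value (window sizes are illustrative):

    import org.apache.spark.streaming.Seconds

    // Extract what you need first, then window the already-mapped stream.
    val values = directStream.map(_._2)
    val windowed = values.window(Seconds(60), Seconds(10))
    windowed.foreachRDD { rdd =>
      println(s"windowed count: ${rdd.count()}")
    }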

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Cody Koeninger
If you want 1 minute granularity, why not use a 1 minute batch time? Also, HDFS is not a great match for this kind of thing, because of the small files issue. On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is data

Re: Direct Kafka input stream and window(…) function

2016-03-22 Thread Cody Koeninger
I definitely have direct stream jobs that use window() without problems... Can you post a minimal code example that reproduces the problem? Using print() will confuse the issue, since print() will try to only use the first partition. Use foreachRDD { rdd => rdd.foreach(println) } or something

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Cody Koeninger
oo sure if this is an issue with spark engine or with the > streaming module. Please let me know if you need more logs or you want me to > raise a github issue/JIRA. > > Sorry for digressing on the original thread. > > On Fri, Mar 18, 2016 at 8:10 PM, Cody Koeninger <c...@koeninger.org>

Re: How to collect data for some particular point in spark streaming

2016-03-21 Thread Cody Koeninger
Kafka doesn't have an accurate time-based index. Your options are to maintain an index yourself, or start at a sufficiently early offset and filter messages. On Mon, Mar 21, 2016 at 7:28 AM, Nagu Kothapalli wrote: > Hi, > > > I Want to collect data from kafka ( json
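A sketch of the second option, assuming each message carries a timestamp you can parse out (extractTimestamp and the cutoff are hypothetical):

    // Start the stream from a sufficiently early offset, then drop anything
    // older than the point in time you actually care about.
    val cutoffMs = 1458000000000L
    val recent = stream.filter { case (_, value) => extractTimestamp(value) >= cutoffMs }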

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
running I'm seeing it retry correctly. However, I am > having trouble getting the job started - number of retries does not seem to > help with startup behavior. > > Thoughts? > > Regards, > > Bryan Jeffrey > > On Fri, Mar 18, 2016 at 10:44 AM, Cody Koeninger <c...@ko

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
That's a networking error when the driver is attempting to contact leaders to get the latest available offsets. If it's a transient error, you can look at increasing the value of spark.streaming.kafka.maxRetries, see http://spark.apache.org/docs/latest/configuration.html If it's not a transient

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Cody Koeninger
Is that happening only at startup, or during processing? If that's happening during normal operation of the stream, you don't have enough resources to process the stream in time. There's not a clean way to deal with that situation, because it's a violation of preconditions. If you want to

Re: Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-19 Thread Cody Koeninger
There's 1 topic per partition, so you're probably better off dealing with topics that way rather than at the individual message level. http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers Look at the discussion of "HasOffsetRanges" If you
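A sketch of tagging records with their topic at the partition level; the direct stream's RDD partitions line up 1:1 with its offset ranges, so the cast has to happen on the original rdd inside transform:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val withTopic = stream.transform { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        val topic = offsetRanges(i).topic
        iter.map { case (_, value) => (topic, value) }
      }
    }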

Re: Problem running JavaDirectKafkaWordCount

2016-03-14 Thread Cody Koeninger
Sounds like the jar you built doesn't include the dependencies (in this case, the spark-streaming-kafka subproject). When you use spark-submit to submit a job to spark, you need to either specify all dependencies as additional --jars arguments (which is a pain), or build an uber-jar containing

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Cody Koeninger
s://github.com/guptamukul/sparktest.git > > ____ > From: Cody Koeninger <c...@koeninger.org> > Sent: 11 March 2016 23:04 > To: Mukul Gupta > Cc: user@spark.apache.org > Subject: Re: Kafka + Spark streaming, RDD partitions not processed in

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
s, String.class, > StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet); > > JavaDStream processed = messages.map(new Function<Tuple2<String, > String>, String>() { > > @Override > public String call(Tuple2<String, String> arg0) throws Exception {

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Can you post your actual code? On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta wrote: > Hi All, I was running the following test: Setup 9 VM runing spark workers > with 1 spark executor each. 1 VM running kafka and spark master. Spark > version is 1.6.0 Kafka version is

Re: Problem with union of DirectStream

2016-03-10 Thread Cody Koeninger
If you do any RDD transformation, it's going to return a different RDD than the original. The implication for casting to HasOffsetRanges is specifically called out in the docs at http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers On Thu,
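A sketch of the usual ordering, with the cast done on the stream's original RDD before any transformation:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // This rdd is still the underlying KafkaRDD, so the cast works here...
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ...whereas the result of a map() is a plain RDD that is NOT HasOffsetRanges.
      val lengths = rdd.map { case (_, value) => value.length }
      println(lengths.count() + " records, offsets: " + offsetRanges.mkString(","))
    }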

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
e relevance to what > you're talking about. > > Perhaps if both functions (the one with partitions arg and the one without) > returned just ConsumerRecord, I would like that more. > > - Alan > > On Tue, Mar 8, 2016 at 6:49 AM, Cody Koeninger <c...@koeninger.org> wrote: >

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
ory > > When I try ./bin/spark-shell --master local[2] > > I get: no such file or directory > Failed to find spark assembly, you need to build Spark before running this > program > > > > Sent from my iPhone > >> On 8 Mar 2016, at 21:50, Cody Koe

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
aster.sh" > > Thanks, > > Aida > Sent from my iPhone > >> On 8 Mar 2016, at 19:02, Cody Koeninger <c...@koeninger.org> wrote: >> >> You said you downloaded a prebuilt version. >> >> You shouldn't have to mess with maven or building spark at all
