[Spark Streaming][Spark SQL] Design suggestions needed for sessionization

2017-03-10 Thread Ramkumar Venkataraman
At a high level, I am looking to do sessionization. I want to combine events based on some key, do some transformations, and emit data to HDFS. The catch is that there are time boundaries: say I group events into windows of 0.5 hours, based on some timestamp key in the event. Typical event-time windowing…
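A minimal sketch of what fixed-window, event-time grouping could look like in Structured Streaming (assuming Spark 2.1+; the socket source, window width, topic format, and HDFS paths are placeholders, and true gap-based sessions would need something like mapGroupsWithState instead of fixed windows):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

case class Event(key: String, eventTime: Timestamp, payload: String)

object SessionizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sessionization-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder source: in practice this would be Kafka or another stream.
    // Input lines are assumed to look like "key,2017-03-10 12:00:00,payload".
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()
      .as[String]
      .map { line =>
        val Array(key, ts, payload) = line.split(",", 3)
        Event(key, Timestamp.valueOf(ts), payload)
      }

    // Group events per key into fixed 30-minute event-time windows; the
    // watermark bounds how late an event may arrive and still be counted.
    val sessions = events
      .withWatermark("eventTime", "30 minutes")
      .groupBy($"key", window($"eventTime", "30 minutes"))
      .count()

    // Emit each window to HDFS once the watermark passes it (append mode).
    sessions.writeStream
      .format("parquet")
      .option("path", "hdfs:///user/example/sessions")         // placeholder path
      .option("checkpointLocation", "hdfs:///user/example/ck") // placeholder path
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```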

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Ramkumar Venkataraman
Nope, but after we migrated to Spark 1.6, we haven't seen the errors yet. Not sure if it was fixed in between releases, or if it's just a weird timing thing that we haven't discovered in 1.6 as well.

[Spark Streaming] NoClassDefFoundError : StateSpec

2017-01-12 Thread Ramkumar Venkataraman
Spark: 1.6.1. I am trying to use the new mapWithState API and I am getting the following error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/StateSpec$ Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.StateSpec$ Build.sbt…
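A NoClassDefFoundError for StateSpec at runtime usually means the Spark version on the cluster predates 1.6 (StateSpec/mapWithState were added in 1.6) or the streaming dependency version doesn't match spark-core. A hedged Build.sbt sketch, assuming Spark 1.6.1 on both the build and the cluster:

```scala
// Build.sbt — minimal sketch; versions are assumptions matching the post.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // "provided" is only safe if the cluster's Spark is also 1.6.x; an older
  // runtime will not contain org.apache.spark.streaming.StateSpec.
  "org.apache.spark" %% "spark-core"      % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
)
```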

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Ramkumar Venkataraman
> Spark Streaming in general will retry a batch N times then move on to the next one... off the top of my head, I'm not sure why checkpointing would have an effect on that.
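For reference, the per-task retry limit alluded to here is configurable via spark.task.maxFailures (default 4). A minimal sketch; note that raising it only delays the crash rather than fixing the underlying offset problem:

```scala
import org.apache.spark.SparkConf

// Sketch: raise the task retry limit. The job still fails once a
// partition's offsets are permanently out of range on the broker.
val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  .set("spark.task.maxFailures", "8")
```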

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Ramkumar Venkataraman
> …that makes sense for you, start looking at handleFetchErr in KafkaRDD.scala. Recompiling that package isn't a big deal, because it's not part of the core Spark deployment, so you'll only have to change your job, not the deployed version of Spark.
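An illustrative sketch of the kind of change being suggested — not the actual Spark source — where an out-of-range fetch error clamps to the earliest retained offset instead of throwing (ErrorMapping comes from the old Kafka 0.8 client; the helper name is hypothetical):

```scala
import kafka.common.ErrorMapping

// Hypothetical helper: decide which offset to fetch from after an error.
// On OffsetOutOfRange, skip to the broker's earliest retained offset;
// rethrow anything else so the normal retry behaviour is preserved.
def resolveFetchOffset(errCode: Short, requested: Long, earliest: Long): Long =
  if (errCode == ErrorMapping.NoError) requested
  else if (errCode == ErrorMapping.OffsetOutOfRangeCode) earliest
  else throw ErrorMapping.exceptionFor(errCode)
```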

How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Ramkumar Venkataraman
I am using Spark Streaming and reading data from Kafka using KafkaUtils.createDirectStream. I have "auto.offset.reset" set to smallest. But on some Kafka partitions, I get kafka.common.OffsetOutOfRangeException and my Spark job crashes. I want to understand if there is a graceful way to…
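For context, a minimal sketch of the setup being described (Spark 1.x direct stream; broker list, topic, and batch interval are placeholders). Note that auto.offset.reset only applies when the stream has no offsets yet; it does not rescue a running job whose tracked offsets have aged out of Kafka's retention:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092", // placeholder broker
  "auto.offset.reset" -> "smallest"
)

// Direct stream: offsets are tracked by Spark itself, so if they fall
// outside the broker's retained range, tasks hit OffsetOutOfRangeException.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events") // placeholder topic
)

stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()
```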