Hi Cody,

What if the offsets that are tracked are not present in Kafka? How do I skip those offsets and go to the next offset? Also, would specifying rebalance.backoff.ms be of any help?
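One way to "skip" offsets that Kafka no longer has is to clamp your tracked offsets into the range Kafka actually retains before starting the job. A minimal sketch of that logic in Python (the real job would use the Spark/Scala API; the partition/offset dicts here are hypothetical stand-ins for whatever OffsetRange metadata you track, and `earliest`/`latest` would come from querying the brokers):

```python
def clamp_offsets(tracked, earliest, latest):
    """For each partition, lift a tracked offset that has aged out of Kafka
    (fallen below the earliest retained offset) up to the earliest available
    one, and cap anything beyond the log end at the latest offset."""
    corrected = {}
    for partition, offset in tracked.items():
        corrected[partition] = min(max(offset, earliest[partition]),
                                   latest[partition])
    return corrected

# Example: partition 0's tracked offset 100 aged out (earliest is now 150),
# partition 1's tracked offset 5 is still valid.
print(clamp_offsets({0: 100, 1: 5}, {0: 150, 1: 0}, {0: 900, 1: 400}))
```

The corrected map would then be passed as the starting offsets when the direct stream is created, so the job resumes from data Kafka actually has instead of failing on a missing offset.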
Thanks,
Swetha

On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org> wrote:

> To be blunt, if you care about being able to recover from weird
> situations, you should be tracking offsets yourself and specifying offsets
> on job start, not relying on checkpoints.
>
> On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> I’ve seen this before during an extreme outage on the cluster, where the
>> kafka offsets checkpointed by the directstreamRdd were bigger than what
>> kafka reported. The checkpoint was therefore corrupted.
>> I don’t know the root cause, but since I was stressing the cluster during
>> a reliability test I can only assume that one of the Kafka partitions was
>> restored from an out-of-sync replica and did not contain all the data.
>> Seems extreme, but I don’t have another idea.
>>
>> @Cody – do you know of a way to recover from a situation like this? Can
>> someone manually delete folders from the checkpoint folder to help the job
>> recover? E.g. go 2 steps back, hoping that Kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeninger
>> Cc: "user@spark.apache.org"
>> Subject: Re: Kafka Direct does not recover automatically when the Kafka
>> Stream gets messed up?
>>
>> OK. But one thing that I observed is that when there is a problem with
>> the Kafka stream, the streaming job does not restart unless I delete the
>> checkpoint directory. I guess it tries to retry the failed tasks and, if
>> it's not able to recover, it fails again. Sometimes it fails with a
>> StackOverflowError.
>>
>> Why does the streaming job not restart from the checkpoint directory when
>> the job failed earlier with the Kafka brokers getting messed up? We have
>> the checkpoint directory in our HDFS.
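Cody's advice above, tracking offsets yourself and specifying them on job start, boils down to persisting the per-partition offsets after each batch and reading them back at startup. In practice people keep these in ZooKeeper, HBase, or a database and pass them as the starting offsets when creating the direct stream; the file-based store below is only an illustration of the save/load pattern (the path and JSON format are assumptions, not anything from the thread):

```python
import json
import os
import tempfile

def save_offsets(path, offsets):
    """Persist {partition: offset} after each successfully processed batch.
    Write to a temp file and rename, so a crash mid-write never leaves a
    corrupt offsets file behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)

def load_offsets(path, default=None):
    """Read offsets at job start; on a first run (no file yet) fall back
    to the caller's default instead of relying on a checkpoint."""
    if not os.path.exists(path):
        return dict(default or {})
    with open(path) as f:
        # JSON object keys are strings; convert back to int partition ids.
        return {int(p): o for p, o in json.load(f).items()}
```

Because the offsets live outside the checkpoint, a corrupted checkpoint directory can be thrown away without losing track of where the job was in each partition.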
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart
>>> the streaming job. You should stop the Spark context, or at the very
>>> least kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe, as we don't have control over the Kafka stream
>>>> as of now. Can setting rebalance max retries help? We do not have any
>>>> monitors set up as of now. We need to set up the monitors.
>>>>
>>>> My idea is to have some kind of cron job that queries the Streaming
>>>> API for monitoring, say every 5 minutes, and then sends an email alert
>>>> and automatically restarts the streaming job by deleting the checkpoint
>>>> directory. Would that help?
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the
>>>>> kafka broker. Spark will retry failed tasks automatically, which should
>>>>> handle broker rebalances that happen in a timely fashion.
>>>>> spark.task.maxFailures controls the maximum number of retries before
>>>>> failing the job. The direct stream isn't any different from any other
>>>>> spark task in that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question
>>>>> for your particular infrastructure and what you're already using for
>>>>> monitoring. We put all metrics (application level or system level) into
>>>>> graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling
>>>>> over for half an hour, I'd look at fixing that before worrying about
>>>>> spark monitoring...
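The cron-driven monitor described above, whatever endpoint or metrics it polls, mostly reduces to a stall check: has any batch completed recently enough? A small sketch of that decision logic (the list of completion timestamps and the threshold are assumptions, not part of any Spark API; and per Cody's note, the recovery action should be restarting the driver cleanly, not deleting the checkpoint directory):

```python
def check_health(batch_completions_ms, now_ms, batch_interval_ms,
                 max_missed=5):
    """Decide whether the streaming job looks stalled.

    batch_completions_ms: completion timestamps (epoch ms) of recent
    batches, as reported by whatever monitoring source the cron job polls.
    Returns (healthy, reason); an unhealthy result would trigger the
    alert email and a driver restart.
    """
    if not batch_completions_ms:
        return False, "no completed batches reported"
    silence = now_ms - max(batch_completions_ms)
    if silence > max_missed * batch_interval_ms:
        return False, "no batch completed in %d ms" % silence
    return True, "ok"
```

With a 1-second batch interval and `max_missed=5`, five seconds of silence marks the job unhealthy; the cron job would then alert and restart the driver process.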
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I recover Kafka Direct automatically when there is a problem
>>>>>> with the Kafka brokers? Sometimes our Kafka brokers get messed up and
>>>>>> the entire streaming job blows up, unlike some other consumers which
>>>>>> do recover automatically. How can I make sure that Kafka Direct
>>>>>> recovers automatically when a broker fails for some time, say 30
>>>>>> minutes? What kind of monitors should be in place to recover the job?
>>>>>>
>>>>>> Thanks,
>>>>>> Swetha
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org