Hi Cody,

What if the offsets that are tracked are not present in Kafka? How do I skip those offsets and go to the next offset? Also, would specifying rebalance.backoff.ms be of any help?
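One way to "skip" offsets that Kafka no longer has is to clamp your tracked offsets into the range Kafka actually retains before starting the job. A minimal sketch of that logic in Python (the real job would use the Spark/Scala API; the partition/offset dicts here are hypothetical stand-ins for whatever OffsetRange metadata you track, and `earliest`/`latest` would come from querying the brokers):

```python
def clamp_offsets(tracked, earliest, latest):
    """For each partition, lift a tracked offset that has aged out of Kafka
    (fallen below the earliest retained offset) up to the earliest available
    one, and cap anything beyond the log end at the latest offset."""
    corrected = {}
    for partition, offset in tracked.items():
        corrected[partition] = min(max(offset, earliest[partition]),
                                   latest[partition])
    return corrected

# Example: partition 0's tracked offset 100 aged out (earliest is now 150),
# partition 1's tracked offset 5 is still valid.
print(clamp_offsets({0: 100, 1: 5}, {0: 150, 1: 0}, {0: 900, 1: 400}))
```

The corrected map would then be passed as the starting offsets when the direct stream is created, so the job resumes from data Kafka actually has instead of failing on a missing offset.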
Thanks,
Swetha

On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org> wrote:

> To be blunt, if you care about being able to recover from weird
> situations, you should be tracking offsets yourself and specifying offsets
> on job start, not relying on checkpoints.
>
> On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> I’ve seen this before during an extreme outage on the cluster, where the
>> kafka offsets checkpointed by the directstreamRdd were bigger than what
>> kafka reported. The checkpoint was therefore corrupted.
>> I don’t know the root cause, but since I was stressing the cluster during
>> a reliability test I can only assume that one of the Kafka partitions was
>> restored from an out-of-sync replica and did not contain all the data.
>> Seems extreme, but I don’t have another idea.
>>
>> @Cody – do you know of a way to recover from a situation like this? Can
>> someone manually delete folders from the checkpoint folder to help the job
>> recover? E.g. go 2 steps back, hoping that Kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeninger
>> Cc: "user@spark.apache.org"
>> Subject: Re: Kafka Direct does not recover automatically when the Kafka
>> Stream gets messed up?
>>
>> OK. But one thing that I observed is that when there is a problem with
>> the Kafka stream, the streaming job does not restart unless I delete the
>> checkpoint directory. I guess it tries to retry the failed tasks and, if
>> it's not able to recover, it fails again. Sometimes it fails with a
>> StackOverflowError.
>>
>> Why does the streaming job not restart from the checkpoint directory when
>> the job failed earlier with the Kafka brokers getting messed up? We have
>> the checkpoint directory in our HDFS.
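Cody's advice above, tracking offsets yourself and specifying them on job start, boils down to persisting the per-partition offsets after each batch and reading them back at startup. In practice people keep these in ZooKeeper, HBase, or a database and pass them as the starting offsets when creating the direct stream; the file-based store below is only an illustration of the save/load pattern (the path and JSON format are assumptions, not anything from the thread):

```python
import json
import os
import tempfile

def save_offsets(path, offsets):
    """Persist {partition: offset} after each successfully processed batch.
    Write to a temp file and rename, so a crash mid-write never leaves a
    corrupt offsets file behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)

def load_offsets(path, default=None):
    """Read offsets at job start; on a first run (no file yet) fall back
    to the caller's default instead of relying on a checkpoint."""
    if not os.path.exists(path):
        return dict(default or {})
    with open(path) as f:
        # JSON object keys are strings; convert back to int partition ids.
        return {int(p): o for p, o in json.load(f).items()}
```

Because the offsets live outside the checkpoint, a corrupted checkpoint directory can be thrown away without losing track of where the job was in each partition.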
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart
>>> the streaming job. You should stop the Spark context, or at the very
>>> least kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe, as we don't have control over the Kafka stream
>>>> as of now. Can setting rebalance max retries help? We do not have any
>>>> monitors set up as of now. We need to set up the monitors.
>>>>
>>>> My idea is to have some kind of cron job that queries the Streaming
>>>> API for monitoring, say every 5 minutes, and then sends an email alert
>>>> and automatically restarts the streaming job by deleting the checkpoint
>>>> directory. Would that help?
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the
>>>>> kafka broker. Spark will retry failed tasks automatically, which should
>>>>> handle broker rebalances that happen in a timely fashion.
>>>>> spark.task.maxFailures controls the maximum number of retries before
>>>>> failing the job. The direct stream isn't any different from any other
>>>>> spark task in that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question
>>>>> for your particular infrastructure and what you're already using for
>>>>> monitoring. We put all metrics (application level or system level) into
>>>>> graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling
>>>>> over for half an hour, I'd look at fixing that before worrying about
>>>>> spark monitoring...
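The cron-driven monitor described above, whatever endpoint or metrics it polls, mostly reduces to a stall check: has any batch completed recently enough? A small sketch of that decision logic (the list of completion timestamps and the threshold are assumptions, not part of any Spark API; and per Cody's note, the recovery action should be restarting the driver cleanly, not deleting the checkpoint directory):

```python
def check_health(batch_completions_ms, now_ms, batch_interval_ms,
                 max_missed=5):
    """Decide whether the streaming job looks stalled.

    batch_completions_ms: completion timestamps (epoch ms) of recent
    batches, as reported by whatever monitoring source the cron job polls.
    Returns (healthy, reason); an unhealthy result would trigger the
    alert email and a driver restart.
    """
    if not batch_completions_ms:
        return False, "no completed batches reported"
    silence = now_ms - max(batch_completions_ms)
    if silence > max_missed * batch_interval_ms:
        return False, "no batch completed in %d ms" % silence
    return True, "ok"
```

With a 1-second batch interval and `max_missed=5`, five seconds of silence marks the job unhealthy; the cron job would then alert and restart the driver process.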
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I recover Kafka Direct automatically when there is a problem
>>>>>> with the Kafka brokers? Sometimes our Kafka brokers get messed up and
>>>>>> the entire streaming job blows up, unlike some other consumers which
>>>>>> do recover automatically. How can I make sure that Kafka Direct
>>>>>> recovers automatically when a broker fails for some time, say 30
>>>>>> minutes? What kind of monitors should be in place to recover the job?
>>>>>>
>>>>>> Thanks,
>>>>>> Swetha
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org