Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
If your operations are idempotent, you should be able to just run a
totally separate job that looks for failed batches and builds a KafkaRDD
to reprocess them. C* probably isn't the first choice for what is
essentially a queue, but if the frequency of batches is relatively low it
probably doesn't matter.
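
Something like this, roughly (untested sketch against the 0-8 API; the
failed ranges would come out of your C* status table, one
OffsetRange(topic, partition, fromOffset, untilOffset) per partition per
failed batch, and `write` is whatever idempotent output your streaming
job already does):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// replay exactly the offset ranges recorded for the failed batches
def replay(sc: SparkContext,
           failedRanges: Array[OffsetRange],
           write: (String, String) => Unit): Unit = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // yours here
  // createRDD reads exactly the given ranges; no consumer group involved
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, failedRanges)
  // same idempotent write the streaming job does, e.g. to C*
  rdd.foreach { case (k, v) => write(k, v) }
  // on success, mark those ranges as reprocessed in the status table
}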

That is indeed a weird stack trace - did you check the driver logs to see
if there was something else preceding it?

On Wed, Dec 7, 2016 at 2:41 PM, map reduced  wrote:
>> Personally I think forcing the stream to fail (e.g. check offsets in
>> downstream store and throw exception if they aren't as expected) is
>> the safest thing to do.
>
>
> I would think so too, but for only 2-3 (sometimes just 1) failed batches
> in a whole day, I am trying not to kill the whole job and restart it.
>
> I am storing the offsets and success/failure per batch in a separate C*
> table - checkpointing was not an option since it does not survive an
> application jar change. Since I have access to the offsets, do you think
> #2 or some variation of it may work?
>
> Btw, some of those failures I mentioned are strange, for instance (Spark
> 2.0.0 and spark-streaming-kafka-0-8_2.11):
>
> [snip: stack trace and earlier messages, quoted in full in the messages below]

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
>
> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.


I would think so too, but for only 2-3 (sometimes just 1) failed batches
in a whole day, I am trying not to kill the whole job and restart it.

I am storing the offsets and success/failure per batch in a separate C*
table - checkpointing was not an option since it does not survive an
application jar change. Since I have access to the offsets, do you think
#2 or some variation of it may work?
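
A simplified sketch of that bookkeeping (`process` stands in for the
batch's real work and `saveBatchStatus` for the actual C* write - both
are placeholders, not real APIs):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

def withBookkeeping[T](stream: DStream[T],
                       process: RDD[T] => Unit,
                       saveBatchStatus: (Array[OffsetRange], String) => Unit): Unit =
  stream.foreachRDD { rdd =>
    // capture the batch's exact offset ranges before doing any work
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    try {
      process(rdd)
      saveBatchStatus(ranges, "SUCCESS")
    } catch {
      case e: Exception =>
        // swallowing the failure keeps the stream alive; the FAILED rows
        // are what a reprocessing job later picks up
        saveBatchStatus(ranges, "FAILED")
    }
  }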

Btw, some of those failures I mentioned are strange, for instance (Spark
2.0.0 and spark-streaming-kafka-0-8_2.11):

Job aborted due to stage failure: Task 173 in stage 92312.0 failed 10
times, most recent failure: Lost task 173.9 in stage 92312.0 (TID
27689025, 17.162.114.161): java.util.NoSuchElementException
  at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
  at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:102)
  at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:81)
  at com.yammer.metrics.core.Histogram.update(Histogram.java:110)
  at com.yammer.metrics.core.Timer.update(Timer.java:198)
  at com.yammer.metrics.core.Timer.update(Timer.java:76)
  at com.yammer.metrics.core.TimerContext.stop(TimerContext.java:31)
  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:36)
  at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
  at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
  at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
  at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
  at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
  at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:209)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)


On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger  wrote:

> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.
>
> If you proceed after a failure, you need a place to reliably record
> the batches that failed for later processing.
>
> On Wed, Dec 7, 2016 at 1:46 PM, map reduced  wrote:
> > [snip: original message, quoted in full below]


Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in
downstream store and throw exception if they aren't as expected) is
the safest thing to do.
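
In code that check might look roughly like this (untested sketch; the
expectedOffsetFor function is a stand-in for however you read the last
committed offset back out of the downstream store):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.HasOffsetRanges

// fail the batch if it doesn't start exactly where the store left off
def verifyOffsets[T](stream: DStream[T],
                     expectedOffsetFor: (String, Int) => Long): Unit =
  stream.foreachRDD { rdd =>
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { r =>
      val expected = expectedOffsetFor(r.topic, r.partition)
      if (r.fromOffset != expected)
        // throwing here is the point: better to stop the stream than to
        // silently skip or double-process a range
        throw new IllegalStateException(
          s"Offset mismatch for ${r.topic}-${r.partition}: batch starts " +
          s"at ${r.fromOffset} but the store expects $expected")
    }
  }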

If you proceed after a failure, you need a place to reliably record
the batches that failed for later processing.

On Wed, Dec 7, 2016 at 1:46 PM, map reduced  wrote:
> [snip: original message, quoted in full below]




Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
Hi,

I am trying to solve this problem: in my streaming flow, a few jobs fail
every day for a few batches, due to (mostly unavoidable) reasons such as
Kafka cluster maintenance, and the stream then resumes successfully.
I want to reprocess those failed jobs programmatically (assume I have a
way of getting the start/end Kafka offsets for the failed jobs). I was
thinking of these options:
1) Somehow pause the streaming job when it detects failing jobs - this
does not seem possible.
2) From the driver, run an additional check every few minutes (using the
driver REST API, /api/v1/applications...) to see which jobs have failed,
and submit batch jobs for those failed jobs.

1 - doesn't seem to be possible, and I don't want to kill the streaming
context just to pause the job for a few failing batches and resume it a
few minutes later.
2 - seems like a viable option, but a little complicated, since even the
batch job can fail for whatever reason, and then I am back to tracking
that separately etc.
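
A rough sketch of what 2 might look like (the /jobs endpoint and its
"status" field are from the standard monitoring REST API; the JSON
parsing and the actual resubmission are stubbed out):

import scala.io.Source

// poll the driver's monitoring REST API for failed jobs every few minutes
def watchForFailedJobs(driverUiUrl: String, appId: String): Unit =
  while (true) {
    val json =
      Source.fromURL(s"$driverUiUrl/api/v1/applications/$appId/jobs").mkString
    // parse json, keep entries whose "status" is "FAILED", map each one
    // back to its batch's offset range via the C* table, and submit a
    // batch job for that range (e.g. with SparkLauncher) - stubbed here
    Thread.sleep(5 * 60 * 1000L)
  }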

Has anyone faced this issue, or does anyone have suggestions?

Thanks,
KP