Re: Reprocessing failed jobs in Streaming job
If your operations are idempotent, you should be able to run a completely separate job that looks for failed batches and builds a KafkaRDD to reprocess each one. C* probably isn't the first choice for what is essentially a queue, but if the frequency of batches is relatively low it probably doesn't matter.

That is indeed a weird stack trace. Did you check the driver logs to see whether something else preceded it?

On Wed, Dec 7, 2016 at 2:41 PM, map reduced wrote:
> I am storing the offsets per batch and success/failure in a separate C*
> table - checkpointing was not an option due to it not working with
> application jar change etc. Since I have access to the offsets, you think
> #2 or some variation of it may work?
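The separate reprocessing job suggested in this reply can be sketched without any Spark dependency. What follows is only an illustration of the bookkeeping, not code from the thread: `BatchRecord`, `reprocess_failed`, and the `max_attempts` limit are hypothetical names, the ledger is an in-memory dict standing in for the C* table, and `replay` stands in for whatever builds a KafkaRDD from the stored offset ranges (for example via `KafkaUtils.createRDD` in the 0-8 integration) and writes the results idempotently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# One Kafka offset range: (topic, partition, from_offset, until_offset),
# with until_offset exclusive.
OffsetRange = Tuple[str, int, int, int]


@dataclass
class BatchRecord:
    """Hypothetical ledger row: one streaming batch and its outcome."""
    batch_id: str
    ranges: List[OffsetRange]
    status: str = "FAILED"  # "FAILED" or "SUCCESS"
    attempts: int = 0       # replays tried so far


def reprocess_failed(ledger: Dict[str, BatchRecord],
                     replay: Callable[[List[OffsetRange]], None],
                     max_attempts: int = 3) -> List[str]:
    """Replay each failed batch once; return the ids that failed again."""
    failed_again = []
    for record in ledger.values():
        # Skip batches that already succeeded or have used up their retries.
        if record.status != "FAILED" or record.attempts >= max_attempts:
            continue
        record.attempts += 1
        try:
            # replay must be idempotent: a rerun may repeat earlier writes.
            replay(record.ranges)
            record.status = "SUCCESS"
        except Exception:
            failed_again.append(record.batch_id)
    return failed_again
```

Keeping the attempt count in the same record covers the concern raised earlier in the thread that the batch job itself can fail: a batch that keeps failing simply stays marked FAILED until it reaches the retry limit, at which point it needs manual attention.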
Re: Reprocessing failed jobs in Streaming job
On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger wrote:
> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.

I would think so too, but for just 2-3 (sometimes just 1) failed batches in a whole day, I am trying not to kill the whole processing and restart.

I am storing the offsets per batch, along with success/failure, in a separate C* table. Checkpointing was not an option because it does not work across application jar changes etc. Since I have access to the offsets, do you think #2 or some variation of it may work?

Btw, some of those failures I mentioned are strange, for instance (Spark 2.0.0 and spark-streaming-kafka-0-8_2.11):

Job aborted due to stage failure: Task 173 in stage 92312.0 failed 10 times, most recent failure: Lost task 173.9 in stage 92312.0 (TID 27689025, 17.162.114.161): java.util.NoSuchElementException
    at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
    at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:102)
    at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:81)
    at com.yammer.metrics.core.Histogram.update(Histogram.java:110)
    at com.yammer.metrics.core.Timer.update(Timer.java:198)
    at com.yammer.metrics.core.Timer.update(Timer.java:76)
    at com.yammer.metrics.core.TimerContext.stop(TimerContext.java:31)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:36)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:209)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Re: Reprocessing failed jobs in Streaming job
Personally I think forcing the stream to fail (e.g. check offsets in downstream store and throw exception if they aren't as expected) is the safest thing to do.

If you proceed after a failure, you need a place to reliably record the batches that failed for later processing.

On Wed, Dec 7, 2016 at 1:46 PM, map reduced wrote:
> I want to reprocess those failed jobs programmatically (assume I have a way
> of getting start-end offsets for kafka topics for failed jobs).
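The "check offsets in downstream store and throw exception" idea from this reply might look roughly like the sketch below. It is an assumption-laden illustration rather than anything posted in the thread: `check_offsets` and `OffsetMismatch` are hypothetical names, and the two dicts stand in for the offsets last committed to the downstream store and the starting offsets of the batch about to be written.

```python
from typing import Dict, Tuple

# (topic, partition)
TopicPartition = Tuple[str, int]


class OffsetMismatch(Exception):
    """Raised when a batch does not start where the stored offsets end."""


def check_offsets(stored_until: Dict[TopicPartition, int],
                  batch_from: Dict[TopicPartition, int]) -> None:
    """Raise OffsetMismatch if any partition has a gap or an overlap."""
    for tp, from_offset in batch_from.items():
        expected = stored_until.get(tp)
        # A partition with no stored offset yet is accepted as a first write.
        if expected is not None and expected != from_offset:
            raise OffsetMismatch(
                f"{tp}: expected batch to start at {expected}, got {from_offset}"
            )
```

Calling this before each write and letting the exception propagate would stop the stream at the first skipped batch, which trades availability for never silently losing a range.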
Reprocessing failed jobs in Streaming job
Hi,

I am trying to solve this problem: in my streaming flow, a few jobs fail every day for a few batches, for mostly unavoidable reasons (say, Kafka cluster maintenance), and then processing resumes successfully.

I want to reprocess those failed jobs programmatically (assume I have a way of getting the start and end offsets of the Kafka topics for the failed jobs). I was thinking of these options:

1) Somehow pause the streaming job when it detects failing jobs - this does not seem possible.
2) From the driver, run additional processing that checks every few minutes, using the driver REST API (/api/v1/applications...), which jobs have failed, and submit batch jobs for those failed jobs.

Option 1 doesn't seem to be possible, and I don't want to kill the streaming context just for a few failing batches, stopping the job for some time and resuming after a few minutes.

Option 2 seems viable but a little complicated, since even the batch job can fail for whatever reason, and then I am back to tracking that separately.

Has anyone faced this issue, or does anyone have suggestions?

Thanks,
KP
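For option 2, the polling step reduces to filtering a job list by status. The sketch below only covers that filtering and makes several assumptions: the response is taken to be a JSON array of objects with `jobId` and `status` fields, which should be checked against the monitoring documentation for the Spark version in use; `failed_job_ids` is a hypothetical helper; and `SAMPLE` is a made-up payload, not real output. Fetching the response and mapping a failed job back to its Kafka offsets are left out.

```python
import json
from typing import List


def failed_job_ids(payload: str) -> List[int]:
    """Return the ids of jobs whose status is FAILED, in ascending order."""
    jobs = json.loads(payload)
    return sorted(job["jobId"] for job in jobs if job.get("status") == "FAILED")


# Made-up response body, trimmed to the two fields this sketch reads.
SAMPLE = (
    '[{"jobId": 7, "status": "SUCCEEDED"},'
    ' {"jobId": 9, "status": "FAILED"},'
    ' {"jobId": 8, "status": "FAILED"}]'
)

if __name__ == "__main__":
    print(failed_job_ids(SAMPLE))
```

A job id on its own does not say which offsets were lost, so this approach still depends on a separate record that ties each batch to its offset ranges.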