Sorry, I sent this to the dev list instead of the user list. Please ignore; I'll re-post to the correct list.
Regards,
Art

On Thu, Jul 24, 2014 at 11:09 AM, Art Peel <found...@gmail.com> wrote:

> Our system works with RDDs generated from Hadoop files. It processes each
> record in a Hadoop file and, for a subset of those records, generates
> output that is written to an external system via RDD.foreach. There are no
> dependencies between the records that are processed.
>
> If writing to the external system fails (due to a detail of what is being
> written) and throws an exception, I see the following behavior:
>
> 1. Spark retries the entire partition (wasting time and effort), reaches
> the problem record, and fails again.
> 2. It repeats step 1 until the configured number of retries is reached and
> then gives up. As a result, the rest of the records from that Hadoop file
> are not processed.
> 3. The executor where the final attempt occurred is marked as failed and
> told to shut down, so I lose a core for processing the remaining Hadoop
> files, which slows down the entire job.
>
> For this particular problem, I know how to prevent the underlying
> exception, but I'd still like to get a handle on error handling for future
> situations that I haven't yet encountered.
>
> My goal is this: retry the problem record only (rather than starting over
> at the beginning of the partition) up to N times, then give up and move on
> to process the rest of the partition.
>
> As far as I can tell, I need to supply my own retry behavior, and if I
> want to process records after the problem record, I have to swallow
> exceptions inside the foreach block (a rough sketch of what I have in mind
> is in the P.S. below).
>
> My two questions are:
>
> 1. Is there anything I can do to prevent the executor where the final
> retry occurred from being shut down when a failure occurs?
>
> 2. Are there ways Spark can help me get closer to my goal of retrying only
> the problem record, without writing my own retry code and swallowing
> exceptions?
>
> Regards,
> Art
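>
> P.S. For concreteness, this is roughly the per-record retry I think I'd
> have to write myself. It is only a sketch: rdd, writeRecord, and
> maxRetries are stand-ins for our actual RDD, the call that writes to the
> external system, and the retry limit.
>
>     val maxRetries = 3  // stand-in for however many per-record attempts we allow
>
>     rdd.foreach { record =>
>       var attempts = 0
>       var succeeded = false
>       while (!succeeded && attempts < maxRetries) {
>         attempts += 1
>         try {
>           writeRecord(record)  // hypothetical call to the external system
>           succeeded = true
>         } catch {
>           case _: Exception if attempts < maxRetries =>
>             // failure on this record only; loop around and retry just this record
>           case e: Exception =>
>             // out of retries: swallow the exception so the rest of the
>             // partition still gets processed, and note the bad record
>             System.err.println(s"Giving up on record after $attempts attempts: $e")
>         }
>       }
>     }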