Hi Dipayan,

You ought to maintain data source consistency and minimise changes upstream. Spark is not a Swiss Army knife :)

Anyhow, we already handle this in Spark Structured Streaming with the concept of checkpointing.
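For illustration, a minimal Structured Streaming sketch showing where the checkpoint location comes in (the rate source, console sink and path below are placeholders, not your pipeline):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()

    # a trivial streaming source, purely for illustration
    df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # checkpointLocation is where offsets and state are persisted, so a
    # restarted query resumes from where it left off rather than from scratch
    query = (df.writeStream
               .format("console")
               .option("checkpointLocation", "hdfs://<checkpoint_directory>/streaming_demo")
               .start())

    query.awaitTermination()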
For a batch job you can tackle it from the Spark side by combining:

- Checkpointing
- Stateful processing in Spark
- A retry mechanism

In PySpark you can checkpoint an RDD or a DataFrame like this:

    # checkpointing an RDD: the checkpoint directory is set on the SparkContext,
    # and checkpoint() itself takes no argument
    sc.setCheckpointDir("hdfs://<checkpoint_directory>")
    rdd.checkpoint()

    # "checkpointing" a DataFrame by persisting it as a table
    dataframe.write.option("path", "hdfs://<checkpoint_directory>").mode("overwrite").saveAsTable("checkpointed_table")

A retry mechanism would look something like this:

    def myfunction(input_file_path, checkpoint_directory, max_retries):
        retries = 0
        while retries < max_retries:
            try:
                ...  # your Spark logic goes here
                break  # success, no need to retry
            except Exception as e:
                print(f"Error: {str(e)}")
                retries += 1
                if retries < max_retries:
                    print(f"Retrying... (Retry {retries}/{max_retries})")
                else:
                    print("Max retries reached. Exiting.")
                    break

Remember that checkpointing incurs I/O and is expensive! You can use cloud buckets for checkpointing as well.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Tue, 5 Sept 2023 at 10:12, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Hi Team,
>
> One of the biggest pain points we're facing is when Spark reads upstream
> partition data and, during an Action, the upstream also gets refreshed and
> the application fails with a 'File not exists' error. It could happen that
> the job has already spent a reasonable amount of time, and re-running the
> entire application is unwanted.
>
> I know the general solution to this is to handle how the upstream is
> managing the data, but is there a way to tackle this problem from the
> Spark application side? One approach I was thinking of is to at least save
> some state of the operations done by the Spark job till that point, and on
> a retry, resume the operation from that point?
>
>
> With Best Regards,
>
> Dipayan Dev