Hi Dipayan,

You ought to maintain data source consistency and minimise changes upstream. Spark is not a Swiss Army knife :)

Anyhow, we already handle this in Spark Structured Streaming with the concept of checkpointing.
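For illustration, a minimal Structured Streaming sketch showing where the checkpoint location comes in (the rate source, console sink and path below are placeholders, not your pipeline):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()

    # a trivial streaming source, purely for illustration
    df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # checkpointLocation is where offsets and state are persisted, so a
    # restarted query resumes from where it left off rather than from scratch
    query = (df.writeStream
               .format("console")
               .option("checkpointLocation", "hdfs://<checkpoint_directory>/streaming_demo")
               .start())

    query.awaitTermination()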
For a batch job you can tackle it from the Spark side by combining:

- Checkpointing
- Stateful processing in Spark
- A retry mechanism

In PySpark you can checkpoint an RDD or a DataFrame like this:

    # checkpointing an RDD: the checkpoint directory is set on the SparkContext,
    # and checkpoint() itself takes no argument
    sc.setCheckpointDir("hdfs://<checkpoint_directory>")
    rdd.checkpoint()

    # "checkpointing" a DataFrame by persisting it as a table
    dataframe.write.option("path", "hdfs://<checkpoint_directory>").mode("overwrite").saveAsTable("checkpointed_table")

A retry mechanism would look something like this:

    def myfunction(input_file_path, checkpoint_directory, max_retries):
        retries = 0
        while retries < max_retries:
            try:
                ...  # your Spark logic goes here
                break  # success, no need to retry
            except Exception as e:
                print(f"Error: {str(e)}")
                retries += 1
                if retries < max_retries:
                    print(f"Retrying... (Retry {retries}/{max_retries})")
                else:
                    print("Max retries reached. Exiting.")
                    break

Remember that checkpointing incurs I/O and is expensive! You can use cloud buckets for checkpointing as well.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Tue, 5 Sept 2023 at 10:12, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Hi Team,
>
> One of the biggest pain points we're facing is when Spark reads upstream
> partition data and, during an Action, the upstream also gets refreshed and
> the application fails with a 'File not exists' error. It could happen that
> the job has already spent a reasonable amount of time, and re-running the
> entire application is unwanted.
>
> I know the general solution to this is to handle how the upstream is
> managing the data, but is there a way to tackle this problem from the
> Spark application side? One approach I was thinking of is to at least save
> some state of the operations done by the Spark job till that point, and on
> a retry, resume the operation from that point?
>
>
> With Best Regards,
>
> Dipayan Dev