HeartSaVioR commented on PR #23782: URL: https://github.com/apache/spark/pull/23782#issuecomment-2039266260
https://github.com/apache/spark/pull/23782#issuecomment-555210613 This comment explains everything. Also, I do not agree that spark.sql.files.ignoreCorruptFiles is a rescue, as I commented above. If you ever require Spark to provide at-least-once fault tolerance, the source must never change on replay. If the input file is somehow overwritten between a batch failure and the reprocessing of that batch, fault tolerance is broken. This is a hard problem, not a trivial one.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
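The failure mode described above can be sketched concretely. The following is a minimal plain-Python simulation, not Spark code: `run_batch`, `files`, and the file names are all hypothetical. It shows how replaying a batch against an input file that was overwritten after the failure silently drops the originally observed records, violating at-least-once delivery.

```python
# Hypothetical simulation of batch replay; none of these names are Spark APIs.

def run_batch(source_files, committed_output):
    """Process one batch: read every record from the planned input files."""
    batch = []
    for path in source_files:
        batch.extend(files[path])  # reads the *current* contents of each file
    committed_output.extend(batch)
    return batch

# A batch is planned against this input file.
files = {"part-0000": ["a", "b", "c"]}
planned = ["part-0000"]

output = []
first_attempt = run_batch(planned, output)  # suppose this attempt fails downstream

# The input file is overwritten between the failure and the replay.
files["part-0000"] = ["x", "y"]

output = []  # the failed attempt's output is discarded
replayed = run_batch(planned, output)

# Records "a", "b", "c" were read once but are gone after replay,
# so at-least-once fault tolerance is broken.
print(sorted(set(first_attempt) - set(replayed)))  # → ['a', 'b', 'c']
```

The point of the sketch: replay is only safe if the source returns exactly the same data for the same planned batch, which an overwritten file cannot guarantee.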
