yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r983019509
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries =
+    options.ignoreCorruptFilesAfterRetries

   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {
Review Comment:
> [DOCS] Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

FYI, the doc says that it ignores corrupt files, so we are willing to ignore corrupt files, right? For a transient failure like a network issue, we want to retry, because the file is not actually corrupt. Anyway, as you said, we cannot distinguish whether a failure is transient or permanent, so in this PR I retry both kinds, with a limit.
> Or I could spin a different argument: if the data can't be read 2 times, and is read a 3rd time, are you OK with that being correct?

For this case, in this PR, we can set `ignoreCorruptFilesAfterRetries` greater than the maximum task failures, and let the task `fail` after retrying enough times. Or maybe we should rename it to `determineFileCorruptAfterRetries`.
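
The retry-gated skip being discussed could look roughly like the following. This is a hedged sketch, not the actual Spark implementation: the object name `RetryAwareCorruptFileSkipper` and the explicit parameters are hypothetical stand-ins (in Spark the attempt count would come from `TaskContext.get().attemptNumber()` and the flags from the read options).

```scala
// Sketch (hypothetical names): only treat a read failure as a corrupt file
// once the task has already been retried a configured number of times, so
// transient failures (e.g. network blips) still get retried first.
object RetryAwareCorruptFileSkipper {
  def shouldSkipCorruptFiles(
      ignoreCorruptFiles: Boolean,
      ignoreCorruptFilesAfterRetries: Int,
      attemptNumber: Int): Boolean = {
    // Plain ignoreCorruptFiles keeps the unconditional behavior: skip at once.
    // Otherwise, skip only after the transient-failure retry budget is spent.
    ignoreCorruptFiles || attemptNumber >= ignoreCorruptFilesAfterRetries
  }
}
```

With this shape, setting `ignoreCorruptFilesAfterRetries` above `spark.task.maxFailures` means the predicate never fires and the task fails normally, which matches the "let it fail after retrying enough times" behavior described above.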
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]