yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r983019509
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries =
+    options.ignoreCorruptFilesAfterRetries

   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {
Review Comment:
> [DOCS] Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

FYI, the doc says that it ignores corrupt files, so we are willing to ignore corrupt files, right? For a transient failure like a network issue, we want to retry, because the file is not actually corrupt. Anyway, as you said, we cannot distinguish whether a failure is transient or permanent, so in this PR I retry both kinds, with a limit.
> Or I could spin a different argument: if the data can't be read 2 times, and is read a 3rd time, are you OK with that being correct?

For this case, in this PR, we can set `ignoreCorruptFilesAfterRetries` greater than the maximum task failures, and let the task `fail` after retrying enough times. Or maybe we should rename it to `determineFileCorruptAfterRetries`.
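
The retry-gated skip being discussed could look roughly like the following. This is a hedged sketch, not the actual Spark implementation: the object name `RetryAwareCorruptFileSkipper` and the explicit parameters are hypothetical stand-ins (in Spark the attempt count would come from `TaskContext.get().attemptNumber()` and the flags from the read options).

```scala
// Sketch (hypothetical names): only treat a read failure as a corrupt file
// once the task has already been retried a configured number of times, so
// transient failures (e.g. network blips) still get retried first.
object RetryAwareCorruptFileSkipper {
  def shouldSkipCorruptFiles(
      ignoreCorruptFiles: Boolean,
      ignoreCorruptFilesAfterRetries: Int,
      attemptNumber: Int): Boolean = {
    // Plain ignoreCorruptFiles keeps the unconditional behavior: skip at once.
    // Otherwise, skip only after the transient-failure retry budget is spent.
    ignoreCorruptFiles || attemptNumber >= ignoreCorruptFilesAfterRetries
  }
}
```

With this shape, setting `ignoreCorruptFilesAfterRetries` above `spark.task.maxFailures` means the predicate never fires and the task fails normally, which matches the "let it fail after retrying enough times" behavior described above.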
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]