[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356271#comment-16356271 ]
Márcio Furlani Carmona commented on SPARK-23308:
------------------------------------------------

That's true, Steve. I totally agree that it'd be hard to identify what is retryable from Spark's perspective. That being said, I agree that the FS should be responsible for that decision.

I believe one option would be leaving the retry responsibility to the FS implementations (as it already seems to be) and adding documentation for this flag making clear that you might experience data loss. Another option would be creating a special exception (CorruptedFileException?) that FS implementations could throw, letting them decide what is a corrupted file versus just a transient error.

A more complex approach would be having a file blacklist mechanism rather than this flag, similar to the `spark.blacklist.*` feature, where you can decide how many times to retry a file before considering it corrupted; then you won't need to decide what is worth retrying or not. The downside is that you'll always retry, even when there's no point in retrying. Rough sketches of these last two ideas follow the quoted issue details below.

*Regarding the socket timeouts:* I also believe it's some kind of throttling. I'm reading over 80k files with over 10 TB of data. The file sizes are not uniform, so some files may be read in a single request while others might get split into multiple partitions. Since I'm not overriding the default `spark.files.maxPartitionBytes` value of `128 MB`, I should get at least ~82k partitions (10 TB / 128 MB = 81,920), and thus at least the same number of S3 requests. Also, the files share some common prefixes, which [might be bad for S3 index access|https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html].

I have a total of 120 executors, with 9 cores each, across 30 workers. But since each task involves a fair amount of computation, the median task execution time is 8s, so I don't think we're getting close to the 100 TPS for a common prefix mentioned in the S3 documentation above. So I'm more inclined to say it might be some EC2 network throttling rather than S3 throttling. Another reason to believe that is that I've seen [503 Slow Down|https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html] errors from S3 in the past when my requests got throttled, which I'm not seeing this time.

> ignoreCorruptFiles should not ignore retryable IOException
> ----------------------------------------------------------
>
>                 Key: SPARK-23308
>                 URL: https://issues.apache.org/jira/browse/SPARK-23308
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Márcio Furlani Carmona
>            Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind of RuntimeException or IOException, but some of those IOExceptions may happen even if the file is not corrupted.
> One example is SocketTimeoutException, which can be retried to possibly fetch the data, without meaning the data is corrupted.
>
> See: https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
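To make the CorruptedFileException idea a bit more concrete, here's a minimal sketch. Everything in it is hypothetical: neither the exception type nor the helper exists in Spark, and the catch block is only a simplified stand-in for the error handling in FileScanRDD linked above.

```scala
import java.io.IOException

// Hypothetical marker exception (not part of Spark or Hadoop): a FileSystem
// implementation would throw this only once it has decided a file is
// genuinely corrupted, e.g. after exhausting its own retries.
class CorruptedFileException(message: String, cause: Throwable)
  extends IOException(message, cause)

// Simplified stand-in for the catch block in FileScanRDD: with a dedicated
// exception type, ignoreCorruptFiles only swallows errors the FS explicitly
// classified as corruption. Transient IOExceptions like SocketTimeoutException
// propagate, so the task fails and Spark's normal task retry applies.
def readIgnoringCorruptFiles[T](ignoreCorruptFiles: Boolean)(read: => T): Option[T] =
  try {
    Some(read)
  } catch {
    case e: CorruptedFileException if ignoreCorruptFiles =>
      None // skip the rest of this file, it is known to be corrupted
  }
```

The point is that the FS implementation, which is the only layer that knows which of its errors are transient, decides when to throw the corruption marker; Spark just reacts to it.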
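And a rough sketch of the blacklist-style alternative, in the spirit of `spark.blacklist.*`. The tracker class, its name, and the task-local map are all illustrative assumptions, not an existing Spark API.

```scala
import java.io.IOException
import scala.collection.mutable

// Illustrative retry bookkeeping in the spirit of spark.blacklist.*: a file
// only counts as corrupted after failing maxAttempts times. NOTE: the
// task-local map is a simplification; real failure counts would have to be
// shared across task attempts (e.g. tracked on the driver).
class FileRetryTracker(maxAttempts: Int) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  def readOrSkip[T](path: String)(read: => T): Option[T] =
    try {
      Some(read)
    } catch {
      case e: IOException =>
        failures(path) += 1
        if (failures(path) < maxAttempts) throw e // maybe transient: fail the task so it gets retried
        else None // retries exhausted: treat the file as corrupted and skip it
    }
}
```

As noted above, the trade-off is that every failing file gets retried maxAttempts times, even when retrying is pointless.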