[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356271#comment-16356271 ]
Márcio Furlani Carmona commented on SPARK-23308:
------------------------------------------------

That's true, Steve. I totally agree that it'd be hard to identify what is retryable from Spark's perspective. That being said, I agree that the FS should be responsible for that decision.

I believe one option would be leaving the retry responsibility to the FS implementations (as it already seems to be) and adding documentation for this flag making clear that you might experience data loss. Another option would be creating a special exception (CorruptedFileException?) that FS implementations could throw, letting them decide what is a corrupted file versus just a transient error.

A more complex approach would be having a file blacklist mechanism rather than this flag, similar to the `spark.blacklist.*` feature, where you can decide how many times to retry a file before considering it corrupted; then you won't need to decide what is worth retrying or not. The downside is that you'll always retry, even when there's no point in retrying. Rough sketches of these last two ideas follow the quoted issue details below.

*Regarding the socket timeouts:* I also believe it's some kind of throttling. I'm reading over 80k files with over 10 TB of data. The file sizes are not uniform, so some files may be read in a single request while others might get split into multiple partitions. Since I'm not overriding the default `spark.files.maxPartitionBytes` value of `128 MB`, I should get at least ~82k partitions (10 TB / 128 MB = 81,920), and thus at least the same number of S3 requests. Also, the files share some common prefixes, which [might be bad for S3 index access|https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html].

I have a total of 120 executors, with 9 cores each, across 30 workers. But since each task involves a fair amount of computation, the median task execution time is 8s, so I don't think we're getting close to the 100 TPS for a common prefix mentioned in the S3 documentation above. So I'm more inclined to say it might be some EC2 network throttling rather than S3 throttling. Another reason to believe that is that I've seen [503 Slow Down|https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html] errors from S3 in the past when my requests got throttled, which I'm not seeing this time.

> ignoreCorruptFiles should not ignore retryable IOException
> ----------------------------------------------------------
>
>                 Key: SPARK-23308
>                 URL: https://issues.apache.org/jira/browse/SPARK-23308
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Márcio Furlani Carmona
>            Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind of RuntimeException or IOException, but some of those IOExceptions may happen even if the file is not corrupted.
> One example is SocketTimeoutException, which can be retried to possibly fetch the data, without meaning the data is corrupted.
>
> See: https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
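To make the CorruptedFileException idea a bit more concrete, here's a minimal sketch. Everything in it is hypothetical: neither the exception type nor the helper exists in Spark, and the catch block is only a simplified stand-in for the error handling in FileScanRDD linked above.

```scala
import java.io.IOException

// Hypothetical marker exception (not part of Spark or Hadoop): a FileSystem
// implementation would throw this only once it has decided a file is
// genuinely corrupted, e.g. after exhausting its own retries.
class CorruptedFileException(message: String, cause: Throwable)
  extends IOException(message, cause)

// Simplified stand-in for the catch block in FileScanRDD: with a dedicated
// exception type, ignoreCorruptFiles only swallows errors the FS explicitly
// classified as corruption. Transient IOExceptions like SocketTimeoutException
// propagate, so the task fails and Spark's normal task retry applies.
def readIgnoringCorruptFiles[T](ignoreCorruptFiles: Boolean)(read: => T): Option[T] =
  try {
    Some(read)
  } catch {
    case e: CorruptedFileException if ignoreCorruptFiles =>
      None // skip the rest of this file, it is known to be corrupted
  }
```

The point is that the FS implementation, which is the only layer that knows which of its errors are transient, decides when to throw the corruption marker; Spark just reacts to it.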
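And a rough sketch of the blacklist-style alternative, in the spirit of `spark.blacklist.*`. The tracker class, its name, and the task-local map are all illustrative assumptions, not an existing Spark API.

```scala
import java.io.IOException
import scala.collection.mutable

// Illustrative retry bookkeeping in the spirit of spark.blacklist.*: a file
// only counts as corrupted after failing maxAttempts times. NOTE: the
// task-local map is a simplification; real failure counts would have to be
// shared across task attempts (e.g. tracked on the driver).
class FileRetryTracker(maxAttempts: Int) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  def readOrSkip[T](path: String)(read: => T): Option[T] =
    try {
      Some(read)
    } catch {
      case e: IOException =>
        failures(path) += 1
        if (failures(path) < maxAttempts) throw e // maybe transient: fail the task so it gets retried
        else None // retries exhausted: treat the file as corrupted and skip it
    }
}
```

As noted above, the trade-off is that every failing file gets retried maxAttempts times, even when retrying is pointless.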