Steve Loughran commented on SPARK-23308:


> I should get at least ~82k partitions, thus the same number of S3 requests.

If your input stream does an abort/reopen on every seek and positioned read, 
then you get many more S3 requests than that when reading columnar data, because 
the reader bounces around the file a lot. See HADOOP-13203 for the work there, 
and the recent change in HADOOP-14965, which at least reduces the number of TCP 
aborts; it does nothing for the GET count, it just uses smaller ranges in the 
GET calls. Oh, and if you use SSE-KMS, there is separate throttling, but AFAIK 
it should only surface on the initial GET, when the encryption kicks off.
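
To make the request-count point concrete, here is a toy model in Java. It is not S3A code: the class and method names are made up for illustration, and it assumes the simplest possible policy, where the stream issues a new GET whenever a read does not start exactly where the previous one ended.

```java
public class GetCountSketch {

    /**
     * Counts the GETs issued by a toy input stream that reopens whenever a
     * read does not start where the previous read ended.
     * Each entry in reads is {offset, length} in bytes.
     * This is an illustrative assumption, not how any real client behaves.
     */
    static int countGets(long[][] reads) {
        int gets = 0;
        long pos = -1; // -1: no stream open yet, so the first read always opens one
        for (long[] r : reads) {
            if (r[0] != pos) {
                gets++; // non-contiguous read: abort and reopen with a new GET
            }
            pos = r[0] + r[1]; // stream is now positioned at the end of this read
        }
        return gets;
    }

    public static void main(String[] args) {
        // Row-oriented style: read the file front to back.
        long[][] sequential = {{0, 100}, {100, 100}, {200, 100}};
        // Columnar style: footer first, then scattered column chunks.
        long[][] columnar = {{900, 100}, {0, 50}, {400, 50}, {450, 50}};

        System.out.println(countGets(sequential)); // 1
        System.out.println(countGets(columnar));   // 3
    }
}
```

Under that assumption a single file can cost several GETs, so the number of partitions is only a lower bound on the number of S3 requests.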

> ignoreCorruptFiles should not ignore retryable IOException
> ----------------------------------------------------------
>                 Key: SPARK-23308
>                 URL: https://issues.apache.org/jira/browse/SPARK-23308
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Márcio Furlani Carmona
>            Priority: Minor
> When `spark.sql.files.ignoreCorruptFiles` is set, Spark ignores any kind of 
> RuntimeException or IOException, but some IOExceptions can happen even if the 
> file is not corrupted.
> One example is SocketTimeoutException: the read can be retried and may 
> succeed, so the exception does not mean the data is corrupted.
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
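
The distinction the description asks for could look roughly like the Java sketch below. It is not Spark's code and not a proposed patch: `RetryableReadSketch`, `isRetryable` and `readWithRetry` are hypothetical names, and treating only SocketTimeoutException as retryable is an assumption taken from the one example given in the issue.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RetryableReadSketch {

    /**
     * Hypothetical classifier: true for failures that say nothing about the
     * file contents. Only the example from the issue is listed here.
     */
    static boolean isRetryable(Throwable t) {
        return t instanceof SocketTimeoutException;
    }

    /**
     * Runs the read up to maxAttempts times. Retryable IOExceptions are
     * retried; any other exception propagates at once, so the caller can
     * still decide whether to treat it as a corrupt file.
     */
    static <T> T readWithRetry(Callable<T> read, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (IOException e) {
                if (!isRetryable(e)) {
                    throw e; // possibly corruption: do not retry
                }
                last = e; // transient: try again
            }
        }
        throw last; // retries exhausted: surface the failure rather than skip the file
    }
}
```

With something like this, only the exceptions that survive the retry loop and are not retryable would be candidates for `ignoreCorruptFiles` to swallow.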
