[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360809#comment-16360809 ]
Steve Loughran edited comment on SPARK-23308 at 2/14/18 11:27 AM:
------------------------------------------------------------------

BTW

bq. I should get at least ~82k partitions, thus the same number of S3 requests.

If your input stream does an abort/reopen on seek and positioned read, then you get many more S3 requests when reading columnar data, because columnar reads bounce around the file a lot. See HADOOP-13203 for the work there, and the recent change in HADOOP-14965, which at least reduces the TCP abort call count. It does nothing for the GET count; it just uses smaller ranges in the GET calls.

Oh, and if you use SSE-KMS, that has its own separate throttling, but AFAIK it should only surface on the initial GET, when the encryption kicks off.

> ignoreCorruptFiles should not ignore retryable IOException
> ----------------------------------------------------------
>
>                 Key: SPARK-23308
>                 URL: https://issues.apache.org/jira/browse/SPARK-23308
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Márcio Furlani Carmona
>            Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, Spark ignores every
> RuntimeException and IOException, but some IOExceptions can occur even
> when the file is not corrupted.
> One example is SocketTimeoutException: the read can be retried and may
> well succeed, so the exception does not mean the data is corrupted.
>
> See:
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
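
To make the distinction in the issue description concrete, here is a minimal Python sketch of the behaviour being asked for: transient errors are retried and then propagated, and only the remaining errors are skipped as corruption. It is not Spark's code. `read_with_policy`, `CorruptFileError`, `RETRYABLE`, and the retry parameters are all hypothetical names chosen for illustration, and the choice of which exception types count as retryable is an assumption.

```python
import socket
import time


class CorruptFileError(IOError):
    """Hypothetical marker for data that is genuinely unreadable."""


# Assumption: these error types are transient and worth retrying.
RETRYABLE = (socket.timeout, TimeoutError, ConnectionError)


def read_with_policy(read_fn, ignore_corrupt_files, max_attempts=3, backoff_s=0.0):
    """Call read_fn(); retry transient errors, optionally skip corrupt files.

    Returns the data, or None when the file is skipped as corrupt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except RETRYABLE:
            # Transient failure: the file may be fine, so never skip it silently.
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
        except (IOError, RuntimeError):
            # Everything else is treated as corruption, as the flag intends.
            if ignore_corrupt_files:
                return None
            raise
```

The retryable clause is listed first on purpose: in Python the timeout and connection errors are themselves subclasses of `OSError` (`IOError`), so reversing the order would silently classify them as corruption, which is the same trap the issue describes.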