[jira] [Comment Edited] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

Steve Loughran (JIRA) Wed, 14 Feb 2018 03:29:12 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360809#comment-16360809
 ]


Steve Loughran edited comment on SPARK-23308 at 2/14/18 11:27 AM:
------------------------------------------------------------------

BTW

bq.  I should get at least ~82k partitions, thus the same number of S3 
requests. 

If your input stream does abort/reopen on seek & positioned read, then you 
get many more S3 requests when reading columnar data, which bounces around the 
file a lot. See HADOOP-13203 for the work there, & the recent change in 
HADOOP-14965, which at least reduces the TCP abort count. It does nothing for 
the GET count; it just uses smaller ranges in the GET calls. Oh, and if you use 
SSE-KMS, that has its own separate throttling, but AFAIK it should only surface 
on the initial GET, when the encryption kicks off.
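
To illustrate the GET-versus-abort point, here is a toy model in Java. It is a sketch under stated assumptions, not S3A's actual logic: the class and method names are hypothetical, the read pattern is made up, and real clients may read ahead or skip forward instead of reopening. It only counts what happens when every read lands at a different offset, as a columnar reader tends to do.

```java
public class SeekCostModel {

    // Toy "open to end of file" policy (an assumption, not S3A code):
    // every GET runs from the requested offset to EOF, so a read that starts
    // anywhere other than the current stream position must abandon the open
    // connection and issue a new GET.
    // reads[i] = {offset, length}; returns {GET count, abort count}.
    static int[] openToEof(long[][] reads, long fileLen) {
        int gets = 0;
        int aborts = 0;
        long streamPos = -1; // -1: no stream open yet
        for (long[] r : reads) {
            if (streamPos != r[0]) {
                if (streamPos >= 0 && streamPos < fileLen) {
                    aborts++; // unread bytes remain, so the connection is aborted
                }
                gets++;
            }
            streamPos = r[0] + r[1];
        }
        if (streamPos >= 0 && streamPos < fileLen) {
            aborts++; // closing the last stream before EOF also aborts it
        }
        return new int[] {gets, aborts};
    }

    // Toy "ranged GET" policy (also an assumption): each read requests exactly
    // its own byte range, so every response is fully consumed and nothing is
    // aborted. The GET count is simply the number of reads.
    static int[] rangedGets(long[][] reads) {
        return new int[] {reads.length, 0};
    }

    public static void main(String[] args) {
        long fileLen = 1000;
        // Hypothetical columnar pattern: footer first, then scattered chunks.
        long[][] reads = { {900, 100}, {0, 50}, {400, 50}, {200, 50} };
        int[] a = openToEof(reads, fileLen);
        int[] b = rangedGets(reads);
        System.out.println("open-to-EOF: GETs=" + a[0] + " aborts=" + a[1]);
        System.out.println("ranged:      GETs=" + b[0] + " aborts=" + b[1]);
    }
}
```

For this pattern both policies issue 4 GETs, but the open-to-EOF policy aborts 3 connections and the ranged policy aborts none. Either way the request count is well above one per file.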



> ignoreCorruptFiles should not ignore retryable IOException
> ----------------------------------------------------------
>
>                 Key: SPARK-23308
>                 URL: https://issues.apache.org/jira/browse/SPARK-23308
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Márcio Furlani Carmona
>            Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, Spark ignores any kind 
> of RuntimeException or IOException, but some IOExceptions can happen 
> even if the file is not corrupted.
> One example is SocketTimeoutException, which can be retried to possibly 
> fetch the data and does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
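
The distinction the issue asks for can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not Spark's code or a proposed patch: `RetryableReads`, `IOSupplier`, `isRetryable` and `readWithRetry` are hypothetical names, and treating only SocketTimeoutException as retryable is an assumption taken from the one example the description gives.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class RetryableReads {

    // Hypothetical functional interface for a read that may fail with IOException.
    interface IOSupplier<T> {
        T get() throws IOException;
    }

    // Assumption: only a socket timeout counts as transient here.
    // A real classifier would likely cover more exception types.
    static boolean isRetryable(IOException e) {
        return e instanceof SocketTimeoutException;
    }

    // Retry transient failures up to maxAttempts times; rethrow anything else
    // immediately so the caller can decide whether the file is corrupt.
    static <T> T readWithRetry(IOSupplier<T> read, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.get();
            } catch (IOException e) {
                if (!isRetryable(e)) {
                    throw e; // not transient: surface it right away
                }
                last = e; // transient: remember it and try again
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated read that times out twice before succeeding.
        String data = readWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "ok";
        }, 5);
        System.out.println(data + " after " + calls[0] + " attempts");

        try {
            // Simulated truncated file: not retried.
            readWithRetry(() -> { throw new EOFException("simulated truncation"); }, 5);
        } catch (EOFException e) {
            System.out.println("not retried: " + e.getMessage());
        }
    }
}
```

With a wrapper like this, only the exceptions that survive the retries (or were never retryable) would reach whatever logic decides to skip a file as corrupt.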



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
