[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

viirya Wed, 04 Jan 2017 21:06:06 -0800

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16474


    [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet

    ## What changes were proposed in this pull request?
    
    We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to 
ignore corrupt files when reading files in SQL. Currently the 
`ignoreCorruptFiles` config has two issues and does not work for Parquet:
    
    1. We only ignore corrupt files in `FileScanRDD`. In fact, we begin to 
read those files earlier, when inferring the data schema from them. For 
corrupt files, we can't read the schema, so the program fails. A related issue 
was reported at 
http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
    2. In `FileScanRDD`, we assume that we only begin to read the files when 
starting to consume the iterator. However, it is possible that the files are 
read before that. In this case, the `ignoreCorruptFiles` config doesn't work 
either.
    
    This patch targets the Parquet datasource. If this direction is ok, we 
can address the same issue for other datasources such as ORC.
    
    Two main changes in this patch:
    
    1. Replace `ParquetFileReader.readAllFootersInParallel` with our own 
logic to read footers in a multi-threaded manner
    
        We can't ignore corrupt files if we use 
`ParquetFileReader.readAllFootersInParallel`. So this patch implements 
similar logic in `readParquetFootersInParallel`.
    
    2. In `FileScanRDD`, we also need to ignore corrupt files when we call 
`readFunction` to obtain the iterator.
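
To illustrate the second change, here is a minimal, self-contained Java sketch of the two failure points: the call that produces the iterator and the later consumption of that iterator. It is not Spark's `FileScanRDD` code. The class `CorruptTolerantScan`, the method `scan`, and the helper `demoRead` are hypothetical names introduced only for this example, and the loop is eager rather than lazy to keep it short. As a further simplification, rows already read from a file that later fails are discarded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a file scan loop; not Spark's FileScanRDD. */
final class CorruptTolerantScan {

    /** Reads every file through readFunction, optionally skipping files that fail. */
    static <T> List<T> scan(List<String> files,
                            Function<String, Iterator<T>> readFunction,
                            boolean ignoreCorruptFiles) {
        List<T> rows = new ArrayList<>();
        for (String file : files) {
            List<T> fileRows = new ArrayList<>();
            try {
                // Failure point 1: creating the iterator may already read the file.
                Iterator<T> it = readFunction.apply(file);
                // Failure point 2: the file may only be read while iterating.
                while (it.hasNext()) {
                    fileRows.add(it.next());
                }
            } catch (RuntimeException e) {
                if (!ignoreCorruptFiles) {
                    throw e;
                }
                continue; // skip this file; its partial rows are dropped (sketch choice)
            }
            rows.addAll(fileRows);
        }
        return rows;
    }

    /** Hypothetical reader: fails on open for "bad-open", on first row for "bad-read". */
    static Iterator<Integer> demoRead(String file) {
        if (file.equals("bad-open")) {
            throw new IllegalStateException("cannot read footer of " + file);
        }
        if (file.equals("bad-read")) {
            return new Iterator<Integer>() {
                @Override public boolean hasNext() { return true; }
                @Override public Integer next() {
                    throw new IllegalStateException("corrupt block in " + file);
                }
            };
        }
        return Arrays.asList(1, 2).iterator();
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("good", "bad-open", "bad-read");
        List<Integer> rows = scan(files, CorruptTolerantScan::demoRead, true);
        System.out.println(rows); // prints [1, 2]
    }
}
```

If only the iteration were guarded, the `bad-open` file would still fail the whole scan, which is the situation described in issue 2 above.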
    
    One thing to note:
    
    We read the schema from the Parquet file's footer. The method that reads 
the footer, `ParquetFileReader.readFooter`, throws `RuntimeException` instead 
of `IOException` if it can't successfully read the footer. Please check out 
https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470.
 So this patch catches `RuntimeException`. One concern is that this might also 
mask runtime exceptions that are unrelated to corrupt files.
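
The footer-reading approach and the concern about catching `RuntimeException` can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the patch's `readParquetFootersInParallel`: the class `FooterReaderSketch`, the method `readFootersInParallel`, and the stand-in `demoReadFooter` are hypothetical, and no Parquet API is called. The footer reader is passed in as a function so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of reading footers in parallel while tolerating corrupt files. */
final class FooterReaderSketch {

    static <F> List<F> readFootersInParallel(List<String> paths,
                                             Function<String, F> readFooter,
                                             boolean ignoreCorruptFiles,
                                             int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit one task per file; results are collected in submission order.
            List<Future<F>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<F> task = () -> readFooter.apply(path);
                pending.add(pool.submit(task));
            }
            List<F> footers = new ArrayList<>();
            for (Future<F> future : pending) {
                try {
                    footers.add(future.get());
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    // Caveat: this also hides RuntimeExceptions unrelated to corruption.
                    if (ignoreCorruptFiles && cause instanceof RuntimeException) {
                        continue; // skip the unreadable footer
                    }
                    throw new IllegalStateException("Could not read footer", cause);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while reading footers", e);
                }
            }
            return footers;
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical footer reader: throws RuntimeException for paths ending in ".corrupt". */
    static String demoReadFooter(String path) {
        if (path.endsWith(".corrupt")) {
            throw new RuntimeException(path + " is not a readable Parquet file");
        }
        return "schema:" + path;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("a.parquet", "b.corrupt", "c.parquet");
        List<String> footers =
            readFootersInParallel(paths, FooterReaderSketch::demoReadFooter, true, 2);
        System.out.println(footers); // prints [schema:a.parquet, schema:c.parquet]
    }
}
```

Because the sketch treats every `RuntimeException` as a sign of corruption, a programming error inside the reader would be skipped silently when the flag is on, which is the trade-off raised above.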
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-ignorecorrupted-parquet-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16474
    
----
commit 586b347b04b64ddf2b70e4fb16035f80ad5a400e
Author: Liang-Chi Hsieh <[email protected]>
Date:   2017-01-05T04:02:13Z

    Make ignoreCorruptFiles work for Parquet.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

