GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/14627

    [SPARK-16975][SQL] Do not check file paths twice in data sources implementing FileFormat and prevent listing twice in ORC

    ## What changes were proposed in this pull request?
    
    This PR removes the duplicated file path checks in the data sources implementing `FileFormat` and prevents the ORC data source from attempting to list files twice.
    
    https://github.com/apache/spark/pull/14585 handles a problem with partition column names starting with `_`, and that issue itself is resolved correctly. However, it seems the data sources implementing `FileFormat` validate the paths a second time. Judging from the comment `// TODO: Move filtering.` in `CSVFileFormat`, I believe we do not have to perform this check twice.
    
    Currently, these paths already seem to be filtered out in `HadoopFsRelation.shouldFilterOut` and `PartitioningAwareFileCatalog.isDataPath`, so `FileFormat.inferSchema` always receives the remaining, already-filtered files. For example, running the code below:
    
    ```scala
    spark.range(10).withColumn("_locality_code", $"id")
      .write.partitionBy("_locality_code").save("/tmp/parquet")
    spark.read.parquet("/tmp/parquet")
    ```
    
    gives the paths below, which contain no directories but only valid data files:
    
    ```bash
    /tmp/parquet/_locality_code=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
    /tmp/parquet/_locality_code=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
    /tmp/parquet/_locality_code=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
    ...
    ```
    
    to `FileFormat.inferSchema`.
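
    For reference, the upstream filter behaves roughly like the sketch below. This is a hedged approximation of what `HadoopFsRelation.shouldFilterOut` and `PartitioningAwareFileCatalog.isDataPath` do, not the exact Spark code:

    ```scala
    import org.apache.hadoop.fs.Path

    // Rough approximation of the upstream filtering; the exact rules in Spark
    // may differ. Hidden and metadata entries (e.g. "_SUCCESS", ".crc" files)
    // are dropped, while partition directories such as "_locality_code=0" are
    // kept because they contain "=".
    def shouldFilterOut(pathName: String): Boolean =
      (pathName.startsWith("_") && !pathName.contains("=")) ||
        pathName.startsWith(".")

    // Only paths surviving the filter are treated as data files and handed on
    // to FileFormat.inferSchema.
    def isDataPath(path: Path): Boolean = !shouldFilterOut(path.getName)
    ```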
    
    In addition, this PR fixes the same problem in ORC. For the reason above, the paths do not need to be validated again; however, the ORC data source validates both the paths and whether each one is a directory (attempting to list the leaf files) in `OrcFileOperator`. Since the paths it receives are already validated and are not directories, ORC does not have to attempt this listing and validation again.
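
    Conceptually, the ORC side of the change amounts to something like the sketch below. This is an illustrative simplification, not the actual implementation; `readOrcFooterSchema` is a hypothetical helper standing in for the footer-reading logic in `OrcFileOperator`:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper standing in for OrcFileOperator's footer reading:
    // returns the schema of one ORC file, or None for schema-less files
    // (e.g. ones written by empty tasks).
    def readOrcFooterSchema(file: Path, conf: Configuration): Option[StructType] = ???

    // Since the file catalog guarantees that `paths` are leaf data files,
    // schema inference no longer needs to check for directories or re-list
    // children; it can take the first file whose footer yields a schema.
    def inferOrcSchema(paths: Seq[String], conf: Configuration): Option[StructType] =
      paths.view
        .map(p => readOrcFooterSchema(new Path(p), conf))
        .collectFirst { case Some(schema) => schema }
    ```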
    
    This PR fixes both problems above.
    
    ## How was this patch tested?
    
    A unit test was added in `HadoopFsRelationTest`; related existing tests also cover this change.
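
    A test along the following lines exercises the round trip with an underscore-prefixed partition column (a hedged sketch assuming the usual `QueryTest`/`SQLTestUtils` helpers such as `withTempPath`, `checkAnswer`, and `testImplicits`; the actual test added in the PR may differ):

    ```scala
    test("SPARK-16975: partition column starting with underscore") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        spark.range(10).withColumn("_locality_code", $"id")
          .write.partitionBy("_locality_code").save(path)
        // Reading back must keep the `_locality_code=*` directories as
        // partitions and must not re-validate or re-list the leaf files.
        checkAnswer(
          spark.read.load(path).select("id", "_locality_code"),
          spark.range(10).select($"id", $"id".as("_locality_code")))
      }
    }
    ```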


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16975

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14627.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14627
    
----
commit 3fa597c140af57898f1050c54ea27e2e6e6f322c
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-08-13T00:53:25Z

    Do not duplicately check file paths and list twice in ORC

----

