[GitHub] spark pull request: [SPARK-7673] [SQL] WIP: Moves file status cach...

liancheng Sun, 17 May 2015 18:50:56 -0700

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/6225


    [SPARK-7673] [SQL] WIP: Moves file status cache into HadoopFSRelation

    This PR tries to optimize `HadoopFsRelation`-related query planning by
moving `FileStatus` listing out of `DataSourceStrategy` and into a cache within
`HadoopFsRelation`. To reuse the cached `FileStatus` objects,
`HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of
`Array[String]`.
    
    TODO
    
    - [ ] Fix failed test cases.
    - [ ] Pass HDFS paths as `Path` instead of `String`, since that part of the
code has been made private.
    - [ ] Reuse `FileStatus` cache when reading Parquet footers.
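
The caching idea in the description above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `SimpleFileStatus`, `FileStatusCache`, `listFiles`, and the simplified `buildScan` signature are all hypothetical stand-ins, chosen so the sketch runs without a Hadoop or Spark dependency.

```scala
import scala.collection.mutable

// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus, so the sketch
// runs without a Hadoop dependency.
final case class SimpleFileStatus(path: String, length: Long)

// Hypothetical cache: lists each directory at most once until refreshed.
// `listFiles` stands in for a file system listing call.
final class FileStatusCache(listFiles: String => Seq[SimpleFileStatus]) {
  private val cache = mutable.Map.empty[String, Array[SimpleFileStatus]]

  // Returns cached statuses for `dir`, listing the directory only on a miss.
  def statuses(dir: String): Array[SimpleFileStatus] =
    cache.getOrElseUpdate(dir, listFiles(dir).toArray)

  // Drops all cached entries so the next lookup lists the directory again.
  def refresh(): Unit = cache.clear()
}

object FileStatusCacheDemo {
  // Counts how many times the fake file system was actually listed.
  var listings = 0

  def fakeList(dir: String): Seq[SimpleFileStatus] = {
    listings += 1
    Seq(SimpleFileStatus(s"$dir/part-00000", 128L))
  }

  // A scan that receives file statuses rather than path strings, so it can
  // use metadata such as file length without asking the file system again.
  def buildScan(inputFiles: Array[SimpleFileStatus]): Long =
    inputFiles.map(_.length).sum

  def main(args: Array[String]): Unit = {
    val cache = new FileStatusCache(fakeList)
    val first = buildScan(cache.statuses("/data/t"))
    val second = buildScan(cache.statuses("/data/t"))
    println(s"bytes=$first, bytes=$second, listings=$listings")
  }
}
```

In this sketch the second call to `statuses` is served from the cache, so the fake file system is listed only once. Passing status objects rather than strings to `buildScan` is what lets the scan reuse that metadata.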

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-7673

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6225.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6225
    
----
commit 6a08b02a06d08f5e22db790560691cb91bb2a180
Author: Cheng Lian <[email protected]>
Date:   2015-05-18T01:43:30Z

    WIP: Moves file status cache into HadoopFSRelation

----


