GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/15539

    [SPARK-17994] [SQL] Add back a file status cache for catalog tables

    ## What changes were proposed in this pull request?
    
    In SPARK-16980, we removed the full in-memory cache of table partitions in 
favor of loading only needed partitions from the metastore. This greatly 
improves the initial latency of queries that only read a small fraction of 
table partitions.
    
    However, since the metastore does not store file statistics, we need to 
discover those from remote storage. With the loss of the in-memory file status 
cache, this has to happen on each query, increasing the latency of repeated 
queries over the same partitions.
    
    The proposal is to add back a per-table cache of partition contents, i.e. 
`Map[Path, Array[FileStatus]]`. The cache can be invalidated through 
`refreshTable()` and `refreshByPath()`. Unlike the prior cache, it can be 
incrementally updated as new partitions are read.
    
    ## How was this patch tested?
    
    Existing tests and new tests in `HiveTablePerfStatsSuite`.
    
    cc @mallman 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark meta-cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15539.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15539
    
----
commit c2eacb7da1d2d4129b19be89a2c07e91dbff3964
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-08-10T19:07:34Z

    [SPARK-16980][SQL] Load only catalog table partition metadata required
    to answer a query

commit 1f611c4089102744242b73346d9724d248635cac
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-09-13T01:21:38Z

    Add a new catalyst optimizer rule to SQL core for pruning unneeded
    partitions' files from a table file catalog

commit 8b24eada4a0b49f39d16570ee86f52ddc1682251
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-08T00:15:11Z

    Include the type of file catalog in the FileSourceScanExec metadata

commit f82f0d228141dd026b0b631e8d984961ee8b827b
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-08T00:15:54Z

    TODO: Consider renaming FileCatalog to better differentiate it from
    BasicFileCatalog (or vice-versa)

commit 1f0d5d88538da058e474098eabba53d387f70f53
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T02:54:53Z

    try out parquet case insensitive fallback

commit 198dd9457fad08516f65ea1bcfa6edf4af17d948
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-11T17:53:13Z

    Refactor the FileSourceScanExec.metadata val to make it prettier

commit acc84f07f53d3c87c5637636e69b1c564421484a
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-11T19:00:43Z

    Refactor `TableFileCatalog.listFiles` to call `listDataLeafFiles` once
    instead of once per partition

commit 59de5ca2c8b209a190dc0c6082fc6e2d2de0096b
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T23:03:18Z

    fix and add test for input files

commit 3b51624263cfcedd3e51b71342b940592a5f6118
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T23:09:06Z

    rename test

commit f94863dd386a8654986a1fde09e5d87ded97a6e3
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-13T01:09:02Z

    fix it

commit 0958bcd8f088d5641fc78952b8265ce05232c3f9
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-12T20:20:11Z

    feature flag

commit 291cee788e1bcc3ecbd7b1a4187f8eba58e134fb
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-12T22:48:03Z

    add comments

commit 022d5b9873018dad8ac08646704f567176977877
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-13T01:26:23Z

    more test cases

commit 8bd27be814f7721f3764364c72b33c7f67e0e9ff
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-13T01:46:41Z

    also fix a bug with zero partitions selected

commit 627572e0020d313a9c1378349e2ee4ab0d0e97f1
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-13T17:30:48Z

    extend and fix flakiness in test

commit 6d8e7ea9f904e33af4ca7372f5b31379aede9308
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-13T17:55:26Z

    Enhance `ParquetMetastoreSuite` with mixed-case partition columns

commit 21caa932a157ec3dd394829061b06bd3d857de0f
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-13T18:29:25Z

    Tidy up a little by removing some unused imports, an unused method and
    moving a protected method down and making it private

commit d7795cd0f3bc517bdf278e626ca25ce08ea23bcb
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-13T18:44:15Z

    Put partition count in `FileSourceScanExec.metadata` for partitioned
    tables

commit 765f93ce664ef33c1c62bf80b678ff5ba2992b85
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-13T20:48:33Z

    Fix some errors in my revision of `ParquetSourceSuite`

commit e1635e4570c0e4b892b93d1ac1e71d52d5a4f66b
Author: Eric Liang <ekhli...@gmail.com>
Date:   2016-10-14T01:24:31Z

    Add metrics and cost tests for partition pruning effectiveness (#5)
    
    * [SPARK-16980][SQL] Load only catalog table partition metadata required
    to answer a query
    
    * Add a new catalyst optimizer rule to SQL core for pruning unneeded
    partitions' files from a table file catalog
    
    * Include the type of file catalog in the FileSourceScanExec metadata
    
    * TODO: Consider renaming FileCatalog to better differentiate it from
    BasicFileCatalog (or vice-versa)
    
    * try out parquet case insensitive fallback
    
    * Refactor the FileSourceScanExec.metadata val to make it prettier
    
    * fix and add test for input files
    
    * rename test
    
    * Refactor `TableFileCatalog.listFiles` to call `listDataLeafFiles` once
    instead of once per partition
    
    * fix it
    
    * more test cases
    
    * also fix a bug with zero partitions selected
    
    * feature flag
    
    * add comments
    
    * extend and fix flakiness in test
    
    * Enhance `ParquetMetastoreSuite` with mixed-case partition columns
    
    * Tidy up a little by removing some unused imports, an unused method and
    moving a protected method down and making it private
    
    * Put partition count in `FileSourceScanExec.metadata` for partitioned
    tables
    
    * Fix some errors in my revision of `ParquetSourceSuite`
    
    * Thu Oct 13 17:18:14 PDT 2016
    
    * more generic
    
    * Thu Oct 13 18:09:42 PDT 2016
    
    * Thu Oct 13 18:09:55 PDT 2016
    
    * Thu Oct 13 18:22:31 PDT 2016

commit 71049d130e89aedba75e8875d8fde7620d6a55e2
Author: Eric Liang <ekhli...@gmail.com>
Date:   2016-10-14T02:27:01Z

    Actually register the hive catalog metrics, also revert broken tests (#6)
    
    * Thu Oct 13 19:02:36 PDT 2016
    
    * Thu Oct 13 19:03:06 PDT 2016

commit 6a63afd156d4806122b9ad0c2593de69a0ae790c
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-14T21:04:01Z

    Fri Oct 14 14:04:01 PDT 2016

commit 6b02b3c36b3c1f99695262f9d60fe2aaaf25c5bc
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-14T21:35:10Z

    [SPARK-16980][SQL] Load only catalog table partition metadata required
    to answer a query

commit e816919fe8b4cd06cc91fb373e8e55f7c18e99b6
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-09-13T01:21:38Z

    Add a new catalyst optimizer rule to SQL core for pruning unnecessary
    partition data from a HadoopFsRelation's file catalog

commit 8cca6dc02847eb04740ec1ed5d29920b4f2f0030
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-08T00:15:11Z

    Include the type of file catalog in the FileSourceScanExec metadata

commit 7acc3f1072ece6b2e5f5324ff84bbcbeae487ef2
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T02:54:53Z

    try out parquet case insensitive fallback

commit cf7d1f15e0045cbd12c81a39138e7c3439c611d7
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-11T17:53:13Z

    Refactor the FileSourceScanExec.metadata val to make it prettier

commit c75855c0615d88001a83c03a9515a9b1fff0b241
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T23:03:18Z

    fix and add test for input files

commit 821372f2fdc09ebd882bb6958bed24a42738235c
Author: Eric Liang <e...@databricks.com>
Date:   2016-10-11T23:09:06Z

    rename test

commit d0b893ba5c45db32aad640ea6732a8803c054f07
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-10-11T19:00:43Z

    Refactor `TableFileCatalog.listFiles` to call `listDataLeafFiles` once
    instead of once per partition

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
