GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/16500

    [SPARK-19120] [SPARK-19121] Refresh Metadata Cache After Loading Partitioned Hive Tables

    ### What changes were proposed in this pull request?
    ```Scala
    sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")

    // This table fetch fills the metadata cache with zero leaf files
    spark.table("tab").show()

    sql(
      s"""
         |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
         |INTO TABLE tab
       """.stripMargin)

    spark.table("tab").show()
    ```
    
    In the above example, the result returned after loading the table is empty. The
metadata cache can be out of date after new data is loaded into the table,
because loading/inserting does not update the cache. So far, the metadata cache
is only used for data source tables. Thus, only the `parquet` and `orc` formats
face this issue, because Hive tables are converted to data source tables when
`spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc`
is on.
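
    For illustration, these are the session configs that control the conversion (a sketch; the default values differ across Spark versions, so check your release):

    ```Scala
    // When these flags are on, Hive parquet/orc tables are converted to
    // data source tables on the read path, and their file listings are cached.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    ```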
    
    This PR refreshes the metadata cache after processing the `LOAD DATA`
command; see the workaround sketch below for what users have to do today.
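
    As a user-level workaround until this fix is picked up, the cache can be invalidated manually after the load, using the public `REFRESH TABLE` command or the equivalent `Catalog.refreshTable` API:

    ```Scala
    // Either form drops the cached metadata/file listing for the table,
    // forcing the next scan to re-list the table's files.
    sql("REFRESH TABLE tab")
    // or, equivalently:
    spark.catalog.refreshTable("tab")
    ```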
    
    In addition, Spark SQL does not convert **partitioned** Hive tables
(orc/parquet) to data source tables on the write path, even though the read
path uses the metadata cache for both **partitioned** and non-partitioned Hive
tables (orc/parquet). That means writing to a partitioned parquet/orc table
still uses `InsertIntoHiveTable` instead of
`InsertIntoHadoopFsRelationCommand`. To avoid reading an out-of-date cache,
`InsertIntoHiveTable` needs to refresh the metadata cache for partitioned
tables, as sketched below. Note that no refresh is needed for non-partitioned
parquet/orc tables, because writing to them does not go through
`InsertIntoHiveTable` at all.
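
    A sketch of the partitioned-table scenario described above (table and partition names are illustrative):

    ```Scala
    sql("CREATE TABLE part_tab (a STRING) PARTITIONED BY (p INT) STORED AS PARQUET")

    // The read path converts the table to a data source relation and
    // fills the metadata cache.
    spark.table("part_tab").show()

    // The write path for partitioned Hive tables still goes through
    // InsertIntoHiveTable, which did not refresh the cache before this PR.
    sql("INSERT OVERWRITE TABLE part_tab PARTITION (p = 1) SELECT 'a'")

    // Without the refresh, this read may hit the stale cache and miss
    // the newly written partition.
    spark.table("part_tab").show()
    ```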
    
    ### How was this patch tested?
    Added test cases in `parquetSuites.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark refreshInsertIntoHiveTable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16500.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16500
    
----
commit ea91cb077e4ea307020dc5b3b4ffe9a0d2a4dc88
Author: gatorsmile <[email protected]>
Date:   2017-01-07T18:59:52Z

    fix.

commit b7013c2853bf993d82b88c4a605ce921d4593ebe
Author: gatorsmile <[email protected]>
Date:   2017-01-07T22:55:25Z

    fix.

commit 27fab56bdac74dac2d7dbd36db4c240d35c89dac
Author: gatorsmile <[email protected]>
Date:   2017-01-08T00:06:47Z

    fix.

commit 0f70e912402e118a79f48db38c1697baf6905cde
Author: gatorsmile <[email protected]>
Date:   2017-01-08T00:33:20Z

    more test cases.

----

