[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

mallman Tue, 21 Jun 2016 14:17:35 -0700

GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/13818


    [SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached

    (Please note this is a revision of PR #13686, which has been closed in 
favor of this PR.)
    
    ## What changes were proposed in this pull request?
    
    The `getCached` method of `HiveMetastoreCatalog` computes 
`pathsInMetastore` from the metastore relation's catalog table. This only 
returns the table base path, which is incomplete/inaccurate for a nonempty 
partitioned table. As a result, cached lookups on nonempty partitioned tables 
always miss.
    
    Rather than get `pathsInMetastore` from
    
        metastoreRelation.catalogTable.storage.locationUri.toSeq
    
    I modified the `getCached` method to take a `pathsInMetastore` argument. 
Calls to this method pass in the paths computed from calls to the Hive 
metastore. This is how `getCached` was implemented in Spark 1.5:
    
    
https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.
    
    I also added a call in `InsertIntoHiveTable.scala` to invalidate the table 
from the SQL session catalog.
    
    ## How was this patch tested?
    
    I've added a new unit test to `parquetSuites.scala`:
    
        SPARK-15968: nonempty partitioned metastore Parquet table lookup should 
use cached relation
    
    Note that the only difference between this new test and the one above it in 
the file is that the new test populates its partitioned table with a single 
value, while the existing test leaves the table empty. This reveals a subtle, 
unexpected hole in test coverage present before this patch.
    
    Note I also modified a different but related unit test in 
`parquetSuites.scala`:
    
        SPARK-15248: explicitly added partitions should be readable
    
    This unit test asserts that Spark SQL should return data from a table 
partition which has been placed there outside a metastore query immediately 
after it is added. I changed the test so that, instead of adding the data as a 
parquet file saved in the partition's location, the data is added through a SQL 
`INSERT` query. I made this change because I could find no way to efficiently 
support partitioned table caching without failing that test.
    
    In addition to my primary motivation, I can offer a few reasons I believe 
this is an acceptable weakening of that test. First, it still validates a fix 
for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue 
for which it was written. Second, the assertion made is stronger than that 
required for non-partitioned tables. If you write data to the storage location 
of a non-partitioned metastore table without using a proper SQL DML query, a 
subsequent call to show that data will not return it. I believe this is an 
intentional limitation put in place to make table caching feasible, but I'm 
only speculating.
    
    Building a large `HadoopFsRelation` requires `stat`-ing all of its data 
files. In our environment, where we have tables with 10's of thousands of 
partitions, the difference between using a cached relation versus a new one is 
a matter of seconds versus minutes. Caching partitioned table metadata vastly 
improves the usability of Spark SQL in this cases.
    
    Thanks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13818
    
----
commit 8a058c65c6c20e311bde5c0ade87c14c6b6b5f37
Author: Michael Allman <[email protected]>
Date:   2016-06-15T16:52:17Z

    [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
    partitioned metastore relation when searching the internal table cache
    
    The `getCached` method of `HiveMetastoreCatalog` computes
    `pathsInMetastore` from the metastore relation's catalog table. This
    only returns the table base path, which is not correct for nonempty
    partitioned tables. As a result, cached lookups on nonempty partitioned
    tables always miss.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Reply via email to