Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55440621
  
    Got more clues on this, which explain why `HiveQuerySuite` didn't fail 
previously. (But @chenghao-intel, why does it fail on your side? Still 
mysterious.) Basically, we were jumping between two different sets of local 
metastore/warehouse directories while testing. The detailed process is:
    
    1. When the `TestHive` singleton object is instantiated, we create a pair 
of temporary directories and [configure them as local testing 
metastore/warehouse](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L80).
 Let's abbreviate them as `m1` and `w1`.
    
       At this point, these two directories are created, but remain empty. 
The default Hive database will be created lazily later.
    
    1. Then `HiveQuerySuite` gets started. Whenever a test case created via 
`HiveComparisonTest.createQueryTest` is executed, we first [execute a `SHOW 
TABLES` 
command](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L253-L254)
 (notice the "MINOR HACK" comment).
    
       An important thing that happens here is that the "default" database 
gets lazily created in `m1`.
    
    1. Then [`reset()` is 
called](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L255).
    
       Within `reset()`, first of all, we execute a Hive `RESET` command, which 
sets all configurations back to their default values, including 
`javax.jdo.option.ConnectionURL` and `hive.metastore.warehouse.dir`. This 
implies the metastore is reset to the `metastore_db` directory under the 
current working directory, and the warehouse is reset to 
`/user/hive/warehouse` (which usually doesn't exist).
    
    1. Then follows the `getAllTables` call, which is used to delete all tables 
in the "default" database.
    
       During the `getAllTables` call, the `metastore_db` directory is created 
if it's not there, and again, Hive lazily creates an empty "default" database 
in it. Hmm... wait, so here we end up with two "default" databases, one in 
`m1` and another in `metastore_db`! As a result, [these 
lines](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L389-L405)
 are actually always trying to clean up tables and databases under the newly 
created `metastore_db` directory, which is empty.
    
    1. Finally, we call `configure()` again and set the metastore/warehouse 
directories back to `m1`/`w1`, which remain intact.
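
    The steps above can be sketched with a small, self-contained simulation. 
This is *not* the actual Hive/Spark API — the object, helper names, and the 
config-map model are all hypothetical — but it shows how flipping 
`javax.jdo.option.ConnectionURL` via `RESET` makes the cleanup look at the 
wrong (empty) metastore:

    ```scala
    // Hypothetical sketch: models the metastore "jumping" described above.
    object MetastoreResetSketch {
      // Hive's built-in defaults restored by the `RESET` command.
      val defaults = Map(
        "javax.jdo.option.ConnectionURL" ->
          "jdbc:derby:;databaseName=metastore_db;create=true",
        "hive.metastore.warehouse.dir" -> "/user/hive/warehouse")

      val conf = scala.collection.mutable.Map[String, String]()
      // Which tables live in which simulated metastore directory.
      val tablesIn = scala.collection.mutable.Map[String, Set[String]]()

      // Stand-in for TestHive's configure(): point at temp dirs m/w.
      def configure(m: String, w: String): Unit = {
        conf("javax.jdo.option.ConnectionURL") =
          s"jdbc:derby:;databaseName=$m;create=true"
        conf("hive.metastore.warehouse.dir") = w
      }

      def currentMetastore: String =
        conf("javax.jdo.option.ConnectionURL")
          .split("databaseName=")(1).split(";")(0)

      def createTable(name: String): Unit =
        tablesIn(currentMetastore) =
          tablesIn.getOrElse(currentMetastore, Set()) + name

      // Hive `RESET`: every configuration back to its default value.
      def hiveReset(): Unit = conf ++= defaults

      // Lazily "creates" an empty default database if none exists yet.
      def getAllTables(): Set[String] =
        tablesIn.getOrElse(currentMetastore, Set())

      def main(args: Array[String]): Unit = {
        configure("m1", "w1")       // step 1: TestHive points at temp dirs
        createTable("src")          // step 2: a test case creates a table in m1
        hiveReset()                 // step 3: RESET switches to metastore_db
        val toDrop = getAllTables() // step 4: sees the *empty* metastore_db
        println(s"tables to drop: $toDrop")
        configure("m1", "w1")       // step 5: back to m1; `src` was never dropped
        println(s"leaked tables: ${getAllTables()}")
      }
    }
    ```

    Running it prints an empty drop list followed by the leaked `src` table, 
which is exactly the "cleanup that never cleaned anything" behavior described 
above.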
    
    TL;DR: previously, test databases and test tables created by test suites 
inheriting from `HiveComparisonTest` never really got cleaned up, and the 
"MINOR HACK" perfectly covered up what is probably the oldest bug in the 
history of Spark SQL! By applying this PR, we should be able to remove this 
hack safely.

