Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2352#issuecomment-55440621

Got more clues on this, which explain why `HiveQuerySuite` didn't fail previously. (But @chenghao-intel, why does it fail on your side? Still mysterious.) Basically, we were jumping between two different sets of local metastore/warehouse directories while testing. The detailed process is:

1. When the `TestHive` singleton object is instantiated, we create a pair of temporary directories and [configure them as the local testing metastore/warehouse](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L80). Let's abbreviate them as `m1` and `w1`. At this point, these two directories are created but remain empty; the default Hive database will be created lazily later.
1. Then `HiveQuerySuite` gets started. Whenever a test case created via `HiveComparisonTest.createQueryTest` is executed, we first [execute a `SHOW TABLES` command](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L253-L254) (notice the "MINOR HACK" comment). The important thing that happens here is that the "default" database gets created lazily in `m1` at this point.
1. Then [`reset()` is called](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L255). Within `reset()`, first of all, we execute a Hive `RESET` command, which sets all configurations back to their default values, including `javax.jdo.option.ConnectionURL` and `hive.metastore.warehouse.dir`. This means the metastore is reset to the `metastore_db` directory under the current working directory, and the warehouse is reset to `/user/hive/warehouse` (which usually doesn't exist).
1. Then follows the `getAllTables` call, which is used to delete all tables in the "default" database. During the `getAllTables` call, the `metastore_db` directory is created if it isn't there, and again, Hive lazily creates an empty "default" database in it. Hmm... wait, so here we end up with two "default" databases, one in `m1` and another in `metastore_db`! As a result, [these lines](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L389-L405) always end up trying to clean up tables and databases under the newly created `metastore_db` directory, which is empty.
1. At last, we call `configure()` again, which sets the metastore/warehouse directories back to `m1`/`w1`, both of which remain intact.

In a word, the TL;DR here is: previously, testing databases and testing tables created by test suites inheriting from `HiveComparisonTest` never really got cleaned up, and the "MINOR HACK" perfectly covered up what is probably the oldest bug in the history of Spark SQL! By applying this PR, we should be able to remove this hack safely.
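The directory jumping above can be sketched without any real Hive dependencies. This is a minimal, hypothetical simulation: `MetastoreJumpDemo` and its map-based "metastore" are scaffolding invented for illustration; only the two property keys and the `m1`/`w1`/`metastore_db` names come from the steps described here.

```scala
import scala.collection.mutable

// Hypothetical model of the bug: a session "configuration" plus a map of
// metastore directories, each lazily created on first access (mimicking how
// Hive lazily creates the "default" database).
object MetastoreJumpDemo {
  private val conf = mutable.Map[String, String]()

  // Simulated on-disk state: metastore directory name -> table names in it.
  private val metastores = mutable.Map[String, mutable.Set[String]]()

  // Stands in for TestHive's configure(): point Hive at the temp dirs.
  def configure(): Unit = {
    conf("javax.jdo.option.ConnectionURL") = "jdbc:derby:;databaseName=m1"
    conf("hive.metastore.warehouse.dir") = "w1"
  }

  // Stands in for Hive's RESET command: every property reverts to default.
  def reset(): Unit = {
    conf("javax.jdo.option.ConnectionURL") =
      "jdbc:derby:;databaseName=metastore_db"
    conf("hive.metastore.warehouse.dir") = "/user/hive/warehouse"
  }

  // Any metastore access lazily creates the directory (and its "default"
  // database) at whichever location ConnectionURL currently points to.
  def currentTables: mutable.Set[String] = {
    val dir = conf("javax.jdo.option.ConnectionURL")
      .split("databaseName=").last
    metastores.getOrElseUpdate(dir, mutable.Set.empty[String])
  }

  def main(args: Array[String]): Unit = {
    configure()               // step 1: metastore = m1, warehouse = w1
    currentTables += "src"    // step 2: SHOW TABLES hack populates m1
    reset()                   // step 3: RESET reverts to metastore_db
    // step 4: cleanup only "sees" the freshly created, empty metastore_db
    println(s"tables visible to cleanup: ${currentTables.size}")
    configure()               // step 5: back to m1, which was never cleaned
    println(s"tables left behind in m1: ${currentTables.size}")
  }
}
```

Running the five steps in order shows the cleanup pass inspecting an empty `metastore_db` while `m1` keeps its tables untouched, which is exactly why the leftover state went unnoticed.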