Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/2352#issuecomment-55440621
Got more clues on this, which explain why `HiveQuerySuite` didn't fail
previously. (But @chenghao-intel, why did it fail on your side? Still a mystery.)
Basically, we were jumping between two different sets of local
metastore/warehouse directories while testing. The detailed process is:
1. When the `TestHive` singleton object is instantiated, we create a pair
of temporary directories and [configure them as local testing
metastore/warehouse](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L80).
Let's abbreviate them as `m1` and `w1`.
At this point, these two directories are created but remain empty.
The default Hive database will be created lazily later.
1. Then `HiveQuerySuite` starts. Whenever a test case created via
`HiveComparisonTest.createQueryTest` is executed, we first [execute a `SHOW
TABLES`
command](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L253-L254)
(notice the "MINOR HACK" comment).
An important thing that happens here is that the "default" database gets
created lazily in `m1` at this point.
1. Then [`reset()` is
called](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L255).
Within `reset()`, first of all, we execute a Hive `RESET` command, which
sets all configurations back to their default values, including
`javax.jdo.option.ConnectionURL` and `hive.metastore.warehouse.dir`. This
implies that the metastore is reset to the `metastore_db` directory under the
current working directory and the warehouse is reset to `/user/hive/warehouse`
(which usually doesn't exist).
1. Then follows the `getAllTables` call, which is used to delete all tables
in the "default" database.
During the `getAllTables` call, the `metastore_db` directory is created
if it doesn't exist, and again, Hive lazily creates an empty "default" database
in it. Hmm... wait, so here we end up with two "default" databases, one in
`m1` and another in `metastore_db`! As a result, [these
lines](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L389-L405)
always try to clean up tables and databases under the newly
created `metastore_db` directory, which is empty.
1. At last, we call `configure()` again, which sets the metastore/warehouse
directories back to `m1`/`w1`, which remain intact.
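The sequence above can be sketched as a small, self-contained simulation. Everything here is illustrative (`FakeMetastore`, `leakedTables`, and the table name `src` are made up, not Spark/Hive APIs): each `FakeMetastore` models one local metastore directory whose "default" database is populated lazily, and `current` models which directory the active Hive configuration points to.

```scala
import scala.collection.mutable

// Illustrative sketch of the bug: reset()'s cleanup enumerates tables from
// whichever metastore the *current* configuration points to, which after a
// Hive RESET is the fresh, empty metastore_db rather than m1.
object MetastoreBugDemo {
  class FakeMetastore {
    // The "default" database is populated lazily, mirroring Hive's behavior.
    private val defaultDb = mutable.Set.empty[String]
    def createTable(name: String): Unit = defaultDb += name
    def getAllTables: Set[String] = defaultDb.toSet
    def dropTable(name: String): Unit = defaultDb -= name
  }

  def leakedTables(): Set[String] = {
    val m1 = new FakeMetastore           // temp metastore set up by TestHive
    val metastoreDb = new FakeMetastore  // ./metastore_db, used after RESET

    var current = m1                     // configure(): point at m1
    current.createTable("src")           // a test suite creates a table in m1

    current = metastoreDb                // Hive RESET reverts the config
    // reset() now "cleans up" the current metastore, which is empty:
    current.getAllTables.foreach(current.dropTable)

    current = m1                         // configure() points back at m1
    current.getAllTables                 // the table was never dropped
  }
}
```

Running `MetastoreBugDemo.leakedTables()` shows the table created in `m1` surviving the "cleanup", exactly as in steps 3 and 4 above.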
In a word, the TL;DR here is: previously, testing databases and testing
tables created by test suites inheriting from `HiveComparisonTest` never really
got cleaned up, and the "MINOR HACK" perfectly covered up probably the oldest
bug in the history of Spark SQL! By applying this PR, we should be able to
remove this hack safely.
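For contrast, here is an equally illustrative sketch (again, not a real Spark/Hive API; `FixedResetDemo` and the table name `src` are made up) of why the hack becomes removable: once cleanup enumerates tables against the *right* metastore, i.e. after the `m1` configuration has been re-applied, the drop actually takes effect and nothing leaks between suites.

```scala
import scala.collection.mutable

// Illustrative sketch of the fixed ordering: re-apply the m1 configuration
// first, then enumerate and drop tables, so cleanup hits the real metastore.
object FixedResetDemo {
  def remainingTables(): Set[String] = {
    val m1 = mutable.Set("src")          // tables a suite left behind in m1
    val current = m1                     // config re-applied before cleanup
    current.toSet.foreach(current -= _)  // drop every table in "default"
    current.toSet                        // now empty
  }
}
```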