Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3121#issuecomment-61935132
This SQL situation is going to be a huge mess to fix, and I fear that I'll
wind up having to touch nearly every line of test code (or at least a
substantial fraction of it).
Part of the problem is that there are also global `TestJsonData`,
`TestData`, and `ParquetTestData` classes that hold RDDs created using
`TestSQLContext`. These classes' fields are statically imported into most of
the test suites, so there are tons of references that would need to be updated.
There are a few code smells that suggest that this approach might have caused
other problems, such as these lines at the top of some test suites to ensure
that the test data is initialized:
```scala
class CachedTableSuite extends QueryTest {
  TestData // Load test tables.
```

```scala
class InMemoryColumnarQuerySuite extends QueryTest {
  // Make sure the tables are loaded.
  TestData
```
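For context, those objects follow roughly this pattern (simplified; the bodies
below are illustrative rather than the exact Spark code). Because the fields
are `val`s and table registration happens as a side effect of object
initialization, a bare `TestData` reference is what forces the tables to exist:
```scala
object TestData {
  import org.apache.spark.sql.test.TestSQLContext

  case class KV(key: Int, value: String)

  // Evaluated once, against whichever SparkContext/SQLContext existed when
  // the object was first touched.
  val testData = TestSQLContext.sparkContext.parallelize(
    (1 to 100).map(i => KV(i, i.toString)))
  TestSQLContext.createSchemaRDD(testData).registerTempTable("testData")
}
```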
I want to avoid rewriting the bulk of the actual test logic. One approach
is to leave these as global objects but convert their fields to `def`s and add
an `initialize()` method for reconstructing the global objects' states when a
new SQLContext is created. This isolates most of the changes to those test
data classes. We could create a test suite mixin that calls these
`initialize()` methods from `beforeAll()`. If we combine this with a few small
changes to remove the `TestSQLContext` global object and replace it with a
`SharedSQLContext` trait, then I think we can enable cleanup of the SQL tests'
`SparkContext` with fairly minimal impact on the actual test code itself.
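Roughly, I'm imagining something like the following. To be clear,
`SharedSQLContext`, `initialize()`, and the bodies here are only a sketch of
the idea, not code that exists yet:
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch: mixin that owns the SQLContext lifecycle for a suite.
trait SharedSQLContext extends BeforeAndAfterAll { self: Suite =>
  protected var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("sql-test"))
    sqlContext = new SQLContext(sc)
    TestData.initialize(sqlContext)  // rebuild the shared tables against the new context
  }

  override def afterAll(): Unit = {
    try sqlContext.sparkContext.stop()  // the cleanup this PR is trying to enable
    finally super.afterAll()
  }
}

// Sketch: the global object stays, but its RDD-valued fields become defs that
// read whichever context initialize() last installed.
object TestData {
  case class KV(key: Int, value: String)

  private var ctx: SQLContext = _

  def testData = ctx.sparkContext.parallelize(
    (1 to 100).map(i => KV(i, i.toString)))

  def initialize(sqlContext: SQLContext): Unit = {
    ctx = sqlContext
    ctx.createSchemaRDD(testData).registerTempTable("testData")
    // ... register the remaining tables the same way ...
  }
}
```
Since the fields stay on the same object and just become `def`s, the existing
`import TestData._` statements and call sites should keep compiling unchanged.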
Perhaps this design using global objects that are shared across test suites
was motivated by performance concerns, since it might be expensive to create a
bunch of tables. I think that the right approach to addressing test
performance is through coarse-grained parallelization by running individual
test suites in separate JVMs, since global shared state can be confusing.
Also, I think the performance impact might be fairly minimal: we'd only be
re-initializing once per suite, not once per test.
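For what it's worth, if we did go the separate-JVM route later, the build
change would be small. Something like this sbt fragment (illustrative only,
0.13-style syntax, not part of this PR) runs each suite in its own forked JVM:
```scala
// Illustrative only: give every suite its own forked JVM so that global
// state (SparkContexts, registered tables) can't leak between suites.
testGrouping in Test := (definedTests in Test).value.map { test =>
  Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(ForkOptions()))
}
```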
@marmbrus, do you have any feedback here? Is there a cleaner way to enable
cleanup of the SparkContext that the SQL tests create?