Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3121#issuecomment-61935132
This SQL situation is going to be a huge mess to fix, and I fear that I'll
wind up having to touch nearly every line of test code (or at least a
substantial fraction of it).
Part of the problem is that there are also global `TestJsonData`,
`TestData`, and `ParquetTestData` classes that hold RDDs created using
`TestSQLContext`. These classes' fields are statically imported into most of
the test suites, so there are tons of references that would need to be updated.
There are a few code smells that suggest that this approach might have caused
other problems, such as these lines at the top of some test suites to ensure
that the test data is initialized:
```scala
class CachedTableSuite extends QueryTest {
  TestData // Load test tables.
```

```scala
class InMemoryColumnarQuerySuite extends QueryTest {
  // Make sure the tables are loaded.
  TestData
```
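For context, those objects follow roughly this pattern (simplified; the bodies
below are illustrative rather than the exact Spark code). Because the fields
are `val`s and table registration happens as a side effect of object
initialization, a bare `TestData` reference is what forces the tables to exist:
```scala
object TestData {
  import org.apache.spark.sql.test.TestSQLContext

  case class KV(key: Int, value: String)

  // Evaluated once, against whichever SparkContext/SQLContext existed when
  // the object was first touched.
  val testData = TestSQLContext.sparkContext.parallelize(
    (1 to 100).map(i => KV(i, i.toString)))
  TestSQLContext.createSchemaRDD(testData).registerTempTable("testData")
}
```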
I want to avoid rewriting the bulk of the actual test logic. One approach
is to leave these as global objects but convert their fields to `def`s and add
an `initialize()` method for reconstructing the global objects' states when a
new SQLContext is created. This isolates most of the changes to those test
data classes. We could create a test suite mixin that calls these
`initialize()` methods from `beforeAll()`. If we combine this with a few small
changes to remove the `TestSQLContext` global object and replace it with a
`SharedSQLContext` trait, then I think we can enable cleanup of the SQL tests'
`SparkContext` with fairly minimal impact on the actual test code itself.
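Roughly, I'm imagining something like the following. To be clear,
`SharedSQLContext`, `initialize()`, and the bodies here are only a sketch of
the idea, not code that exists yet:
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch: mixin that owns the SQLContext lifecycle for a suite.
trait SharedSQLContext extends BeforeAndAfterAll { self: Suite =>
  protected var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("sql-test"))
    sqlContext = new SQLContext(sc)
    TestData.initialize(sqlContext)  // rebuild the shared tables against the new context
  }

  override def afterAll(): Unit = {
    try sqlContext.sparkContext.stop()  // the cleanup this PR is trying to enable
    finally super.afterAll()
  }
}

// Sketch: the global object stays, but its RDD-valued fields become defs that
// read whichever context initialize() last installed.
object TestData {
  case class KV(key: Int, value: String)

  private var ctx: SQLContext = _

  def testData = ctx.sparkContext.parallelize(
    (1 to 100).map(i => KV(i, i.toString)))

  def initialize(sqlContext: SQLContext): Unit = {
    ctx = sqlContext
    ctx.createSchemaRDD(testData).registerTempTable("testData")
    // ... register the remaining tables the same way ...
  }
}
```
Since the fields stay on the same object and just become `def`s, the existing
`import TestData._` statements and call sites should keep compiling unchanged.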
Perhaps this design using global objects that are shared across test suites
was motivated by performance concerns, since it might be expensive to create a
bunch of tables. I think that the right approach to addressing test
performance is through coarse-grained parallelization by running individual
test suites in separate JVMs, since global shared state can be confusing.
Also, I think the performance impact might be fairly minimal: we'd only be
re-initializing once per suite, not once per test.
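For what it's worth, if we did go the separate-JVM route later, the build
change would be small. Something like this sbt fragment (illustrative only,
0.13-style syntax, not part of this PR) runs each suite in its own forked JVM:
```scala
// Illustrative only: give every suite its own forked JVM so that global
// state (SparkContexts, registered tables) can't leak between suites.
testGrouping in Test := (definedTests in Test).value.map { test =>
  Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(ForkOptions()))
}
```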
@marmbrus, do you have any feedback here? Is there a cleaner way to enable
cleanup of the SparkContext that the SQL tests create?