[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615091#comment-16615091
 ] 

Imran Rashid commented on SPARK-25344:
--------------------------------------

{quote}
1. When to create a separate test file, for each module? and how to name? e.g. 
"test_rdd.py"
{quote}

The general principles I'd aim for are:
(1) avoid giant test files
(2) test files should have names that indicate what they're testing
(3) whenever possible, the test module name should mirror the name of the main 
code it's testing

I wouldn't expect the rules to be super-firm.  It's a judgement call when there 
isn't a perfect correspondence, e.g. when a test specifically tests the 
integration between modules.  These guidelines are super-vague; they still 
rely on developers to use their judgement.
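
For the simple case where a test file does correspond to one source module, 
here is a rough sketch of the mapping in principle (3).  The helper name, the 
"test_" prefix and the "tests" subdir are assumptions for illustration only, 
not an existing convention or tool:

```python
import posixpath


def mirrored_test_path(source_path, tests_dir="tests"):
    """Return a hypothetical test file path that mirrors a source module.

    Assumes tests live in a `tests_dir` subdirectory next to the source
    and are named "test_<module>.py"; both choices are illustrative.
    """
    directory, filename = posixpath.split(source_path)
    return posixpath.join(directory, tests_dir, "test_" + filename)


print(mirrored_test_path("pyspark/rdd.py"))
```

Under those assumptions, "pyspark/rdd.py" maps to 
"pyspark/tests/test_rdd.py".  Integration tests that span several modules 
would not fit this mapping and would still need a name picked by hand.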

We don't have to get this perfect to start making improvements.  The current 
organization seems obviously wrong.

{quote}
2. Where to put the test files? same dir as source or subdir named "tests"
{quote}

"tests" subdir makes sense to me, though I don't have a really strong opinion.

{quote}
3. Start splitting tests immediately as new tests are written? Incrementally as 
subtasks in this JIRA?
{quote}

I feel strongly the answer should be *both*.  If we keep adding tests to the 
giant files, then the work for this JIRA just keeps increasing.  It seems 
pretty easy to have new tests just go in the right place when they're added.  
At *some* point, you're going to have to tell folks to start adding to new 
files instead of the existing giant one -- if you don't do it now, when will 
you?  Wait till everything is moved?  Then you've got a moving target that's 
tough to hit.  Furthermore, you'll have races to merge between newly added 
tests that modify tests.py and the changes for this JIRA that pull the tests 
out.  They might only be trivial conflicts, but we might as well avoid them.

The only downside I see is that in the interim, developers will have to look in 
multiple files for tests.  But if we're agreeing to break things into smaller 
files, then they'll have to do that anyway, right?

> Break large tests.py files into smaller files
> ---------------------------------------------
>
>                 Key: SPARK-25344
>                 URL: https://issues.apache.org/jira/browse/SPARK-25344
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py not work as well.  On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.  
> There are similarly large files in other pyspark modules, e.g. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least some of these files are already broken into 
> independent test classes, so it shouldn't be too hard to just move them into 
> their own files.
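
The parallelization point in the description can be illustrated with a small 
simulation.  This is only a sketch: it assumes a greedy longest-first 
scheduler, which is not necessarily what run-tests.py does.  Only the 150s and 
20s figures come from the description; the number of files, the worker count 
and the 30s pieces are made up.

```python
import heapq


def wall_time(durations, workers):
    """Estimate wall-clock time for running test files in parallel.

    Greedy longest-first: each file is handed to whichever worker
    becomes free first. An assumed model, not run-tests.py itself.
    """
    loads = [0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        # Pop the least-loaded worker and give it the next file.
        heapq.heappush(loads, heapq.heappop(loads) + duration)
    return max(loads)


# One 150s file plus three 20s files (hypothetical counts), 4 workers.
before = [150, 20, 20, 20]
# The same work with the 150s file split into five 30s files.
after = [30] * 5 + [20] * 3

print(wall_time(before, 4), wall_time(after, 4))
```

With these made-up inputs the estimate drops from 150s to 60s: before the 
split, one worker is stuck with the giant file while the others sit idle.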



