[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25344:
---------------------------------
    Description: 
We've got a ton of tests in one humongous tests.py file, rather than breaking 
it out into smaller files.

Having one huge file doesn't seem great for code organization, and it also 
makes the test parallelization in run-tests.py not work as well. On my laptop, 
tests.py takes 150s, and the next longest test file takes only 20s. There are 
similarly large files in other PySpark modules, e.g. sql/tests.py, ml/tests.py, 
mllib/tests.py, streaming/tests.py.
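
To illustrate why one long file limits parallelization, here is a minimal
sketch of a greedy longest-first scheduler. It is not the logic of
run-tests.py, only a model of the effect. The 150s and 20s durations come from
the numbers above; the other durations, the worker count, and the ten-way split
are assumptions made up for the example.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each file goes to the least-loaded worker.

    Files are handed out longest first, which is a common greedy heuristic.
    """
    loads = [0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# 150s and 20s are from the issue text; 15, 10 and 5 are hypothetical.
monolithic = [150, 20, 15, 10, 5]

# Hypothetical split: the 150s file becomes ten files of 15s each.
split = [15] * 10 + [20, 15, 10, 5]

print(makespan(monolithic, workers=4))
print(makespan(split, workers=4))
```

With the single 150s file, the total run can never finish in under 150s no
matter how many workers are available. After the hypothetical split, the same
amount of work spreads much more evenly across the workers.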

It seems that at least some of these files are already broken into 
independent test classes, so it shouldn't be too hard to just move them into 
their own files.
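
As a rough way to see which classes could move, the sketch below lists the
top-level classes in a module using Python's ast module. The sample source and
its class names are hypothetical stand-ins, not the real contents of tests.py.

```python
import ast

# Hypothetical stand-in for the contents of a large tests.py file.
SAMPLE_SOURCE = '''
import unittest


class FirstTests(unittest.TestCase):
    def test_length(self):
        self.assertEqual(len([1, 2, 3]), 3)


class SecondTests(unittest.TestCase):
    def test_truth(self):
        self.assertTrue(True)


def helper():
    pass
'''


def top_level_classes(source):
    """Return the names of classes defined at module level, in source order."""
    tree = ast.parse(source)
    return [node.name for node in tree.body if isinstance(node, ast.ClassDef)]


print(top_level_classes(SAMPLE_SOURCE))
```

Each name in the result is a candidate for its own file, provided the class
does not rely on module-level helpers or state shared with other classes.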

We could pick one example and follow it. The current style looks closer to 
NumPy's structure and seems easier to follow.

  was:
We've got a ton of tests in one humongous tests.py file, rather than breaking 
it out into smaller files.

Having one huge file doesn't seem great for code organization, and it also 
makes the test parallelization in run-tests.py not work as well.  On my laptop, 
tests.py takes 150s, and the next longest test file takes only 20s.  There are 
similarly large files in other PySpark modules, e.g. sql/tests.py, ml/tests.py, 
mllib/tests.py, streaming/tests.py.

It seems that at least some of these files are already broken into 
independent test classes, so it shouldn't be too hard to just move them into 
their own files.


> Break large PySpark unittests into smaller files
> ------------------------------------------------
>
>                 Key: SPARK-25344
>                 URL: https://issues.apache.org/jira/browse/SPARK-25344
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Hyukjin Kwon
>            Priority: Major


