[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685394#comment-16685394 ]
Hyukjin Kwon commented on SPARK-25344:
--------------------------------------

Hey [~bryanc], once the first try is merged, mind if I ask you to take a look at some of the sub-tasks? In particular, I would appreciate it if you have a chance to take a look at ML and MLlib.

> Break large PySpark unittests into smaller files
> ------------------------------------------------
>
>                 Key: SPARK-25344
>                 URL: https://issues.apache.org/jira/browse/SPARK-25344
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also
> makes the test parallelization in run-tests.py not work as well. On my
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.
> There are similarly large files in other pyspark modules, e.g. sql/tests.py,
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, it's already broken into
> independent test classes, so it shouldn't be too hard to just move them into
> their own files.
> We could pick one example and follow it. The current style looks closer to
> the NumPy structure and looks easier to follow.
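
To illustrate the parallelization point in the description, here is a rough sketch of why a single 150s file limits the whole run. It is not how run-tests.py actually schedules work; it assumes a simple "longest file first, earliest-free worker" policy. The 150s and 20s timings come from the description, while the number of files, the number of workers, and the 30s split sizes are made up for illustration.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each test file is handed to the earliest-free
    worker, longest files first (an assumed policy, for illustration only)."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in sorted(durations, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# One 150s tests.py plus six 20s files (the file count is hypothetical).
monolithic = [150] + [20] * 6

# The same 150s of tests moved into five hypothetical 30s files.
split = [30] * 5 + [20] * 6

print(makespan(monolithic, workers=4))
print(makespan(split, workers=4))
```

Under these assumptions the monolithic layout can never finish faster than its largest file, however many workers are available, while the split layout lets the workers share the same total amount of work more evenly.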