Yikun opened a new pull request #33174: URL: https://github.com/apache/spark/pull/33174
### What changes were proposed in this pull request?

Add path-level discovery for Python unit tests.

![image](https://user-images.githubusercontent.com/1736354/124094503-6bdeb980-da8b-11eb-9bbe-b086024f6902.png)

Change list:
- Introduce **python_discover_paths** in modules.
- Add the **_discover_python_unittests** function: it is called in python/run-tests.py to load test modules.
- Add the **_append_discovred_goals** function: it calls _discover_python_unittests to refresh m.python_test_goals.
- A module is now considered to contain Python tests if it has python_test_goals or **python_discover_paths**.
- Fix: Move logging.basicConfig to the top so that logging is configured before any log output.
- Fix: Change SPARK_HOME in python/pyspark/testing/utils.py to get its value from _find_spark_home.
- Fix: Export the py4j PYTHONPATH before running tests.

Note that test discovery does a real import of every module, so all dependencies of the modules that are expected to be tested must be installed before running run-tests.

### Why are the changes needed?

Currently, Python test cases have to be listed manually whenever a new test case is added. Sometimes we forget to add the test case to the module list, and then it is never executed. For example:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism that finds all test cases, rather than specifying every case manually.

Related: https://github.com/apache/spark/pull/32867

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

1. Add doc tests for _discover_python_unittests.
2. Compare the CI results; see the diffs:
   - Build modules: pyspark-sql, pyspark-mllib, pyspark-resource: https://www.diffchecker.com/CRtc3jph
   - Build modules: pyspark-core, pyspark-streaming, pyspark-ml: https://www.diffchecker.com/We0fGQnx
   - Build modules: pyspark-pandas: https://www.diffchecker.com/vbSh6LiP
   - Build modules: pyspark-pandas-slow: https://www.diffchecker.com/1DXA88iH
3. Local test for Python modules: ./dev/run-tests --parallelism 2 --modules "pyspark-sql"
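
For readers who want a feel for what path-level discovery involves, here is a minimal sketch built on the standard library's `unittest.TestLoader.discover`. It is not the code from this PR: the implementation of _discover_python_unittests is not shown in this message, and the names `discover_test_modules`, `iter_test_cases`, `make_demo_package`, and `demo_pkg` are hypothetical, introduced only for illustration. It also shows why dependencies must be installed first, since discovery imports each test file.

```python
import os
import tempfile
import unittest


def iter_test_cases(suite):
    """Flatten a nested TestSuite into individual TestCase instances."""
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from iter_test_cases(item)
        else:
            yield item


def discover_test_modules(top_level_dir, start_dir, pattern="test_*.py"):
    """Return sorted dotted names of modules under start_dir that contain tests.

    Discovery imports every matching file, so each module's dependencies
    must be installed for it to show up here.
    """
    suite = unittest.TestLoader().discover(
        start_dir, pattern=pattern, top_level_dir=top_level_dir
    )
    modules = set()
    for case in iter_test_cases(suite):
        module_name = type(case).__module__
        # Files that fail to import are reported by unittest as placeholder
        # tests defined in unittest.loader; skip them in this sketch.
        if module_name != "unittest.loader":
            modules.add(module_name)
    return sorted(modules)


def make_demo_package(root, package="demo_pkg"):
    """Create a throwaway package with two test modules (illustration only)."""
    tests_dir = os.path.join(root, package, "tests")
    os.makedirs(tests_dir)
    for init_dir in (os.path.join(root, package), tests_dir):
        open(os.path.join(init_dir, "__init__.py"), "w").close()
    for name in ("test_alpha", "test_beta"):
        with open(os.path.join(tests_dir, name + ".py"), "w") as f:
            f.write(
                "import unittest\n\n"
                "class DemoTests(unittest.TestCase):\n"
                "    def test_ok(self):\n"
                "        self.assertTrue(True)\n"
            )
    return tests_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        tests_dir = make_demo_package(root)
        # Each printed name plays the role of one discovered test goal.
        for goal in discover_test_modules(root, tests_dir):
            print(goal)
```

With a helper of this shape, a module definition only needs to name the directories to scan, and a newly added test file is picked up without editing any list by hand.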