This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 4b77986b3945 [SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks
4b77986b3945 is described below

commit 4b77986b39450854c707310f7f6f58dae163317a
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Tue Dec 30 17:41:59 2025 +0800

    [SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks
    
    ### What changes were proposed in this pull request?
    Fail hanging tests and log their tracebacks.
    The per-test timeout is set via the environment variable `PYSPARK_TEST_TIMEOUT`.
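
    In outline: the runner enforces the timeout and, on expiry, terminates the child with SIGTERM, which triggers a `faulthandler` hook in the child that dumps every thread's traceback. A minimal sketch of the runner-side pattern (`child.py` and the 15-second timeout are illustrative placeholders, not the actual run-tests.py code):
    ```
    import subprocess
    import sys

    # Run the test process with a hard timeout (run-tests.py derives the
    # timeout from PYSPARK_TEST_TIMEOUT).
    proc = subprocess.Popen([sys.executable, "child.py"])
    try:
        retcode = proc.wait(timeout=15)
    except subprocess.TimeoutExpired:
        # SIGTERM triggers the faulthandler hook registered in the child,
        # so the per-thread tracebacks land in the captured test output.
        proc.terminate()
        proc.communicate(timeout=60)
    ```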
    
    ### Why are the changes needed?
    When a test gets stuck, the run currently produces no useful information about where it is hanging.
    
    ### Does this PR introduce _any_ user-facing change?
    No, dev-only.
    
    ### How was this patch tested?
    1. PR builder with:
    ```
    PYSPARK_TEST_TIMEOUT: 100
    ```
    
    
    https://github.com/zhengruifeng/spark/actions/runs/20522703690/job/58962106131
    
    2. Manually checked:
    ```
    (spark_dev_313) ➜  spark git:(py_test_timeout) PYSPARK_TEST_TIMEOUT=15 python/run-tests -k --python-executables python3 --testnames 'pyspark.ml.tests.connect.test_parity_clustering'
    Running PySpark tests. Output is in /Users/ruifeng.zheng/spark/python/unit-tests.log
    Will test against the following Python executables: ['python3']
    Will test the following Python tests: ['pyspark.ml.tests.connect.test_parity_clustering']
    python3 python_implementation is CPython
    python3 version is: Python 3.13.5
    Starting test(python3): pyspark.ml.tests.connect.test_parity_clustering (temp output: /Users/ruifeng.zheng/spark/python/target/c014880c-80d2-49db-8fb1-a26ab4e5246d/python3__pyspark.ml.tests.connect.test_parity_clustering__u8n7t6zc.log)
    Got TimeoutExpired while running pyspark.ml.tests.connect.test_parity_clustering with python3
    Traceback (most recent call last):
      File "/Users/ruifeng.zheng/spark/./python/run-tests.py", line 157, in run_individual_python_test
        retcode = proc.wait(timeout=timeout)
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 1280, in wait
        return self._wait(timeout=timeout)
               ~~~~~~~~~~^^^^^^^^^^^^^^^^^
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 2058, in _wait
        raise TimeoutExpired(self.args, timeout)
    subprocess.TimeoutExpired: Command '['/Users/ruifeng.zheng/spark/bin/pyspark', 'pyspark.ml.tests.connect.test_parity_clustering']' timed out after 15 seconds
    
    Running tests...
    ----------------------------------------------------------------------
    WARNING: Using incubator modules: jdk.incubator.vector
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    /Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamDuration to Some(1s) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamDuration". SQLSTATE: 46110
      warnings.warn(warn)
    /Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamSize to Some(123) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamSize". SQLSTATE: 46110
      warnings.warn(warn)
    /Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.authenticate.token to Some(deadbeef) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.authenticate.token". SQLSTATE: 46110
      warnings.warn(warn)
      test_assert_remote_mode (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_assert_remote_mode) ... ok (0.450s)
    /Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py:1016: FutureWarning: Deprecated in 3.0.0. It will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
      warnings.warn(
    ok (6.541s)
      test_distributed_lda (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_distributed_lda) ... Thread 0x0000000173083000 (most recent call first):
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 1727 in channel_spin
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap
    
    Thread 0x000000017509b000 (most recent call first):
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap
    
    Thread 0x000000017408f000 (most recent call first):
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap
    
    Thread 0x00000001719e7000 (most recent call first):
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/selectors.py", line 398 in select
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/socketserver.py", line 235 in serve_forever
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap
    
    Thread 0x00000001709db000 (most recent call first):
      File "/Users/ruifeng.zheng/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap
    
    Current thread 0x00000001f372e200 (most recent call first):
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 363 in wait
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 114 in _wait_once
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 154 in wait
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 953 in _next
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 538 in __next__
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 164 in <lambda>
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 266 in _call_iter
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 163 in _has_next
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 139 in send
      File "<frozen _collections_abc>", line 360 in __next__
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1625 in _execute_and_fetch_as_iterator
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1664 in _execute_and_fetch
      File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1162 in execute_command
      File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 308 in remote_call
      File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 322 in wrapped
      File "/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py", line 1548 in toLocal
      File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/test_clustering.py", line 449 in test_distributed_lda
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 606 in _callTestMethod
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 651 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 707 in __call__
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/xmlrunner/runner.py", line 67 in run
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 270 in runTests
      File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 104 in __init__
      File "/Users/ruifeng.zheng/spark/python/pyspark/testing/__init__.py", line 30 in unittest_main
      File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/connect/test_parity_clustering.py", line 37 in <module>
      File "<frozen runpy>", line 88 in _run_code
      File "<frozen runpy>", line 198 in _run_module_as_main
    
    Had test failures in pyspark.ml.tests.connect.test_parity_clustering with python3; see logs.
    ```
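
    The per-thread dumps above are emitted by `faulthandler` inside the test process itself. A minimal standalone sketch of that child-side hook, mirroring what `ReusedConnectTestCase.setUpClass` registers (the demo script is illustrative, not Spark code):
    ```
    import faulthandler
    import os
    import signal
    import sys

    # Only arm the handler when the runner has set a timeout, mirroring
    # the guard in setUpClass.
    if os.environ.get("PYSPARK_TEST_TIMEOUT"):
        # Dump every thread's stack to the real stderr when SIGTERM
        # arrives (the runner sends SIGTERM via proc.terminate()).
        faulthandler.register(signal.SIGTERM, file=sys.__stderr__, all_threads=True)
    ```
    Sending `kill -TERM <pid>` to a process armed this way prints per-thread stacks like the ones shown above.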
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #53528 from zhengruifeng/py_test_timeout.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 .github/workflows/build_and_test.yml   |  1 +
 python/pyspark/testing/connectutils.py |  9 +++++++++
 python/run-tests.py                    | 28 +++++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index c990337dc939..49f7d8eab5c2 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -566,6 +566,7 @@ jobs:
       SKIP_PACKAGING: true
       METASPACE_SIZE: 1g
       BRANCH: ${{ inputs.branch }}
+      PYSPARK_TEST_TIMEOUT: 300
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v4
diff --git a/python/pyspark/testing/connectutils.py b/python/pyspark/testing/connectutils.py
index 6e3c282a41ee..63dc350dd011 100644
--- a/python/pyspark/testing/connectutils.py
+++ b/python/pyspark/testing/connectutils.py
@@ -17,6 +17,9 @@
 import shutil
 import tempfile
 import os
+import sys
+import signal
+import faulthandler
 import functools
 import unittest
 import uuid
@@ -177,6 +180,9 @@ class ReusedConnectTestCase(unittest.TestCase, SQLTestUtils, PySparkErrorTestUti
 
     @classmethod
     def setUpClass(cls):
+        if os.environ.get("PYSPARK_TEST_TIMEOUT"):
+            faulthandler.register(signal.SIGTERM, file=sys.__stderr__, all_threads=True)
+
         # This environment variable is for interrupting hanging ML-handler and making the
         # tests fail fast.
         os.environ["SPARK_CONNECT_ML_HANDLER_INTERRUPTION_TIMEOUT_MINUTES"] = "5"
@@ -197,6 +203,9 @@ class ReusedConnectTestCase(unittest.TestCase, SQLTestUtils, PySparkErrorTestUti
 
     @classmethod
     def tearDownClass(cls):
+        if os.environ.get("PYSPARK_TEST_TIMEOUT"):
+            faulthandler.unregister(signal.SIGTERM)
+
         shutil.rmtree(cls.tempdir.name, ignore_errors=True)
         cls.spark.stop()
 
diff --git a/python/run-tests.py b/python/run-tests.py
index f8ed1cefa571..9c58f1dcda5f 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -234,6 +234,11 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
 
     env["PYSPARK_SUBMIT_ARGS"] = " ".join(spark_args)
 
+    timeout = os.environ.get("PYSPARK_TEST_TIMEOUT")
+    if timeout is not None:
+        env["PYSPARK_TEST_TIMEOUT"] = timeout
+        timeout = int(timeout)
+
     output_prefix = get_valid_filename(pyspark_python + "__" + test_name + "__").lstrip("_")
     # Delete is always set to False since the cleanup will be either done by removing the
     # whole test dir, or the test output is retained.
@@ -241,13 +246,17 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
                                                   suffix=".log", delete=False)
     LOGGER.info(
         "Starting test(%s): %s (temp output: %s)", pyspark_python, test_name, 
per_test_output.name)
+    cmd = [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split()
     start_time = time.time()
+
+    retcode = None
+    proc = None
     try:
-        retcode = TestRunner(
-            [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split(),
-            env,
-            per_test_output
-        ).run()
+        if timeout:
+            proc = subprocess.Popen(cmd, stderr=per_test_output, stdout=per_test_output, env=env)
+            retcode = proc.wait(timeout=timeout)
+        else:
+            retcode = TestRunner(cmd, env, per_test_output).run()
         if not keep_test_output:
             # There exists a race condition in Python and it causes flakiness in MacOS
             # https://github.com/python/cpython/issues/73885
@@ -255,6 +264,15 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
                 os.system("rm -rf " + tmp_dir)
             else:
                 shutil.rmtree(tmp_dir, ignore_errors=True)
+    except subprocess.TimeoutExpired:
+        if timeout and proc:
+            LOGGER.exception(
+                "Got TimeoutExpired while running %s with %s", test_name, 
pyspark_python
+            )
+            proc.terminate()
+            proc.communicate(timeout=60)
+        else:
+            raise
     except BaseException:
         LOGGER.exception("Got exception while running %s with %s", test_name, 
pyspark_python)
         # Here, we use os._exit() instead of sys.exit() in order to force Python to exit even if

