This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 4b77986b3945 [SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks
4b77986b3945 is described below
commit 4b77986b39450854c707310f7f6f58dae163317a
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Tue Dec 30 17:41:59 2025 +0800
[SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks
### What changes were proposed in this pull request?
Fail hanging tests and log the tracebacks of all threads.
The timeout, in seconds, is set via the environment variable `PYSPARK_TEST_TIMEOUT`.
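For reference, the runner-side mechanism is a timed `subprocess.wait`. A minimal sketch of the same idea, simplified from the `run-tests.py` diff below (the `child.py` target and log path are hypothetical stand-ins, not part of the patch):
```python
# Minimal sketch of the runner side; "child.py" and "test_output.log" are
# illustrative stand-ins, not the actual patch.
import os
import subprocess

timeout = os.environ.get("PYSPARK_TEST_TIMEOUT")
timeout = int(timeout) if timeout is not None else None  # None means wait forever

with open("test_output.log", "wb") as log:
    proc = subprocess.Popen(["python3", "child.py"], stdout=log, stderr=log)
    try:
        # Block until the child exits; raise TimeoutExpired after `timeout` seconds.
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # SIGTERM the hung child; with faulthandler registered in the child,
        # this triggers a dump of every thread's stack before it goes away.
        proc.terminate()
        proc.communicate(timeout=60)
```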
### Why are the changes needed?
When a test gets stuck, the run hangs and produces no useful information about where it is blocked.
### Does this PR introduce _any_ user-facing change?
no, dev-only
### How was this patch tested?
1. PR builder run with
```
PYSPARK_TEST_TIMEOUT: 100
```
https://github.com/zhengruifeng/spark/actions/runs/20522703690/job/58962106131
2. Manual check:
```
(spark_dev_313) ➜ spark git:(py_test_timeout) PYSPARK_TEST_TIMEOUT=15 python/run-tests -k --python-executables python3 --testnames 'pyspark.ml.tests.connect.test_parity_clustering'
Running PySpark tests. Output is in /Users/ruifeng.zheng/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.ml.tests.connect.test_parity_clustering']
python3 python_implementation is CPython
python3 version is: Python 3.13.5
Starting test(python3): pyspark.ml.tests.connect.test_parity_clustering (temp output: /Users/ruifeng.zheng/spark/python/target/c014880c-80d2-49db-8fb1-a26ab4e5246d/python3__pyspark.ml.tests.connect.test_parity_clustering__u8n7t6zc.log)
Got TimeoutExpired while running pyspark.ml.tests.connect.test_parity_clustering with python3
Traceback (most recent call last):
  File "/Users/ruifeng.zheng/spark/./python/run-tests.py", line 157, in run_individual_python_test
    retcode = proc.wait(timeout=timeout)
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 1280, in wait
    return self._wait(timeout=timeout)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 2058, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/Users/ruifeng.zheng/spark/bin/pyspark', 'pyspark.ml.tests.connect.test_parity_clustering']' timed out after 15 seconds
Running tests...
----------------------------------------------------------------------
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamDuration to Some(1s) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamDuration". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamSize to Some(123) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamSize". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.authenticate.token to Some(deadbeef) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.authenticate.token". SQLSTATE: 46110
  warnings.warn(warn)
test_assert_remote_mode (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_assert_remote_mode) ... ok (0.450s)
/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py:1016: FutureWarning: Deprecated in 3.0.0. It will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
  warnings.warn(
ok (6.541s)
test_distributed_lda (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_distributed_lda) ... Thread 0x0000000173083000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 1727 in channel_spin
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017509b000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017408f000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001719e7000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/selectors.py", line 398 in select
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/socketserver.py", line 235 in serve_forever
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001709db000 (most recent call first):
  File "/Users/ruifeng.zheng/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Current thread 0x00000001f372e200 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 363 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 114 in _wait_once
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 154 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 953 in _next
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 538 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 164 in <lambda>
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 266 in _call_iter
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 163 in _has_next
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 139 in send
  File "<frozen _collections_abc>", line 360 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1625 in _execute_and_fetch_as_iterator
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1664 in _execute_and_fetch
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1162 in execute_command
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 308 in remote_call
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 322 in wrapped
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py", line 1548 in toLocal
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/test_clustering.py", line 449 in test_distributed_lda
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 606 in _callTestMethod
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 651 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 707 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/xmlrunner/runner.py", line 67 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 270 in runTests
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 104 in __init__
  File "/Users/ruifeng.zheng/spark/python/pyspark/testing/__init__.py", line 30 in unittest_main
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/connect/test_parity_clustering.py", line 37 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Had test failures in pyspark.ml.tests.connect.test_parity_clustering with python3; see logs.
```
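The per-thread `Thread 0x...` dump above is produced by the child side of the patch: `faulthandler` is registered for `SIGTERM`, so the `proc.terminate()` issued by the runner makes the hung test print every thread's stack. A minimal standalone sketch of that half, assuming a POSIX platform (the preset env var, demo thread, and self-delivered signal are illustrative only):
```python
# Sketch of the child-side half, mirroring the connectutils.py change.
import faulthandler
import os
import signal
import sys
import threading
import time

os.environ.setdefault("PYSPARK_TEST_TIMEOUT", "15")  # for the demo only

# Same guard as the patch: only dump tracebacks when a timeout is configured.
if os.environ.get("PYSPARK_TEST_TIMEOUT"):
    faulthandler.register(signal.SIGTERM, file=sys.__stderr__, all_threads=True)

# A stand-in for a stuck background worker (e.g. a gRPC or py4j thread).
threading.Thread(target=time.sleep, args=(60,), daemon=True).start()

# Simulate the runner's proc.terminate(): every thread's stack is written
# to stderr, producing output like the "Thread 0x..." dump above.
os.kill(os.getpid(), signal.SIGTERM)
```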
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53528 from zhengruifeng/py_test_timeout.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
 .github/workflows/build_and_test.yml   |  1 +
 python/pyspark/testing/connectutils.py |  9 +++++++++
 python/run-tests.py                    | 28 +++++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index c990337dc939..49f7d8eab5c2 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -566,6 +566,7 @@ jobs:
       SKIP_PACKAGING: true
       METASPACE_SIZE: 1g
       BRANCH: ${{ inputs.branch }}
+      PYSPARK_TEST_TIMEOUT: 300
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v4
diff --git a/python/pyspark/testing/connectutils.py b/python/pyspark/testing/connectutils.py
index 6e3c282a41ee..63dc350dd011 100644
--- a/python/pyspark/testing/connectutils.py
+++ b/python/pyspark/testing/connectutils.py
@@ -17,6 +17,9 @@
 import shutil
 import tempfile
 import os
+import sys
+import signal
+import faulthandler
 import functools
 import unittest
 import uuid
@@ -177,6 +180,9 @@ class ReusedConnectTestCase(unittest.TestCase, SQLTestUtils, PySparkErrorTestUti
     @classmethod
     def setUpClass(cls):
+        if os.environ.get("PYSPARK_TEST_TIMEOUT"):
+            faulthandler.register(signal.SIGTERM, file=sys.__stderr__, all_threads=True)
+
         # This environment variable is for interrupting hanging ML-handler and making the
         # tests fail fast.
         os.environ["SPARK_CONNECT_ML_HANDLER_INTERRUPTION_TIMEOUT_MINUTES"] = "5"
@@ -197,6 +203,9 @@ class ReusedConnectTestCase(unittest.TestCase, SQLTestUtils, PySparkErrorTestUti
     @classmethod
     def tearDownClass(cls):
+        if os.environ.get("PYSPARK_TEST_TIMEOUT"):
+            faulthandler.unregister(signal.SIGTERM)
+
         shutil.rmtree(cls.tempdir.name, ignore_errors=True)
         cls.spark.stop()
diff --git a/python/run-tests.py b/python/run-tests.py
index f8ed1cefa571..9c58f1dcda5f 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -234,6 +234,11 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
     env["PYSPARK_SUBMIT_ARGS"] = " ".join(spark_args)
 
+    timeout = os.environ.get("PYSPARK_TEST_TIMEOUT")
+    if timeout is not None:
+        env["PYSPARK_TEST_TIMEOUT"] = timeout
+        timeout = int(timeout)
+
     output_prefix = get_valid_filename(pyspark_python + "__" + test_name + "__").lstrip("_")
     # Delete is always set to False since the cleanup will be either done by removing the
     # whole test dir, or the test output is retained.
@@ -241,13 +246,17 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
                                                   suffix=".log", delete=False)
     LOGGER.info(
         "Starting test(%s): %s (temp output: %s)", pyspark_python, test_name, per_test_output.name)
+    cmd = [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split()
     start_time = time.time()
+
+    retcode = None
+    proc = None
     try:
-        retcode = TestRunner(
-            [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split(),
-            env,
-            per_test_output
-        ).run()
+        if timeout:
+            proc = subprocess.Popen(cmd, stderr=per_test_output, stdout=per_test_output, env=env)
+            retcode = proc.wait(timeout=timeout)
+        else:
+            retcode = TestRunner(cmd, env, per_test_output).run()
         if not keep_test_output:
             # There exists a race condition in Python and it causes flakiness in MacOS
             # https://github.com/python/cpython/issues/73885
@@ -255,6 +264,15 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
                 os.system("rm -rf " + tmp_dir)
             else:
                 shutil.rmtree(tmp_dir, ignore_errors=True)
+    except subprocess.TimeoutExpired:
+        if timeout and proc:
+            LOGGER.exception(
+                "Got TimeoutExpired while running %s with %s", test_name, pyspark_python
+            )
+            proc.terminate()
+            proc.communicate(timeout=60)
+        else:
+            raise
     except BaseException:
         LOGGER.exception("Got exception while running %s with %s", test_name, pyspark_python)
         # Here, we use os._exit() instead of sys.exit() in order to force Python to exit even if
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]