This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new e09f2a3c0950 [SPARK-54632][PYTHON] Add the option to use ruff for lint
e09f2a3c0950 is described below
commit e09f2a3c0950cba9c42b3cfc2c1d999763862b0b
Author: Tian Gao <[email protected]>
AuthorDate: Wed Dec 10 08:53:27 2025 +0900
[SPARK-54632][PYTHON] Add the option to use ruff for lint
### What changes were proposed in this pull request?
Add `ruff` as an option to lint our code.
### Why are the changes needed?
Our pinned `flake8` version is simply too old - it cannot even run on Python 3.12+.
We could upgrade the flake8 version, but I think gradually switching to `ruff` is
the better option. The main reason is that `ruff` is much, much faster than `flake8`:
it returns results almost immediately (at the millisecond level) on the whole Spark repo,
which means we could even hook it into a pre-commit step in the future.
It is surprisingly compatible with flake8 - almost no code change was needed
(aside from two extra ignored lint types, which we can fix in the future).
Everything it finds is a real issue rather than a matter of taste.
`ruff` can also serve as a black-compatible formatter, which means we could
probably ditch both `flake8` and `black` in the future.
For now we only add the option - it is not hooked into CI or into the default
`./dev/lint-python` run. However, I think we should do that soon.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local lint test passed.
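For reference, a minimal sketch of exercising this locally (the `./dev/lint-python --ruff` and `ruff check` invocations assume a Spark checkout with ruff >= 0.14.0 on PATH; the standalone `min_ok` helper is a hypothetical reimplementation of the version gate in `ruff_test` using `sort -V`, not the repo's `satisfies_min_version`):

```shell
# Hypothetical local invocations (paths assume a Spark checkout):
#   ./dev/lint-python --ruff
#   ruff check --config dev/pyproject.toml

# The minimum-version gate in ruff_test can be sketched standalone with sort -V:
min_ok() {
  # Succeeds when version $1 is >= minimum $2 (version-aware comparison).
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if min_ok "0.14.2" "0.14.0"; then echo "version ok"; else echo "too old"; fi
```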
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53378 from gaogaotiantian/add-ruff.
Authored-by: Tian Gao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
dev/lint-python | 49 +++++++++++++++++++-
dev/pyproject.toml | 52 ++++++++++++++++++++++
python/pyspark/ml/evaluation.py | 2 +-
python/pyspark/sql/connect/merge.py | 8 ++--
.../sql/tests/arrow/test_arrow_udf_typehints.py | 2 +-
.../sql/tests/pandas/test_pandas_udf_typehints.py | 2 +-
...pandas_udf_typehints_with_future_annotations.py | 2 +-
python/pyspark/taskcontext.py | 2 -
python/pyspark/util.py | 2 +-
9 files changed, 109 insertions(+), 12 deletions(-)
diff --git a/dev/lint-python b/dev/lint-python
index b8703310bc4b..fab7ad9ebe26 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -18,6 +18,8 @@
# define test binaries + versions
FLAKE8_BUILD="flake8"
MINIMUM_FLAKE8="3.9.0"
+RUFF_BUILD="ruff"
+MINIMUM_RUFF="0.14.0"
MINIMUM_MYPY="1.8.0"
MYPY_BUILD="mypy"
PYTEST_BUILD="pytest"
@@ -52,6 +54,9 @@ while (( "$#" )); do
--flake8)
FLAKE8_TEST=true
;;
+ --ruff)
+ RUFF_TEST=true
+ ;;
--mypy)
MYPY_TEST=true
;;
@@ -69,7 +74,7 @@ while (( "$#" )); do
shift
done
-if [[ -z "$COMPILE_TEST$BLACK_TEST$PYSPARK_CUSTOM_ERRORS_CHECK_TEST$FLAKE8_TEST$MYPY_TEST$MYPY_EXAMPLES_TEST$MYPY_DATA_TEST" ]]; then
+if [[ -z "$COMPILE_TEST$BLACK_TEST$PYSPARK_CUSTOM_ERRORS_CHECK_TEST$FLAKE8_TEST$RUFF_TEST$MYPY_TEST$MYPY_EXAMPLES_TEST$MYPY_DATA_TEST" ]]; then
COMPILE_TEST=true
BLACK_TEST=true
PYSPARK_CUSTOM_ERRORS_CHECK_TEST=true
@@ -270,6 +275,45 @@ flake8 checks failed."
fi
}
+function ruff_test {
+ local RUFF_VERSION=
+ local EXPECTED_RUFF=
+ local RUFF_REPORT=
+ local RUFF_STATUS=
+
+ if ! hash "$RUFF_BUILD" 2> /dev/null; then
+ echo "The ruff command was not found. Skipping for now."
+ return
+ fi
+
+ _RUFF_VERSION=($($RUFF_BUILD --version))
+ RUFF_VERSION="${_RUFF_VERSION[1]}"
+ EXPECTED_RUFF="$(satisfies_min_version $RUFF_VERSION $MINIMUM_RUFF)"
+
+ if [[ "$EXPECTED_RUFF" == "False" ]]; then
+ echo "\
+The minimum ruff version needs to be $MINIMUM_RUFF. Your current version is $RUFF_VERSION
+
+ruff checks failed."
+ exit 1
+ fi
+
+ echo "starting $RUFF_BUILD test..."
+ RUFF_REPORT=$( ($RUFF_BUILD check --config dev/pyproject.toml) 2>&1)
+ RUFF_STATUS=$?
+
+ if [ "$RUFF_STATUS" -ne 0 ]; then
+ echo "ruff checks failed:"
+ echo "$RUFF_REPORT"
+ echo "$RUFF_STATUS"
+ exit "$RUFF_STATUS"
+ else
+ echo "ruff checks passed."
+ echo
+ fi
+
+}
+
function black_test {
local BLACK_REPORT=
local BLACK_STATUS=
@@ -335,6 +379,9 @@ fi
if [[ "$FLAKE8_TEST" == "true" ]]; then
flake8_test
fi
+if [[ "$RUFF_TEST" == "true" ]]; then
+ ruff_test
+fi
if [[ "$MYPY_TEST" == "true" ]] || [[ "$MYPY_EXAMPLES_TEST" == "true" ]] || [[ "$MYPY_DATA_TEST" == "true" ]]; then
mypy_test
fi
diff --git a/dev/pyproject.toml b/dev/pyproject.toml
index 8b9194300955..11a042305dd7 100644
--- a/dev/pyproject.toml
+++ b/dev/pyproject.toml
@@ -24,6 +24,58 @@ testpaths = [
"pyspark/ml/typing",
]
+[tool.ruff]
+exclude = [
+ "*/target/*",
+ "**/*.ipynb",
+ "docs/.local_ruby_bundle/",
+ "*python/pyspark/cloudpickle/*.py",
+ "*python/pyspark/ml/deepspeed/tests/*.py",
+ "*python/docs/build/*",
+ "*python/docs/source/conf.py",
+ "*python/.eggs/*",
+ "dist/*",
+ ".git/*",
+ "*python/pyspark/sql/pandas/functions.pyi",
+ "*python/pyspark/sql/column.pyi",
+ "*python/pyspark/worker.pyi",
+ "*python/pyspark/java_gateway.pyi",
+ "*python/pyspark/sql/connect/proto/*",
+ "*python/pyspark/sql/streaming/proto/*",
+ "*venv*/*",
+]
+
+[tool.ruff.lint]
+ignore = [
+ "E203", # Skip as black formatter adds a whitespace around ':'.
+ "E402", # Module top level import is disabled for optional import check, etc.
+ # TODO
+ "E721", # Use isinstance for type comparison, too many for now.
+ "E741", # Ambiguous variables like l, I or O.
+]
+
+[tool.ruff.lint.per-file-ignores]
+ # E501 is ignored as shared.py is auto-generated.
+ "python/pyspark/ml/param/shared.py" = ["E501"]
+ # E501 is ignored as we should keep the json string format in error_classes.py.
+ "python/pyspark/errors/error_classes.py" = ["E501"]
+ # Examples contain some unused variables.
+ "examples/src/main/python/sql/datasource.py" = ["F841"]
+ # Exclude * imports in test files
+ "python/pyspark/errors/tests/*.py" = ["F403"]
+ "python/pyspark/logger/tests/*.py" = ["F403"]
+ "python/pyspark/logger/tests/connect/*.py" = ["F403"]
+ "python/pyspark/ml/tests/*.py" = ["F403"]
+ "python/pyspark/mllib/tests/*.py" = ["F403"]
+ "python/pyspark/pandas/tests/*.py" = ["F401", "F403"]
+ "python/pyspark/pandas/tests/connect/*.py" = ["F401", "F403"]
+ "python/pyspark/resource/tests/*.py" = ["F403"]
+ "python/pyspark/sql/tests/*.py" = ["F403"]
+ "python/pyspark/streaming/tests/*.py" = ["F403"]
+ "python/pyspark/tests/*.py" = ["F403"]
+ "python/pyspark/testing/*.py" = ["F401"]
+ "python/pyspark/testing/tests/*.py" = ["F403"]
+
[tool.black]
# When changing the version, we have to update
# GitHub workflow version and dev/reformat-python
diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py
index 56747d07441b..719282d54143 100644
--- a/python/pyspark/ml/evaluation.py
+++ b/python/pyspark/ml/evaluation.py
@@ -710,7 +710,7 @@ class MulticlassClassificationEvaluator(
def isLargerBetter(self) -> bool:
"""Override this function to make it run on connect"""
- return not self.getMetricName() in [
+ return self.getMetricName() not in [
"weightedFalsePositiveRate",
"falsePositiveRateByLabel",
"logLoss",
diff --git a/python/pyspark/sql/connect/merge.py b/python/pyspark/sql/connect/merge.py
index 295e6089e092..913b5099b776 100644
--- a/python/pyspark/sql/connect/merge.py
+++ b/python/pyspark/sql/connect/merge.py
@@ -19,7 +19,7 @@ from pyspark.sql.connect.utils import check_dependencies
check_dependencies(__name__)
import sys
-from typing import Dict, Optional, TYPE_CHECKING, List, Callable
+from typing import Dict, Optional, TYPE_CHECKING, Callable
from pyspark.sql.connect import proto
from pyspark.sql.connect.column import Column
@@ -73,9 +73,9 @@ class MergeIntoWriter:
self._callback = callback if callback is not None else lambda _: None
self._schema_evolution_enabled = False
- self._matched_actions = list() # type: List[proto.MergeAction]
- self._not_matched_actions = list() # type: List[proto.MergeAction]
- self._not_matched_by_source_actions = list() # type: List[proto.MergeAction]
+ self._matched_actions: list[proto.MergeAction] = list()
+ self._not_matched_actions: list[proto.MergeAction] = list()
+ self._not_matched_by_source_actions: list[proto.MergeAction] = list()
def whenMatched(self, condition: Optional[Column] = None) -> "MergeIntoWriter.WhenMatched":
return self.WhenMatched(self, condition)
diff --git a/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py b/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
index 5604012bf2a5..a117a049952c 100644
--- a/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
+++ b/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
@@ -461,7 +461,7 @@ class ArrowUDFTypeHintsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.arrow.test_arrow_udf_typehints import * # noqa: #401
+ from pyspark.sql.tests.arrow.test_arrow_udf_typehints import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
index 09de2a2e3198..7e9f22bb94bb 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
@@ -444,7 +444,7 @@ class PandasUDFTypeHintsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.pandas.test_pandas_udf_typehints import * # noqa: #401
+ from pyspark.sql.tests.pandas.test_pandas_udf_typehints import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
index 9fa3531ee137..88cc4b81778a 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
@@ -367,7 +367,7 @@ class PandasUDFTypeHintsWithFutureAnnotationsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.pandas.test_pandas_udf_typehints_with_future_annotations import * # noqa: #401
+ from pyspark.sql.tests.pandas.test_pandas_udf_typehints_with_future_annotations import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/taskcontext.py b/python/pyspark/taskcontext.py
index 957f9d70687b..f967a66838f4 100644
--- a/python/pyspark/taskcontext.py
+++ b/python/pyspark/taskcontext.py
@@ -252,8 +252,6 @@ class TaskContext:
dict
a dictionary of a string resource name, and
:class:`ResourceInformation`.
"""
- from pyspark.resource import ResourceInformation
-
return cast(Dict[str, "ResourceInformation"], self._resources)
diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index f8750fbbec2e..22c653508fbb 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -722,7 +722,7 @@ def _local_iterator_from_socket(sock_info: "JavaArray", serializer: "Serializer"
def __init__(self, _sock_info: "JavaArray", _serializer: "Serializer"):
port: int
auth_secret: str
- jsocket_auth_server: "JavaObject"
+ self.jsocket_auth_server: "JavaObject"
port, auth_secret, self.jsocket_auth_server = _sock_info
self._sockfile, self._sock = _create_local_socket((port, auth_secret))
self._serializer = _serializer