This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new e09f2a3c0950 [SPARK-54632][PYTHON] Add the option to use ruff for lint
e09f2a3c0950 is described below
commit e09f2a3c0950cba9c42b3cfc2c1d999763862b0b
Author: Tian Gao <[email protected]>
AuthorDate: Wed Dec 10 08:53:27 2025 +0900
[SPARK-54632][PYTHON] Add the option to use ruff for lint
### What changes were proposed in this pull request?
Add `ruff` as an option to lint our code.
### Why are the changes needed?
Our pinned `flake8` version is simply too old - it cannot even run on Python 3.12+.
We could upgrade the flake8 version, but I think gradually switching to `ruff` is
the better option. The main reason is that `ruff` is much, much faster than `flake8`:
it returns results almost immediately (at the millisecond level) on the whole Spark repo,
which means we could even hook it into a pre-commit step in the future.
It is surprisingly compatible with flake8 - almost no code change was needed
(aside from two extra ignored lint types, which we can fix in the future).
Everything it finds is a real issue rather than a matter of taste.
`ruff` can also serve as a black-compatible formatter, which means we could
probably ditch both `flake8` and `black` in the future.
For now we only add the option - it is not hooked into CI or into the default
`./dev/lint-python` run. However, I think we should do that soon.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local lint test passed.
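For reference, a minimal sketch of exercising this locally (the `./dev/lint-python --ruff` and `ruff check` invocations assume a Spark checkout with ruff >= 0.14.0 on PATH; the standalone `min_ok` helper is a hypothetical reimplementation of the version gate in `ruff_test` using `sort -V`, not the repo's `satisfies_min_version`):

```shell
# Hypothetical local invocations (paths assume a Spark checkout):
#   ./dev/lint-python --ruff
#   ruff check --config dev/pyproject.toml

# The minimum-version gate in ruff_test can be sketched standalone with sort -V:
min_ok() {
  # Succeeds when version $1 is >= minimum $2 (version-aware comparison).
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if min_ok "0.14.2" "0.14.0"; then echo "version ok"; else echo "too old"; fi
```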
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53378 from gaogaotiantian/add-ruff.
Authored-by: Tian Gao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
dev/lint-python | 49 +++++++++++++++++++-
dev/pyproject.toml | 52 ++++++++++++++++++++++
python/pyspark/ml/evaluation.py | 2 +-
python/pyspark/sql/connect/merge.py | 8 ++--
.../sql/tests/arrow/test_arrow_udf_typehints.py | 2 +-
.../sql/tests/pandas/test_pandas_udf_typehints.py | 2 +-
...pandas_udf_typehints_with_future_annotations.py | 2 +-
python/pyspark/taskcontext.py | 2 -
python/pyspark/util.py | 2 +-
9 files changed, 109 insertions(+), 12 deletions(-)
diff --git a/dev/lint-python b/dev/lint-python
index b8703310bc4b..fab7ad9ebe26 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -18,6 +18,8 @@
# define test binaries + versions
FLAKE8_BUILD="flake8"
MINIMUM_FLAKE8="3.9.0"
+RUFF_BUILD="ruff"
+MINIMUM_RUFF="0.14.0"
MINIMUM_MYPY="1.8.0"
MYPY_BUILD="mypy"
PYTEST_BUILD="pytest"
@@ -52,6 +54,9 @@ while (( "$#" )); do
--flake8)
FLAKE8_TEST=true
;;
+ --ruff)
+ RUFF_TEST=true
+ ;;
--mypy)
MYPY_TEST=true
;;
@@ -69,7 +74,7 @@ while (( "$#" )); do
shift
done
-if [[ -z "$COMPILE_TEST$BLACK_TEST$PYSPARK_CUSTOM_ERRORS_CHECK_TEST$FLAKE8_TEST$MYPY_TEST$MYPY_EXAMPLES_TEST$MYPY_DATA_TEST" ]]; then
+if [[ -z "$COMPILE_TEST$BLACK_TEST$PYSPARK_CUSTOM_ERRORS_CHECK_TEST$FLAKE8_TEST$RUFF_TEST$MYPY_TEST$MYPY_EXAMPLES_TEST$MYPY_DATA_TEST" ]]; then
COMPILE_TEST=true
BLACK_TEST=true
PYSPARK_CUSTOM_ERRORS_CHECK_TEST=true
@@ -270,6 +275,45 @@ flake8 checks failed."
fi
}
+function ruff_test {
+ local RUFF_VERSION=
+ local EXPECTED_RUFF=
+ local RUFF_REPORT=
+ local RUFF_STATUS=
+
+ if ! hash "$RUFF_BUILD" 2> /dev/null; then
+ echo "The ruff command was not found. Skipping for now."
+ return
+ fi
+
+ _RUFF_VERSION=($($RUFF_BUILD --version))
+ RUFF_VERSION="${_RUFF_VERSION[1]}"
+ EXPECTED_RUFF="$(satisfies_min_version $RUFF_VERSION $MINIMUM_RUFF)"
+
+ if [[ "$EXPECTED_RUFF" == "False" ]]; then
+ echo "\
+The minimum ruff version needs to be $MINIMUM_RUFF. Your current version is $RUFF_VERSION
+
+ruff checks failed."
+ exit 1
+ fi
+
+ echo "starting $RUFF_BUILD test..."
+ RUFF_REPORT=$( ($RUFF_BUILD check --config dev/pyproject.toml) 2>&1)
+ RUFF_STATUS=$?
+
+ if [ "$RUFF_STATUS" -ne 0 ]; then
+ echo "ruff checks failed:"
+ echo "$RUFF_REPORT"
+ echo "$RUFF_STATUS"
+ exit "$RUFF_STATUS"
+ else
+ echo "ruff checks passed."
+ echo
+ fi
+
+}
+
function black_test {
local BLACK_REPORT=
local BLACK_STATUS=
@@ -335,6 +379,9 @@ fi
if [[ "$FLAKE8_TEST" == "true" ]]; then
flake8_test
fi
+if [[ "$RUFF_TEST" == "true" ]]; then
+ ruff_test
+fi
if [[ "$MYPY_TEST" == "true" ]] || [[ "$MYPY_EXAMPLES_TEST" == "true" ]] || [[ "$MYPY_DATA_TEST" == "true" ]]; then
mypy_test
fi
diff --git a/dev/pyproject.toml b/dev/pyproject.toml
index 8b9194300955..11a042305dd7 100644
--- a/dev/pyproject.toml
+++ b/dev/pyproject.toml
@@ -24,6 +24,58 @@ testpaths = [
"pyspark/ml/typing",
]
+[tool.ruff]
+exclude = [
+ "*/target/*",
+ "**/*.ipynb",
+ "docs/.local_ruby_bundle/",
+ "*python/pyspark/cloudpickle/*.py",
+ "*python/pyspark/ml/deepspeed/tests/*.py",
+ "*python/docs/build/*",
+ "*python/docs/source/conf.py",
+ "*python/.eggs/*",
+ "dist/*",
+ ".git/*",
+ "*python/pyspark/sql/pandas/functions.pyi",
+ "*python/pyspark/sql/column.pyi",
+ "*python/pyspark/worker.pyi",
+ "*python/pyspark/java_gateway.pyi",
+ "*python/pyspark/sql/connect/proto/*",
+ "*python/pyspark/sql/streaming/proto/*",
+ "*venv*/*",
+]
+
+[tool.ruff.lint]
+ignore = [
+ "E203", # Skip as black formatter adds a whitespace around ':'.
+ "E402", # Module top level import is disabled for optional import check, etc.
+ # TODO
+ "E721", # Use isinstance for type comparison, too many for now.
+ "E741", # Ambiguous variables like l, I or O.
+]
+
+[tool.ruff.lint.per-file-ignores]
+ # E501 is ignored as shared.py is auto-generated.
+ "python/pyspark/ml/param/shared.py" = ["E501"]
+ # E501 is ignored as we should keep the json string format in error_classes.py.
+ "python/pyspark/errors/error_classes.py" = ["E501"]
+ # Examples contain some unused variables.
+ "examples/src/main/python/sql/datasource.py" = ["F841"]
+ # Exclude * imports in test files
+ "python/pyspark/errors/tests/*.py" = ["F403"]
+ "python/pyspark/logger/tests/*.py" = ["F403"]
+ "python/pyspark/logger/tests/connect/*.py" = ["F403"]
+ "python/pyspark/ml/tests/*.py" = ["F403"]
+ "python/pyspark/mllib/tests/*.py" = ["F403"]
+ "python/pyspark/pandas/tests/*.py" = ["F401", "F403"]
+ "python/pyspark/pandas/tests/connect/*.py" = ["F401", "F403"]
+ "python/pyspark/resource/tests/*.py" = ["F403"]
+ "python/pyspark/sql/tests/*.py" = ["F403"]
+ "python/pyspark/streaming/tests/*.py" = ["F403"]
+ "python/pyspark/tests/*.py" = ["F403"]
+ "python/pyspark/testing/*.py" = ["F401"]
+ "python/pyspark/testing/tests/*.py" = ["F403"]
+
[tool.black]
# When changing the version, we have to update
# GitHub workflow version and dev/reformat-python
diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py
index 56747d07441b..719282d54143 100644
--- a/python/pyspark/ml/evaluation.py
+++ b/python/pyspark/ml/evaluation.py
@@ -710,7 +710,7 @@ class MulticlassClassificationEvaluator(
def isLargerBetter(self) -> bool:
"""Override this function to make it run on connect"""
- return not self.getMetricName() in [
+ return self.getMetricName() not in [
"weightedFalsePositiveRate",
"falsePositiveRateByLabel",
"logLoss",
diff --git a/python/pyspark/sql/connect/merge.py b/python/pyspark/sql/connect/merge.py
index 295e6089e092..913b5099b776 100644
--- a/python/pyspark/sql/connect/merge.py
+++ b/python/pyspark/sql/connect/merge.py
@@ -19,7 +19,7 @@ from pyspark.sql.connect.utils import check_dependencies
check_dependencies(__name__)
import sys
-from typing import Dict, Optional, TYPE_CHECKING, List, Callable
+from typing import Dict, Optional, TYPE_CHECKING, Callable
from pyspark.sql.connect import proto
from pyspark.sql.connect.column import Column
@@ -73,9 +73,9 @@ class MergeIntoWriter:
self._callback = callback if callback is not None else lambda _: None
self._schema_evolution_enabled = False
- self._matched_actions = list() # type: List[proto.MergeAction]
- self._not_matched_actions = list() # type: List[proto.MergeAction]
- self._not_matched_by_source_actions = list() # type: List[proto.MergeAction]
+ self._matched_actions: list[proto.MergeAction] = list()
+ self._not_matched_actions: list[proto.MergeAction] = list()
+ self._not_matched_by_source_actions: list[proto.MergeAction] = list()
def whenMatched(self, condition: Optional[Column] = None) -> "MergeIntoWriter.WhenMatched":
return self.WhenMatched(self, condition)
diff --git a/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py b/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
index 5604012bf2a5..a117a049952c 100644
--- a/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
+++ b/python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py
@@ -461,7 +461,7 @@ class ArrowUDFTypeHintsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.arrow.test_arrow_udf_typehints import * # noqa: #401
+ from pyspark.sql.tests.arrow.test_arrow_udf_typehints import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
index 09de2a2e3198..7e9f22bb94bb 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints.py
@@ -444,7 +444,7 @@ class PandasUDFTypeHintsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.pandas.test_pandas_udf_typehints import * # noqa: #401
+ from pyspark.sql.tests.pandas.test_pandas_udf_typehints import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
index 9fa3531ee137..88cc4b81778a 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_typehints_with_future_annotations.py
@@ -367,7 +367,7 @@ class PandasUDFTypeHintsWithFutureAnnotationsTests(ReusedSQLTestCase):
if __name__ == "__main__":
- from pyspark.sql.tests.pandas.test_pandas_udf_typehints_with_future_annotations import * # noqa: #401
+ from pyspark.sql.tests.pandas.test_pandas_udf_typehints_with_future_annotations import * # noqa: #F401
try:
import xmlrunner
diff --git a/python/pyspark/taskcontext.py b/python/pyspark/taskcontext.py
index 957f9d70687b..f967a66838f4 100644
--- a/python/pyspark/taskcontext.py
+++ b/python/pyspark/taskcontext.py
@@ -252,8 +252,6 @@ class TaskContext:
dict
a dictionary of a string resource name, and
:class:`ResourceInformation`.
"""
- from pyspark.resource import ResourceInformation
-
return cast(Dict[str, "ResourceInformation"], self._resources)
diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index f8750fbbec2e..22c653508fbb 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -722,7 +722,7 @@ def _local_iterator_from_socket(sock_info: "JavaArray", serializer: "Serializer"
def __init__(self, _sock_info: "JavaArray", _serializer: "Serializer"):
port: int
auth_secret: str
- jsocket_auth_server: "JavaObject"
+ self.jsocket_auth_server: "JavaObject"
port, auth_secret, self.jsocket_auth_server = _sock_info
self._sockfile, self._sock = _create_local_socket((port, auth_secret))
self._serializer = _serializer