HyukjinKwon commented on code in PR #49624:
URL: https://github.com/apache/spark/pull/49624#discussion_r1940507357
##########
python/pyspark/ml/feature.py:
##########
@@ -5069,11 +5070,19 @@ def __init__(
         self._java_obj = self._new_java_obj(
             "org.apache.spark.ml.feature.StopWordsRemover", self.uid
         )
-        self._setDefault(
-            stopWords=StopWordsRemover.loadDefaultStopWords("english"),
-            caseSensitive=False,
-            locale=self._java_obj.getLocale(),
-        )
+        if isinstance(self._java_obj, str):
+            # Skip setting the default value of 'locale', which needs to invoke a JVM method.
+            # So if users don't explicitly set 'locale', then getLocale fails.
+            self._setDefault(
+                stopWords=StopWordsRemover.loadDefaultStopWords("english"),
Review Comment:
Seems like this will still use Py4J:
```
======================================================================
ERROR [0.004s]: test_stop_words_remover (pyspark.ml.tests.connect.test_parity_feature.FeatureParityTests.test_stop_words_remover)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/ml/tests/test_feature.py", line 866, in test_stop_words_remover
    remover = StopWordsRemover(stopWords=["b"])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/__init__.py", line 115, in wrapper
    return func(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/ml/feature.py", line 5098, in __init__
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/ml/feature.py", line 5231, in loadDefaultStopWords
    stopWordsObj = getattr(_jvm(), "org.apache.spark.ml.feature.StopWordsRemover")
                           ^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/ml/util.py", line 376, in _jvm
    from pyspark.core.context import SparkContext
ModuleNotFoundError: No module named 'pyspark.core'
```
https://github.com/apache/spark/actions/runs/13120894941/job/36606320881
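   For reference, the chain in the traceback is `loadDefaultStopWords` -> `_jvm()` -> `from pyspark.core.context import SparkContext`, and `pyspark.core` does not exist on a Connect-only client, so any default that goes through `loadDefaultStopWords` will still need Py4J. A rough sketch of the kind of guard that would keep the `stopWords` default off the Py4J path as well (only an illustration; the `_can_use_py4j` helper is made up here, not something in this PR):
   ```
   from pyspark.ml.feature import StopWordsRemover
   
   
   def _can_use_py4j() -> bool:
       # Hypothetical helper, for illustration only: True when the classic
       # Py4J-backed PySpark is installed and a SparkContext is running.
       try:
           # The exact import that raises ModuleNotFoundError above when
           # only the Spark Connect client package is installed.
           from pyspark.core.context import SparkContext
       except ModuleNotFoundError:
           return False
       return SparkContext._active_spark_context is not None
   
   
   # Compute the JVM-backed default stop word list only when Py4J is usable;
   # otherwise leave 'stopWords' unset, like the patch already does for 'locale'.
   defaults = {"caseSensitive": False}
   if _can_use_py4j():
       defaults["stopWords"] = StopWordsRemover.loadDefaultStopWords("english")
   ```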
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]