bzhaoopenstack opened a new pull request, #37369:
URL: https://github.com/apache/spark/pull/37369
The argument passed to `nsmallest` should be validated as an integer, but this
check is currently missing.
As a result, PySpark raises an `AnalysisException` when a value of another type
is passed to `nsmallest`.
### What changes were proposed in this pull request?
Validate that the argument passed to `nsmallest` is an integer.
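
A minimal sketch of such a check, not the code in this PR. The helper name `validate_n` is hypothetical, `n` mirrors the parameter name used by pandas' `nsmallest`, and raising `TypeError` is a choice made for this sketch (the testing notes in this description mention an assertion error instead). Because `bool` is a subclass of `int` in Python, a plain `isinstance(n, int)` check would still accept `True`, so the sketch rejects booleans explicitly.

```python
def validate_n(n):
    """Hypothetical helper: return n unchanged if it is a real integer.

    Raises TypeError for any other type, including bool.
    """
    # bool is a subclass of int, so isinstance(True, int) is True.
    # Reject it first, otherwise nsmallest(True) would pass the check.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(
            "n must be an integer, got {}".format(type(n).__name__)
        )
    return n
```

With a guard like this at the top of `nsmallest`, a call such as `nsmallest(True)` would fail immediately with a short Python error rather than reaching the Spark analyzer.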
### Why are the changes needed?
PySpark raises an error if the argument type is not restricted:
```
>>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B']
<pyspark.pandas.groupby.SeriesGroupBy object at 0x7fda5a171fa0>
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
    sdf.withColumn(temp_rank_column, F.row_number().over(window))
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
'Filter (__rank__#4995 <= true)
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
   +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
      +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
         +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
            +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
               +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
                  +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passing a non-integer argument to `nsmallest` now raises an `AssertionError`.