ueshin commented on a change in pull request #33882:
URL: https://github.com/apache/spark/pull/33882#discussion_r700579072
##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -2814,9 +2824,18 @@ def to_numeric(arg):
1.0
"""
     if isinstance(arg, Series):
-        return arg._with_new_scol(arg.spark.column.cast("float"))
+        if errors == "coerce":
+            return arg._with_new_scol(arg.spark.column.cast("int"))
+        elif errors == "ignore":
+            scol = arg.spark.column
+            casted_scol = scol.cast("int")
+            return arg._with_new_scol(F.when(casted_scol.isNull(), scol).otherwise(casted_scol))
Review comment:
Actually, the case @itholic raised is a bit tricky:
pandas returns a numeric type when there is no error.
```py
>>> pd.to_numeric(pd.Series(["1", "2", "3"]), errors="ignore")
0 1
1 2
2 3
dtype: int64
```
whereas the current implementation always returns `StringType`:
```py
>>> ps.to_numeric(ps.Series(["1", "2", "3"]), errors="ignore")
0 1
1 2
2 3
dtype: object
```
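For comparison, when parsing does fail, pandas with `errors="ignore"` falls back to the original values and `object` dtype, which is the only case the current implementation actually matches:
```py
>>> pd.to_numeric(pd.Series(["1", "2", "x"]), errors="ignore")
0    1
1    2
2    x
dtype: object
```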
Since Spark can't change the data type depending on whether an error occurred,
we have to check for errors ourselves beforehand. (Or should we just not support
this case?)
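For example, here is a rough sketch of the "check beforehand" idea (just an illustration, not a concrete proposal; note that it triggers an extra Spark job to scan the data, and it assumes `arg.spark.column` can be resolved against `arg._internal.spark_frame`):
```py
# Inside the `isinstance(arg, Series)` branch of `to_numeric`.
scol = arg.spark.column
casted_scol = scol.cast("float")
# Look for any value that fails to cast: non-null input, null after cast.
error_rows = (
    arg._internal.spark_frame
    .where(casted_scol.isNull() & scol.isNotNull())
    .head(1)
)
if len(error_rows) == 0:
    # No parsing errors: return a numeric column, matching pandas.
    return arg._with_new_scol(casted_scol)
else:
    # At least one error: mimic pandas by returning the input unchanged.
    return arg
```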
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.