Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/16792#discussion_r99471688
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
"""Returns a new :class:`DataFrame` replacing a value with another
value.
:func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace`
are
aliases of each other.
+ Values `to_replace` and `value` should be homogeneous. Mixed
string and numeric
--- End diff --
I don't think we need to cast the types; if you look inside `replace0`,
all of the numerics are turned into doubles in the map (but we should probably,
in your other PR, add a test around that so that if the internals change we
know we need to update the Python side).
Doing `sc.parallelize([Row(name='Alice', age=0,
height=80)]).toDF().replace(0, 12.5).collect()` is what I was talking about
cutting off the decimal component (so while it runs, it arguably doesn't do
what the user expects).
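To illustrate the truncation in plain Python (a simplified sketch with a hypothetical helper, not PySpark's actual code path): when the replacement value is cast to the integer column's existing type, the fractional part is silently dropped.

```python
# Sketch (plain Python, hypothetical helper -- not PySpark's implementation):
# mimics replacing a value in an integer-typed column, where the new value
# is cast to the column's type and the decimal component is lost.
def replace_in_int_column(values, to_replace, new_value):
    # int(12.5) truncates to 12, which is the surprising behavior discussed
    return [int(new_value) if v == to_replace else v for v in values]

ages = [0, 25, 0]
print(replace_in_int_column(ages, 0, 12.5))  # -> [12, 25, 12], not [12.5, 25, 12.5]
```

This is why the replacement "runs" but silently returns 12 where the user asked for 12.5.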
What about something along the lines of: "`to_replace` and `value` should
contain either all numerics, all booleans, or all strings. When replacing, the
new value will be cast to the type of the existing column."
I think this more clearly communicates the requirements, but it is still a bit
awkward; can you think of something better?
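The proposed wording could be sketched as a validation rule (hypothetical helper for illustration only, not actual PySpark code): values are acceptable when they are all numerics, all booleans, or all strings.

```python
# Sketch (hypothetical helper -- not PySpark code): checks the homogeneity
# requirement from the proposed docstring wording. Note that bool must be
# tested before int/float, since bool is a subclass of int in Python.
def all_same_kind(values):
    kinds = {
        bool if isinstance(v, bool) else
        float if isinstance(v, (int, float)) else
        str if isinstance(v, str) else object
        for v in values
    }
    return len(kinds) == 1

print(all_same_kind([0, 12.5]))  # -> True (all numerics)
print(all_same_kind(['a', 1]))   # -> False (mixed string and numeric)
```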