Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20499#discussion_r166300631
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value,
                                              self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
--- End diff ---
Yea, I think that summarises the issue.

> Can we use an invalid value as the default value for value? Then we can
> throw exception if the value is not set by user.
Yea, we could define a class / instance to indicate no value like NumPy
does -
https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76. I was
thinking of resembling that approach too, but it is kind of a new approach to
Spark and this is a single case so far.
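For instance, a rough sketch of such a sentinel could look like this (the
names `_NoValueType` and `_NoValue` below are made up for illustration, not
existing PySpark identifiers):

```
class _NoValueType(object):
    """Singleton meaning "the caller did not pass this argument",
    distinct from a user-supplied None."""
    _instance = None

    def __new__(cls):
        # Always hand back the same object so `is` checks are reliable.
        if cls._instance is None:
            cls._instance = object.__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<no value>"


_NoValue = _NoValueType()


def replace(to_replace, value=_NoValue, subset=None):
    # None stays a legitimate replacement value; only the sentinel
    # signals that `value` was omitted.
    if value is _NoValue and not isinstance(to_replace, dict):
        raise TypeError("value is required when to_replace is not a dict")
```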
To get to the point, yea, we could maybe use an invalid value as the default
and treat it as unset when `to_replace` is a dictionary. For example, I could
assign `{}`. But then the problem is the docstring generated by pydoc and the
API documentation. It will show something like:
```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
    ...
```
This is pretty confusing. To my knowledge, we can't really override this
signature - I tried a few times before and failed, if I remember correctly.
Maybe this is good enough, but I didn't want to start with that approach,
because strictly speaking the issue @rxin raised sounds like it comes from
`value` having a default value at all.
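As context, the diff above takes the `*args` / `**kwargs` route instead. A
rough sketch of how that route could detect a missing `value` (this parsing is
only illustrative, not the actual code in the PR; duplicate-argument checks
are omitted for brevity):

```
def replace(to_replace, *args, **kwargs):
    if len(args) > 2:
        raise TypeError("replace() takes at most 3 positional arguments")
    unknown = set(kwargs) - {"value", "subset"}
    if unknown:
        raise TypeError("unexpected keyword arguments: %s" % sorted(unknown))
    # Track whether `value` was actually passed, so None can still be a
    # legitimate user-supplied replacement value.
    value_given = len(args) >= 1 or "value" in kwargs
    value = args[0] if len(args) >= 1 else kwargs.get("value")
    subset = args[1] if len(args) >= 2 else kwargs.get("subset")
    if not value_given and not isinstance(to_replace, dict):
        raise TypeError("value argument is required when to_replace is not a dict")
    return to_replace, value, subset
```

The trade-off is that `help(df.replace)` then only shows
`replace(self, to_replace, *args, **kwargs)`, so the real parameters disappear
from pydoc as well.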
To be honest, it seems Pandas's `replace` also has `None` as the default value -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
---