itholic commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r955514877
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3358,7 +3451,20 @@ def fillna(
Examples
--------
- >>> df4.na.fill(50).show()
+ Fill all null values with 50 when the data type of the column is an integer
+
+ >>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
+ ... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
+ >>> df.show()
+ +----+------+-----+
+ | age|height| name|
+ +----+------+-----+
+ | 10| 80|Alice|
+ | 5| null| Bob|
+ |null| null| Tom|
+ |null| null| null|
+ +----+------+-----+
+ >>> df.na.fill(50).show()
Review Comment:
ditto?
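For context on what the doctest is illustrating: `df.na.fill(50)` only fills columns whose type matches the fill value, so the string column `name` stays null. A minimal plain-Python sketch of that type-matched fill semantics (`fill_typed` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch (hypothetical helper, not the PySpark API) of
# type-matched null filling: an int fill value only replaces None in
# columns whose non-null values are ints, mirroring df.na.fill(50).
def fill_typed(rows, value):
    cols = rows[0].keys()
    # Columns where every non-null entry matches the fill value's type.
    fillable = {
        c for c in cols
        if any(r[c] is not None for r in rows)
        and all(isinstance(r[c], type(value)) for r in rows if r[c] is not None)
    }
    return [
        {c: (value if r[c] is None and c in fillable else r[c]) for c in cols}
        for r in rows
    ]

rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]
filled = fill_typed(rows, 50)
# "name" stays None because it is a string column; age/height nulls become 50.
```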
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4064,11 +4200,31 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":
# type: ignore[misc]
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
>>> df.drop('age').collect()
Review Comment:
How about using `show()` here, to make it easier to see that the column has been dropped?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3489,7 +3620,18 @@ def replace( # type: ignore[misc]
Examples
--------
- >>> df4.na.replace(10, 20).show()
+ >>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
+ ... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
+ >>> df.show()
+ +----+------+-----+
+ | age|height| name|
+ +----+------+-----+
+ | 10| 80|Alice|
+ | 5| null| Bob|
+ |null| null| Tom|
+ |null| null| null|
+ +----+------+-----+
+ >>> df.na.replace(10, 20).show()
Review Comment:
ditto?
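For context, `df.na.replace(10, 20)` swaps every cell equal to 10 for 20 and leaves nulls untouched. A plain-Python sketch of that semantics (`replace_value` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch (hypothetical helper, not the PySpark API) of
# df.na.replace(10, 20): every cell equal to the target value is
# swapped for the replacement; nulls are left as-is.
def replace_value(rows, to_replace, value):
    return [
        {c: (value if v == to_replace else v) for c, v in r.items()}
        for r in rows
    ]

rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
]
replaced = replace_value(rows, 10, 20)
# Alice's age becomes 20; Bob's row, including the null height, is unchanged.
```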
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
- new column names
+ new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Review Comment:
Seems like this line exceeds 100 characters, which violates the `flake8` rule.
```shell
starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/dataframe.py:4250:101: E501 line too long (128 > 100 characters)
        new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
                                                                                                    ^
1     E501 line too long (128 > 100 characters)
```
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3368,7 +3474,19 @@ def fillna(
| 50| 50| null|
+---+------+-----+
- >>> df5.na.fill(False).show()
+ Fill all null values with ``False`` when the data type of the column is a boolean
+
+ >>> df = spark.createDataFrame([(10, "Alice", None), (5, "Bob", None),
+ ... (None, "Mallory", True)], ["age", "name", "spy"])
+ >>> df.show()
+ +----+-------+----+
+ | age| name| spy|
+ +----+-------+----+
+ | 10| Alice|null|
+ | 5| Bob|null|
+ |null|Mallory|true|
+ +----+-------+----+
+ >>> df.na.fill(False).show()
Review Comment:
ditto?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
- new column names
+ new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Review Comment:
You can run `dev/lint-python` to check whether the static analysis passes.
##########
python/pyspark/sql/dataframe.py:
##########
@@ -1722,8 +1806,17 @@ def dtypes(self) -> List[Tuple[str, str]]:
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"),
+ ... (23, "Alice")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ +---+-----+
>>> df.dtypes
Review Comment:
Can we add a blank line between the examples, each with a short description, for better readability?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4064,11 +4200,31 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":
# type: ignore[misc]
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
>>> df.drop('age').collect()
- [Row(name='Alice'), Row(name='Bob')]
+ [Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
Review Comment:
I think we don't need to create a new DataFrame here, since `drop()` doesn't
remove the column in-place.
e.g.
```python
>>> df.drop('age').show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
>>> df.drop(df.age).show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
```
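The point being made — that `drop()` returns a new DataFrame rather than removing the column in place — can be sketched in plain Python (`drop_column` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch: drop() returns a new row set without the column,
# leaving the original untouched -- mirroring PySpark's immutable DataFrames.
def drop_column(rows, col):
    return [{c: v for c, v in r.items() if c != col} for r in rows]

rows = [{"age": 14, "name": "Tom"}, {"age": 23, "name": "Alice"}]
dropped = drop_column(rows, "age")
# The original rows still contain "age"; the result does not.
```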
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]