itholic commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r955514877
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3358,7 +3451,20 @@ def fillna(
Examples
--------
- >>> df4.na.fill(50).show()
+ Fill all null values with 50 when the data type of the column is an integer
+
+ >>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
+ ... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
+ >>> df.show()
+ +----+------+-----+
+ | age|height| name|
+ +----+------+-----+
+ | 10| 80|Alice|
+ | 5| null| Bob|
+ |null| null| Tom|
+ |null| null| null|
+ +----+------+-----+
+ >>> df.na.fill(50).show()
Review Comment:
ditto?
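For context on what the doctest is illustrating: `df.na.fill(50)` only fills columns whose type matches the fill value, so the string column `name` stays null. A minimal plain-Python sketch of that type-matched fill semantics (`fill_typed` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch (hypothetical helper, not the PySpark API) of
# type-matched null filling: an int fill value only replaces None in
# columns whose non-null values are ints, mirroring df.na.fill(50).
def fill_typed(rows, value):
    cols = rows[0].keys()
    # Columns where every non-null entry matches the fill value's type.
    fillable = {
        c for c in cols
        if any(r[c] is not None for r in rows)
        and all(isinstance(r[c], type(value)) for r in rows if r[c] is not None)
    }
    return [
        {c: (value if r[c] is None and c in fillable else r[c]) for c in cols}
        for r in rows
    ]

rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]
filled = fill_typed(rows, 50)
# "name" stays None because it is a string column; age/height nulls become 50.
```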
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4064,11 +4200,31 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":
# type: ignore[misc]
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
>>> df.drop('age').collect()
Review Comment:
How about using `show()` here, to make it easier to see that the column has been dropped?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3489,7 +3620,18 @@ def replace( # type: ignore[misc]
Examples
--------
- >>> df4.na.replace(10, 20).show()
+ >>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
+ ... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
+ >>> df.show()
+ +----+------+-----+
+ | age|height| name|
+ +----+------+-----+
+ | 10| 80|Alice|
+ | 5| null| Bob|
+ |null| null| Tom|
+ |null| null| null|
+ +----+------+-----+
+ >>> df.na.replace(10, 20).show()
Review Comment:
ditto?
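For context, `df.na.replace(10, 20)` swaps every cell equal to 10 for 20 and leaves nulls untouched. A plain-Python sketch of that semantics (`replace_value` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch (hypothetical helper, not the PySpark API) of
# df.na.replace(10, 20): every cell equal to the target value is
# swapped for the replacement; nulls are left as-is.
def replace_value(rows, to_replace, value):
    return [
        {c: (value if v == to_replace else v) for c, v in r.items()}
        for r in rows
    ]

rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
]
replaced = replace_value(rows, 10, 20)
# Alice's age becomes 20; Bob's row, including the null height, is unchanged.
```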
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
- new column names
+ new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Review Comment:
Seems like this line exceeds 100 characters, which violates the `flake8` rule.
```shell
starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/dataframe.py:4250:101: E501 line too long (128 > 100 characters)
        new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
                                                                                                    ^
1     E501 line too long (128 > 100 characters)
```
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3368,7 +3474,19 @@ def fillna(
| 50| 50| null|
+---+------+-----+
- >>> df5.na.fill(False).show()
+ Fill all null values with ``False`` when the data type of the column is a boolean
+
+ >>> df = spark.createDataFrame([(10, "Alice", None), (5, "Bob", None),
+ ... (None, "Mallory", True)], ["age", "name", "spy"])
+ >>> df.show()
+ +----+-------+----+
+ | age| name| spy|
+ +----+-------+----+
+ | 10| Alice|null|
+ | 5| Bob|null|
+ |null|Mallory|true|
+ +----+-------+----+
+ >>> df.na.fill(False).show()
Review Comment:
ditto?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
- new column names
+ new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Review Comment:
You can run `dev/lint-python` to check whether the static analysis passes.
##########
python/pyspark/sql/dataframe.py:
##########
@@ -1722,8 +1806,17 @@ def dtypes(self) -> List[Tuple[str, str]]:
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"),
+ ... (23, "Alice")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ +---+-----+
>>> df.dtypes
Review Comment:
Can we add a blank line between the examples, each with a short description, for better readability?
##########
python/pyspark/sql/dataframe.py:
##########
@@ -4064,11 +4200,31 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":
# type: ignore[misc]
Examples
--------
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
>>> df.drop('age').collect()
- [Row(name='Alice'), Row(name='Bob')]
+ [Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
+ >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+ ... (16, "Bob")], ["age", "name"])
+ >>> df.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 14| Tom|
+ | 23|Alice|
+ | 16| Bob|
+ +---+-----+
Review Comment:
I think we don't need to create a new DataFrame here, since `drop()` doesn't
remove the column in-place.
e.g.
```python
>>> df.drop('age').show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
>>> df.drop(df.age).show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
```
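The point being made — that `drop()` returns a new DataFrame rather than removing the column in place — can be sketched in plain Python (`drop_column` is a hypothetical helper, not the PySpark API):

```python
# Plain-Python sketch: drop() returns a new row set without the column,
# leaving the original untouched -- mirroring PySpark's immutable DataFrames.
def drop_column(rows, col):
    return [{c: v for c, v in r.items() if c != col} for r in rows]

rows = [{"age": 14, "name": "Tom"}, {"age": 23, "name": "Alice"}]
dropped = drop_column(rows, "age")
# The original rows still contain "age"; the result does not.
```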
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]