zhengruifeng commented on code in PR #42151:
URL: https://github.com/apache/spark/pull/42151#discussion_r1274258295


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3729,63 +3729,90 @@ def observe(
             )
 
     def union(self, other: "DataFrame") -> "DataFrame":
-        """Return a new :class:`DataFrame` containing union of rows in this 
and another
+        """Return a new :class:`DataFrame` containing the union of rows in 
this and another
         :class:`DataFrame`.
 
         .. versionadded:: 2.0.0
-
-        .. versionchanged:: 3.4.0
-            Supports Spark Connect.
+        .. versionchanged:: 3.4.0 Supports Spark Connect.
 
         Parameters
         ----------
         other : :class:`DataFrame`
-            Another :class:`DataFrame` that needs to be unioned
+            Another :class:`DataFrame` that needs to be unioned.
 
         Returns
         -------
         :class:`DataFrame`
+            A new :class:`DataFrame` containing the combined rows with corresponding columns.
 
         See Also
         --------
         DataFrame.unionAll
 
         Notes
         -----
-        This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
-        (that does deduplication of elements), use this function followed by :func:`distinct`.
-
-        Also as standard in SQL, this function resolves columns by position (not by name).
+        - This method performs a SQL-style set union of the rows from both `DataFrame` objects,
+        with no automatic deduplication of elements.
+        - Use the `distinct()` method to perform deduplication of rows.
+        - The method resolves columns by position (not by name), following the standard behavior
+        in SQL.
+        - Alias: The `union` method was previously named `unionAll` in versions before 2.0.0.
 
         Examples
         --------
-        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
-        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
-        >>> df1.union(df2).show()
-        +----+----+----+
-        |col0|col1|col2|
-        +----+----+----+
-        |   1|   2|   3|
-        |   4|   5|   6|
-        +----+----+----+
-        >>> df1.union(df1).show()
-        +----+----+----+
-        |col0|col1|col2|
-        +----+----+----+
-        |   1|   2|   3|
-        |   1|   2|   3|
-        +----+----+----+
+        Example 1: Combining two DataFrames with the same schema
+        >>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
+        >>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
+        >>> df3 = df1.union(df2)
+        >>> df3.show()
+        +---+-----+
+        | id|value|
+        +---+-----+
+        |  1|    A|
+        |  2|    B|
+        |  3|    C|
+        |  4|    D|
+        +---+-----+
+
+        Example 2: Combining two DataFrames with different schemas

Review Comment:
   this example is pretty awesome, since I didn't know `union`'s type coercion before
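
   For readers of this thread, the position-based resolution described in the Notes above can be sketched as follows. This is a minimal sketch assuming a local PySpark session; `unionByName` is included only to contrast name-based resolution.

   ```python
   from pyspark.sql import SparkSession

   # A minimal local session, for demonstration only.
   spark = SparkSession.builder.master("local[1]").appName("union-demo").getOrCreate()

   df1 = spark.createDataFrame([(1, 2, 3)], ["col0", "col1", "col2"])
   df2 = spark.createDataFrame([(4, 5, 6)], ["col1", "col2", "col0"])

   # union() resolves columns by position, so df2's values land under
   # df1's headers regardless of df2's own column names.
   print(df1.union(df2).collect())

   # unionByName() resolves by column name instead, reordering df2's
   # values to (col0=6, col1=4, col2=5).
   print(df1.unionByName(df2).collect())
   ```

   Since `union` keeps every row, following it with `distinct()` gives SQL's deduplicating `UNION` semantics.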



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

