zhengruifeng commented on code in PR #42151:
URL: https://github.com/apache/spark/pull/42151#discussion_r1275746720
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3729,63 +3729,90 @@ def observe(
)
def union(self, other: "DataFrame") -> "DataFrame":
- """Return a new :class:`DataFrame` containing union of rows in this and another
+ """Return a new :class:`DataFrame` containing the union of rows in this and another
:class:`DataFrame`.
.. versionadded:: 2.0.0
-
- .. versionchanged:: 3.4.0
- Supports Spark Connect.
+ .. versionchanged:: 3.4.0 Supports Spark Connect.
Parameters
----------
other : :class:`DataFrame`
- Another :class:`DataFrame` that needs to be unioned
+ Another :class:`DataFrame` that needs to be unioned.
Returns
-------
:class:`DataFrame`
+ A new :class:`DataFrame` containing the combined rows with corresponding columns.
See Also
--------
DataFrame.unionAll
Notes
-----
- This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
- (that does deduplication of elements), use this function followed by :func:`distinct`.
-
- Also as standard in SQL, this function resolves columns by position (not by name).
+ - This method performs a SQL-style set union of the rows from both `DataFrame` objects,
+ with no automatic deduplication of elements.
+ - Use the `distinct()` method to perform deduplication of rows.
+ - The method resolves columns by position (not by name), following the standard behavior
+ in SQL.
+ - Alias: The `union` method was previously named `unionAll` in versions before 2.0.0.
Examples
--------
- >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
- >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
- >>> df1.union(df2).show()
- +----+----+----+
- |col0|col1|col2|
- +----+----+----+
- | 1| 2| 3|
- | 4| 5| 6|
- +----+----+----+
- >>> df1.union(df1).show()
- +----+----+----+
- |col0|col1|col2|
- +----+----+----+
- | 1| 2| 3|
- | 1| 2| 3|
- +----+----+----+
+ Example 1: Combining two DataFrames with the same schema
+ >>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
+ >>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
+ >>> df3 = df1.union(df2)
+ >>> df3.show()
+ +---+-----+
+ | id|value|
+ +---+-----+
+ | 1| A|
+ | 2| B|
+ | 3| C|
+ | 4| D|
+ +---+-----+
+
+ Example 2: Combining two DataFrames with different schemas
+ >>> from pyspark.sql.functions import lit
Review Comment:
need to add the missing import
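
As an aside on the docstring note under review: the by-position, no-deduplication semantics of `union` can be illustrated without a Spark session. The sketch below is a plain-Python analogy (the `union_by_position` helper and tuple-based "DataFrame" representation are illustrative inventions, not PySpark API), assuming each frame is a pair of column names and row tuples:

```python
# Minimal sketch of union's by-position semantics, independent of Spark.
# Each "DataFrame" here is just (column_names, rows) -- a hypothetical
# stand-in for pyspark's DataFrame, used only to show the resolution rule.
def union_by_position(df1, df2):
    cols1, rows1 = df1
    cols2, rows2 = df2
    assert len(cols1) == len(cols2), "union requires equal column counts"
    # Column NAMES of df2 are ignored: rows are appended positionally,
    # the result keeps df1's column names, and nothing is deduplicated.
    return (cols1, rows1 + rows2)

df1 = (["col0", "col1", "col2"], [(1, 2, 3)])
df2 = (["col1", "col2", "col0"], [(4, 5, 6)])  # different name order
cols, rows = union_by_position(df1, df2)
print(cols)   # ['col0', 'col1', 'col2']
print(rows)   # [(1, 2, 3), (4, 5, 6)]
```

This mirrors why `df1.union(df2)` in the original doctest places df2's `col1` values under df1's `col0`: names never enter into the matching. PySpark's `unionByName` is the name-resolving counterpart.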
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]