HyukjinKwon commented on code in PR #43039:
URL: https://github.com/apache/spark/pull/43039#discussion_r1355911443
##########
python/pyspark/sql/dataframe.py:
##########
@@ -2646,67 +2647,147 @@ def join(
Examples
--------
- The following performs a full outer join between ``df1`` and ``df2``.
+ The following examples demonstrate various join types between ``df1``
and ``df2``.
+ >>> import pyspark.sql.functions as sf
>>> from pyspark.sql import Row
- >>> from pyspark.sql.functions import desc
- >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age",
"name")
- >>> df2 = spark.createDataFrame([Row(height=80, name="Tom"),
Row(height=85, name="Bob")])
- >>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5,
name="Bob")])
- >>> df4 = spark.createDataFrame([
- ... Row(age=10, height=80, name="Alice"),
- ... Row(age=5, height=None, name="Bob"),
- ... Row(age=None, height=None, name="Tom"),
- ... Row(age=None, height=None, name=None),
+ >>> df = spark.createDataFrame([Row(name="Alice", age=2),
Row(name="Bob", age=5)])
+ >>> df2 = spark.createDataFrame([Row(name="Tom", height=80),
Row(name="Bob", height=85)])
+ >>> df3 = spark.createDataFrame([
+ ... Row(name="Alice", age=10, height=80),
+ ... Row(name="Bob", age=5, height=None),
+ ... Row(name="Tom", age=None, height=None),
+ ... Row(name=None, age=None, height=None),
... ])
Inner join on columns (default)
- >>> df.join(df2, 'name').select(df.name, df2.height).show()
- +----+------+
- |name|height|
- +----+------+
- | Bob| 85|
- +----+------+
- >>> df.join(df4, ['name', 'age']).select(df.name, df.age).show()
- +----+---+
- |name|age|
- +----+---+
- | Bob| 5|
- +----+---+
-
- Outer join for both DataFrames on the 'name' column.
-
- >>> df.join(df2, df.name == df2.name, 'outer').select(
- ... df.name, df2.height).sort(desc("name")).show()
+ >>> df.join(df2, "name").show()
+ +----+---+------+
+ |name|age|height|
+ +----+---+------+
+ | Bob| 5| 85|
+ +----+---+------+
+
+ >>> df.join(df3, ["name", "age"]).show()
+ +----+---+------+
+ |name|age|height|
+ +----+---+------+
+ | Bob| 5| NULL|
+ +----+---+------+
+
+ Outer join on a single column with an explicit join condition.
+
+ When the join condition is explicitly stated: `df.name == df2.name`,
this will
+ produce all records where the names match, as well as those that don't
(since
+ it's an outer join). If there are names in `df2` that are not present
in `df`,
+ they will appear with `NULL` in the `name` column of `df`, and vice
versa for `df2`.
+
+ >>> joined = df.join(df2, df.name == df2.name,
"outer").sort(sf.desc(df.name))
+ >>> joined.show()
Review Comment:
Can we exclude those examples in this PR, and mind filing JIRAs for both
issues @allisonwang-db?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]