binyang created SPARK-42760:
-------------------------------
Summary: The partition of result data frame of join is always 1
Key: SPARK-42760
URL: https://issues.apache.org/jira/browse/SPARK-42760
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 3.3.2
Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook,
local mode
Reporter: binyang
I am using pyspark. The partition of result data frame of join is always 1.
Here is my code from
https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
print(spark.version)
def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
df1 = spark.range(1, 1000).repartition(data_partitions)
df2 = spark.range(1, 2000).repartition(data_partitions)
df3 = spark.range(1, 3000).repartition(data_partitions)
print("Data partitions is: {}. Shuffle partitions is
{}".format(data_partitions, shuffle_partitions))
print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
df = (df1.join(df2, df1.id == df2.id)
.join(df3, df1.id == df3.id))
print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
example_shuffle_partitions()
In Spark 3.0.3, it prints out:
3.0.3
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 4
However, it prints out the following in the latest 3.3.2
3.3.2
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 1
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]