binyang created SPARK-42760:
-------------------------------

             Summary: The partition of result data frame of join is always 1
                 Key: SPARK-42760
                 URL: https://issues.apache.org/jira/browse/SPARK-42760
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 3.3.2
         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
local mode
            Reporter: binyang


I am using pyspark. The partition of result data frame of join is always 1.

Here is my code from 
https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join

 

print(spark.version)

def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
    df1 = spark.range(1, 1000).repartition(data_partitions)
    df2 = spark.range(1, 2000).repartition(data_partitions)
    df3 = spark.range(1, 3000).repartition(data_partitions)

    print("Data partitions is: {}. Shuffle partitions is 
{}".format(data_partitions, shuffle_partitions))
    print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))

    df = (df1.join(df2, df1.id == df2.id)
          .join(df3, df1.id == df3.id))

    print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))

example_shuffle_partitions()

 


In Spark 3.0.3, it prints out:
3.0.3
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 4


However, it prints out the following in the latest 3.3.2
3.3.2
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to