Naveen Nagaraj created SPARK-38172:
--------------------------------------

             Summary: Adaptive coalesce not working with df persist
                 Key: SPARK-38172
                 URL: https://issues.apache.org/jira/browse/SPARK-38172
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: OS: Linux

Spark Version: 3.2.1
            Reporter: Naveen Nagaraj
         Attachments: image-2022-02-10-15-32-30-355.png

{quote}
val spark = SparkSession.builder().master("local[4]").appName("Test")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
  .getOrCreate()

val df = spark.read.csv("<Input File Path>")
val df1 = df.distinct()
df1.persist() // Removing this line makes the code work as expected
df1.write.csv("<Output File Path>")
{quote}
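To make the symptom easy to verify, here is a small helper for counting the part files each run writes. This is my own sketch, not part of the original repro: the name countPartFiles is hypothetical, and it assumes the output path is on the local filesystem.

{quote}
import java.io.File

// Counts the "part-" files Spark wrote to a local output directory.
def countPartFiles(dir: String): Int =
  Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    .count(_.getName.startsWith("part-"))

// Without persist this should report about 4 files; with persist it reports 200.
println(countPartFiles("<Output File Path>"))
{quote}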
I have an input file of 2 GB, which is read as 16 partitions of 128 MB each. I have enabled adaptive query execution (AQE) to coalesce partitions after the shuffle.

Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, which is expected: the distinct output totals roughly 200 MB, and with advisoryPartitionSizeInBytes set to 50m, AQE coalesces the shuffle into 4 partitions.

!image-2022-02-10-15-29-31-708.png!
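There is also an in-session way to observe the coalesced count (again my own sketch; my understanding is that with AQE enabled, calling .rdd materializes the final adaptive plan, so getNumPartitions should reflect the post-coalesce value):

{quote}
val df1 = spark.read.csv("<Input File Path>").distinct()
// Without persist: expect ~4 partitions (AQE coalesces toward the 50 MB advisory size)
println(df1.rdd.getNumPartitions)
{quote}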

If I include df1.persist, Spark writes 200 partition files (adaptive coalesce is not applied). Notably, 200 matches the default spark.sql.shuffle.partitions, which suggests the cached plan's shuffle is compiled without AQE coalescing. With persist:

!image-2022-02-10-15-29-57-176.png!
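A sketch of a possible workaround, assuming that spark.sql.optimizer.canChangeCachedPlanOutputPartitioning (default false) lets AQE-optimized partitioning apply to cached plans; I have not verified that it resolves this exact case:

{quote}
val spark = SparkSession.builder().master("local[4]").appName("Test")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
  // Assumption: allows the cached plan to keep the AQE-coalesced partitioning
  .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
  .getOrCreate()

val df1 = spark.read.csv("<Input File Path>").distinct()
df1.persist()
df1.write.csv("<Output File Path>")
{quote}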


