[
https://issues.apache.org/jira/browse/SPARK-38172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Naveen Nagaraj updated SPARK-38172:
-----------------------------------
Description:
{code:java}
val spark = SparkSession.builder().master("local[4]").appName("Test")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
  .getOrCreate()

val df = spark.read.csv("<Input File Path>")
val df1 = df.distinct()
df1.persist() // Removing this line makes the code work as expected
df1.write.csv("<Output File Path>")
{code}
Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, which is expected:
[https://i.stack.imgur.com/tDxpV.png]
If I include df1.persist, Spark writes 200 partition files, i.e. adaptive partition coalescing does not kick in:
[https://i.stack.imgur.com/W13hA.png]
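A possible workaround (editor's sketch, not verified against 3.2.1): since the cached plan appears to keep the 200 shuffle partitions that AQE would otherwise have coalesced, the partition count can be reduced explicitly before the write. The target of 4 partitions is an assumption derived from the observed ~200 MB of distinct output divided by the 50 MB advisory partition size configured above:
{code:java}
// Workaround sketch, assuming the persisted plan bypasses AQE coalescing:
// coalesce explicitly before writing. The target of 4 partitions is an
// assumption (~200 MB of output / 50 MB advisory partition size).
val df1 = df.distinct()
df1.persist()
df1.coalesce(4).write.csv("<Output File Path>")
{code}
coalesce(4) narrows the existing 200 partitions without an extra shuffle; repartition(4) would also work but triggers a full shuffle.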
was:
{quote}val spark = SparkSession.builder().master("local[4]").appName("Test")
.config("spark.sql.adaptive.enabled", "true")
.config("spark.sql.adaptive.coalescePartitions.enabled", "true")
.config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
.config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
.config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
.getOrCreate()
val df = spark.read.csv("<Input File Path>")
val df1 = df.distinct()
df1.persist() // Removing this line makes the code work as expected
df1.write.csv("<Output File Path>")
{quote}
Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, which is expected:
https://i.stack.imgur.com/tDxpV.png
If I include df1.persist, Spark writes 200 partition files, i.e. adaptive partition coalescing does not kick in:
https://i.stack.imgur.com/W13hA.png
> Adaptive coalesce not working with df persist
> ---------------------------------------------
>
> Key: SPARK-38172
> URL: https://issues.apache.org/jira/browse/SPARK-38172
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Environment: OS: Linux
> Spark Version: 3.2.3
> Reporter: Naveen Nagaraj
> Priority: Major
> Attachments: image-2022-02-10-15-32-30-355.png,
> image-2022-02-10-15-33-08-018.png, image-2022-02-10-15-33-32-607.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]