Priyanka Raju created SPARK-44378:
-------------------------------------

             Summary: Jobs that have join & have .rdd calls get executed 2x 
when AQE is enabled.
                 Key: SPARK-44378
                 URL: https://issues.apache.org/jira/browse/SPARK-44378
             Project: Spark
          Issue Type: Question
          Components: Spark Submit
    Affects Versions: 3.1.2
            Reporter: Priyanka Raju


We have a few spark scala jobs that are currently running in production. Most 
jobs typically use Dataset, Dataframes. There is a small code in our custom 
library code, that makes rdd calls example to check if the dataframe is empty: 
df.rdd.getNumPartitions == 0

When I enable aqe for these jobs, this .rdd is converted into a separate job of 
it's own and the entire dag is executed 2x, taking 2x more time. This does not 
happen when AQE is disabled. Why does this happen and what is the best way to 
fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
    id: Int,
    name: String
 )
 
    val partCount = 4
    val input1 = (0 until 100).map(part => Record(part, "a"))
 
    val input2 = (100 until 110).map(part => Record(part, "c"))
 
    implicit val enc: Encoder[Record] = Encoders.product[Record]
 
    val ds1 = spark.createDataset(
      spark.sparkContext
        .parallelize(input1, partCount)
    )
 
    val ds2 = spark.createDataset(
      spark.sparkContext
        .parallelize(input2, partCount)
    )
 
    val ds3 = ds1.join(ds2, Seq("id"))
    val l = ds3.count()
 
    val incomingPartitions = ds3.rdd.getNumPartitions
    log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to