[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

GitBox Fri, 23 Oct 2020 14:56:28 -0700


c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511167105




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(
+      session: SparkSession,
+      configurations: Seq[String]): SparkSession = {

Review comment:
       Sure, updated; it's safer.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -55,6 +56,17 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
   @transient @volatile
   private var cachedData = IndexedSeq[CachedData]()
 
+  /**
+   * Configurations needs to be turned off, to avoid regression for cached query, so that the
+   * outputPartitioning of the underlying cached query plan can be leveraged later.
+   * Configurations include:
+   * 1. AQE
+   * 2. Automatic bucketed table scan
+   */
+  private val configsOff = Seq(

Review comment:
       @maropu - sure, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Sounds good, it makes sense to me. Moved to `object SparkSession`.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -79,10 +91,11 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     if (lookupCachedData(planToCache).nonEmpty) {
       logWarning("Asked to cache already cached data.")
     } else {
-      // Turn off AQE so that the outputPartitioning of the underlying plan can be leveraged.
-      val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
-      val inMemoryRelation = sessionWithAqeOff.withActive {
-        val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
+      // Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
+      val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(

Review comment:
       @maropu - updated.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       @cloud-fan - sure, updated.




