sunchao commented on code in PR #36067:
URL: https://github.com/apache/spark/pull/36067#discussion_r842239376


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -62,6 +66,14 @@ object CommandUtils extends Logging {
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
+
+        if (withAutoPartitionStats) {
+          if (partitionSpec.nonEmpty) {
+            AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)
+          } else if (table.partitionColumnNames.nonEmpty) {
+            AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)

Review Comment:
   these two branches look the same, can we combine them?
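   
   e.g. just a sketch (untested): since both arms run the same command, a single condition could cover both:
   
   ```scala
   // one branch instead of two: analyze partitions whenever the table is
   // partitioned, with or without an explicit partition spec
   if (withAutoPartitionStats &&
       (partitionSpec.nonEmpty || table.partitionColumnNames.nonEmpty)) {
     AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)
   }
   ```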



##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:
##########
@@ -114,7 +114,11 @@ case class InsertIntoHiveTable(
     CommandUtils.uncacheTableOrView(sparkSession, table.identifier.quotedString)
     sparkSession.sessionState.catalog.refreshTable(table.identifier)
 
-    CommandUtils.updateTableStats(sparkSession, table)
+    val partitionSpec = partition.map {
+      case (key, Some(null)) => key -> Some(ExternalCatalogUtils.DEFAULT_PARTITION_NAME)
+      case other => other
+    }
+    CommandUtils.updateTableStats(sparkSession, table, partitionSpec)

Review Comment:
   hmm does this handle [dynamic partition insert](https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-DynamicPartitionInserts)?
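   
   For reference (my understanding, worth double-checking): in a dynamic partition insert the dynamic columns arrive as `None` rather than `Some(null)`, so the match above would pass them through unchanged, e.g.
   
   ```scala
   // INSERT OVERWRITE TABLE t PARTITION (ds = '2022-01-01', hr) SELECT ...
   // static column -> Some(value), dynamic column -> None
   val partition: Map[String, Option[String]] =
     Map("ds" -> Some("2022-01-01"), "hr" -> None)
   ```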
 



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -62,6 +66,14 @@ object CommandUtils extends Logging {
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
+
+        if (withAutoPartitionStats) {
+          if (partitionSpec.nonEmpty) {
+            AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)
+          } else if (table.partitionColumnNames.nonEmpty) {
+            AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)

Review Comment:
   I'm not sure whether we want to always do a full scan of the partitions to get the number of rows - it could be rather expensive?
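   
   A rough sketch of what I mean (the config key below is hypothetical, purely illustrative) - keeping the per-partition scan opt-in:
   
   ```scala
   // hypothetical flag: only run the per-partition ANALYZE (which scans the
   // partition data to count rows) when the user explicitly opts in
   val autoPartitionStats = sparkSession.conf
     .get("spark.sql.statistics.autoPartitionStats.enabled", "false").toBoolean
   if (autoPartitionStats && table.partitionColumnNames.nonEmpty) {
     AnalyzePartitionCommand(table.identifier, partitionSpec, false).run(sparkSession)
   }
   ```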



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

