[GitHub] [spark] gatorsmile commented on a change in pull request #26569: [SPARK-29938] [SQL] Add batching support in Alter table add partition flow

GitBox Mon, 13 Jan 2020 19:30:14 -0800

gatorsmile commented on a change in pull request #26569: [SPARK-29938] [SQL] Add batching support in Alter table add partition flow
URL: https://github.com/apache/spark/pull/26569#discussion_r366138160


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
 ##########
 @@ -476,14 +476,26 @@ case class AlterTableAddPartitionCommand(
       CatalogTablePartition(normalizedSpec, table.storage.copy(
         locationUri = location.map(CatalogUtils.stringToURI)))
     }
-    catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
+
+    // Hive metastore may not have enough memory to handle millions of partitions in single RPC.
+    // Also the request to metastore times out when adding lot of partitions in one shot.
+    // we should split them into smaller batches
+    val batchSize = 100
+    parts.toIterator.grouped(batchSize).foreach { batch =>
+      catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
+    }
 
     if (table.stats.nonEmpty) {
       if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
-        val addedSize = parts.map { part =>
-          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
-            part.storage.locationUri)
-        }.sum
+        def calculatePartSize(part: CatalogTablePartition) = CommandUtils.calculateLocationSize(
+          sparkSession.sessionState, table.identifier, part.storage.locationUri)
+        val threshold = sparkSession.sparkContext.conf.get(RDD_PARALLEL_LISTING_THRESHOLD)
+        val partSizes = if (parts.length > threshold) {
+            ThreadUtils.parmap(parts, "gatheringNewPartitionStats", 8)(calculatePartSize)
 
 Review comment:
   We should avoid duplicating the same logic in the code base. This fix does 
not look good to me; we should reuse InMemoryFileIndex.bulkListLeafFiles instead.
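
   For reference, the batching step in the diff above can be sketched in 
isolation. This is a minimal, self-contained illustration only: 
`BatchingSketch`, `createInBatches`, and the `create` callback are 
hypothetical names standing in for the call to `catalog.createPartitions`, 
and nothing here touches the parallel size calculation that the comment 
objects to.

```scala
object BatchingSketch {
  // Splits `parts` into batches of at most `batchSize` elements and passes
  // each batch to `create`. `create` is a hypothetical stand-in for a
  // metastore call such as catalog.createPartitions in the diff.
  def createInBatches[A](parts: Seq[A], batchSize: Int)(create: Seq[A] => Unit): Unit = {
    require(batchSize > 0, "batchSize must be positive")
    parts.iterator.grouped(batchSize).foreach(batch => create(batch))
  }

  def main(args: Array[String]): Unit = {
    // Record the size of every batch that reaches the callback.
    val sizes = scala.collection.mutable.ArrayBuffer.empty[Int]
    val parts = (1 to 250).map(i => s"p=$i")
    createInBatches(parts, 100)(batch => sizes += batch.size)
    println(sizes.mkString(", ")) // prints "100, 100, 50"
  }
}
```

   Keeping the batching in one helper means the batch size is the only value 
to tune, and an empty input simply results in no calls.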

