[GitHub] [spark] HyukjinKwon commented on a change in pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

GitBox Sun, 06 Jun 2021 18:38:30 -0700


HyukjinKwon commented on a change in pull request #28032:
URL: https://github.com/apache/spark/pull/28032#discussion_r646222780




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2435,6 +2435,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val REPARTITION_BEFORE_INSERT =
+    buildConf("spark.sql.execution.repartitionBeforeInsert")
+      .internal()
+      .doc("When perform a insert into partitioned table. Turn on this config to " +
+        "insert a repartition by dynamic partition columns to ease pressure on the NameNode " +

Review comment:
       Do you mind explaining how it reduces the pressure on the NameNode? I
don't think people can understand what happens by reading the docs.
   
   I would describe it concisely and in a way that is easy to understand
(e.g., pp. 44 - 48 of
https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark). Those
slides cover bucketing, but we could reuse parts of them for partitioning,
such as the number of files and the diagrams.
   
   Probably we should describe the pros and cons, and when it should be turned
on. Feel free to write something in
https://github.com/apache/spark/blob/master/docs/sql-performance-tuning.md too.
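
To make the number-of-files point concrete, here is a minimal sketch in plain Scala. It is not Spark code and not the implementation in this PR. It assumes only that each write task creates one file per distinct dynamic partition value it receives; the object name, method names, and sample dates are hypothetical and exist only for illustration.

```scala
// Hypothetical model, NOT Spark code: counts output files for a
// dynamic-partition insert with and without a repartition step.
object RepartitionBeforeInsertSketch {
  // Each row is reduced to the value of its dynamic partition column.
  type PartitionValue = String

  // Assumption: a write task creates one file per distinct partition
  // value it receives.
  def filesWritten(tasks: Seq[Seq[PartitionValue]]): Int =
    tasks.map(_.distinct.size).sum

  // Models a repartition by the partition column: all rows with the same
  // value are routed to the same task (chosen by hash), so several values
  // may share a task, but no value is spread over more than one task.
  def repartitionByValue(
      tasks: Seq[Seq[PartitionValue]],
      numTasks: Int): Seq[Seq[PartitionValue]] = {
    val grouped = tasks.flatten.groupBy(v => Math.floorMod(v.hashCode, numTasks))
    (0 until numTasks).map(i => grouped.getOrElse(i, Seq.empty))
  }

  def main(args: Array[String]): Unit = {
    val days = Seq("2021-06-01", "2021-06-02", "2021-06-03")
    // 4 upstream tasks, each holding rows for all 3 days.
    val before = Seq.fill(4)(days)
    val after = repartitionByValue(before, numTasks = 4)
    println(s"files without repartition: ${filesWritten(before)}") // 4 * 3 = 12
    println(s"files with repartition:    ${filesWritten(after)}")  // 3
  }
}
```

In this model the file count drops from (tasks x partition values) to (partition values), which means fewer files for the NameNode to track. The same model hints at the cost side: every row for a given value lands on a single task, so a skewed partition is written by one task, and the repartition itself adds a shuffle.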
   




