[GitHub] [spark] AngersZhuuuu commented on a change in pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

GitBox Mon, 28 Sep 2020 00:22:08 -0700


AngersZhuuuu commented on a change in pull request #28032:
URL: https://github.com/apache/spark/pull/28032#discussion_r495736123




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2038,6 +2038,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val REPARTITION_BEFORE_INSERT =
+    buildConf("spark.sql.execution.repartitionBeforeInsert")

Review comment:
       > If this is better in 90% and worse in 10% cases , I might be okay. If 
it's better in 50% and worse in 50% cases, is it worthwhile?
   
   For a dynamic partition write, the number of output files is up to 
`(shuffle partition size) * (table partition size)`, where "size" means the 
number of partitions. After the repartition, the number of files is 
`(table partition size)`. The shuffle added by `RepartitionByExpression` is 
quick, since it involves no other computation. What we should be concerned 
about is data skew; with AQE we can control each partition's size to match the 
expected file size. 
   IMO, it would be better if this PR included a test case with data skew that 
shows how it behaves with AQE.
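
   As a rough illustration of the file-count arithmetic above, here is a toy 
model in plain Scala. It is not Spark code: `FileCountSketch` and its methods 
are hypothetical names, and the round-robin row layout and hash-based routing 
are simplifying assumptions. It only counts how many (task, table partition) 
pairs, i.e. output files, each strategy would produce.

```scala
// Toy model (an assumption, not Spark internals): one writer task per shuffle
// partition, and each task opens one file per distinct table partition value
// it receives.
object FileCountSketch {

  // Without the extra repartition: rows are spread round-robin over the tasks,
  // so every task tends to see every table partition value.
  def filesWithoutRepartition(rows: Seq[String], numShufflePartitions: Int): Int =
    rows.zipWithIndex
      .map { case (partValue, i) => (i % numShufflePartitions, partValue) }
      .distinct
      .size

  // With a repartition by the table partition column: all rows that share a
  // value are routed to the same task, so each value yields exactly one file.
  def filesWithRepartition(rows: Seq[String], numShufflePartitions: Int): Int =
    rows
      .map(partValue => (Math.floorMod(partValue.hashCode, numShufflePartitions), partValue))
      .distinct
      .size

  def main(args: Array[String]): Unit = {
    val tablePartitions = Seq("2020-09-26", "2020-09-27", "2020-09-28")
    // 40 rows per table partition value, interleaved.
    val rows = Seq.fill(40)(tablePartitions).flatten

    println(filesWithoutRepartition(rows, numShufflePartitions = 4)) // 12 = 4 * 3
    println(filesWithRepartition(rows, numShufflePartitions = 4))    // 3
  }
}
```

   In this model the repartitioned write puts all rows of one table partition 
into a single task, which is exactly where the data skew concern comes from.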
   



