koertkuipers edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612
@cloud-fan So how can I repartition by a column while the number of partitions is set smartly (based on data size), instead of using a user-specified or hardcoded value? Repartitioning a dataframe by columns is fairly typical before writing to a partitioned file sink, to avoid too many files per directory. See for example: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L472

In these situations it's beneficial to write out the optimal number of files, not a fixed/hardcoded number. Personally, for `repartition` I would expect the optimal number of files to be written if AQE is enabled and I did not specify the number of partitions. That's why I was so confused by the current results. But that's just my opinion.
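For context, a minimal sketch of the pattern being discussed, assuming hypothetical input/output paths and a hypothetical `date` partition column; whether AQE actually picks a data-size-based partition count for the column-only `repartition` here is exactly the question raised above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("repartition-before-partitioned-write")
  // AQE and shuffle-partition coalescing (Spark 3.0+); the expectation
  // voiced above is that these would size the repartition(col) shuffle
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()

// hypothetical input path, for illustration only
val df = spark.read.parquet("/path/to/events")

// Repartition by the partition column, with no numPartitions given, so that
// rows for each output directory are grouped into as few tasks as possible
// before the partitioned write.
df.repartition(col("date"))
  .write
  .partitionBy("date")
  .parquet("/path/to/output")
```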
