koertkuipers edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612
@cloud-fan So how can I repartition by a column while the number of partitions is set smartly (based on data size), instead of using a user-specified or hardcoded value? Repartitioning a dataframe by columns is fairly typical before writing to a partitioned file sink, to avoid too many files per directory. See for example: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L472

In these situations it's beneficial to write out the optimal number of files, not a fixed/hardcoded number. Personally, for `repartition` I would expect the optimal number of files to be written if AQE is enabled and I did not specify the number of partitions. That's why I was so confused by the current results. But that's just my opinion.
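For context, a minimal sketch of the pattern being discussed, assuming hypothetical input/output paths and a hypothetical `date` partition column; whether AQE actually picks a data-size-based partition count for the column-only `repartition` here is exactly the question raised above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("repartition-before-partitioned-write")
  // AQE and shuffle-partition coalescing (Spark 3.0+); the expectation
  // voiced above is that these would size the repartition(col) shuffle
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()

// hypothetical input path, for illustration only
val df = spark.read.parquet("/path/to/events")

// Repartition by the partition column, with no numPartitions given, so that
// rows for each output directory are grouped into as few tasks as possible
// before the partitioned write.
df.repartition(col("date"))
  .write
  .partitionBy("date")
  .parquet("/path/to/output")
```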
