Adaptive Query Execution (AQE) in recent Spark versions should take care of skew during writes. Make sure it is enabled and configured correctly.
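
As a rough illustration of what "enabled and configured" could look like at the session level, here is a minimal sketch. It uses only two properties: `spark.sql.adaptive.enabled` and the `spark.sql.adaptive.advisoryPartitionSizeInBytes` property mentioned further down in this thread. The helper names (`AQE_WRITE_SETTINGS`, `apply_settings`, `as_submit_flags`) are made up for the example, and the 128 MiB value is an arbitrary placeholder, not a recommendation; the right size depends on the table and the cluster.

```python
# Sketch only: session-level AQE settings relevant to write-stage partition sizing.
# The 128 MiB advisory size is an arbitrary example value (an assumption).
AQE_WRITE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(128 * 1024 * 1024),
}


def apply_settings(spark, settings=AQE_WRITE_SETTINGS):
    """Apply each setting via spark.conf.set and return a copy of what was applied.

    `spark` is expected to be a SparkSession (or any object exposing conf.set).
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    return dict(settings)


def as_submit_flags(settings=AQE_WRITE_SETTINGS):
    """Render the same settings as `--conf key=value` strings for spark-submit."""
    return [f"--conf {key}={value}" for key, value in settings.items()]
```

With an existing session, calling `apply_settings(spark)` before running the MERGE INTO would set both properties for that session; `as_submit_flags()` gives the equivalent flags if you prefer to set them at submit time.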

- Anton

On Mon, Apr 14, 2025 at 13:50, namratha mk <nmk...@gmail.com> wrote:

> Hi Ed,
>
> In the latest version of Spark (>3.5), for both hash and range
> distribution modes, we can control the partition size with the Spark
> property "spark.sql.adaptive.advisoryPartitionSizeInBytes". This helps
> with the small files problem.
>
> Regards,
> Namratha
>
> On Mon, Apr 7, 2025 at 8:44 AM Ed Mancebo <edmanc...@gmail.com> wrote:
>
>> Hi all,
>>
>> First time posting here.
>>
>> I’m using MERGE INTO to upsert into a table with daily partitions. More
>> recent days tend to have many more updates, which causes skew in the
>> write stage when write.distribution-mode=hash (the most recent day of
>> data gets assigned to a single task, which takes much longer to finish
>> than older days).
>>
>> I tried write.distribution-mode=range instead, but this only helps a
>> little. I think it does a good job of splitting the most recent days
>> across multiple tasks, but it probably clusters the very oldest/smallest
>> days on a single task, which is slow because it opens and closes too
>> many small files.
>>
>> I’m wondering if there’s a mode that works well for this use case that I
>> may have missed, or if not, is there any appetite for supporting one? One
>> idea is to add an option for a user-specified column in the clustering in
>> SparkDistributionAndOrderingUtil. This would allow the caller to provide
>> an additional column to split up large partitions while writing, without
>> changing the table partitioning.
>>
>> Thanks in advance -
>>
>> Ed