Adaptive Query Execution (AQE) in recent Spark versions should take care of
any skew during writes. Make sure it is enabled (spark.sql.adaptive.enabled)
and configured correctly.
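
For reference, here is a minimal sketch of the AQE settings that are usually worth checking for this. The helper name (to_submit_args) and the 128 MiB value are illustrative assumptions, not recommendations; the property names are standard Spark SQL configs, but whether each one affects a given write depends on the Spark and Iceberg versions in use.

```python
# Illustrative AQE settings to review before a skewed write.
# The values are assumptions for the sake of the example; tune them per workload.
AQE_WRITE_CONF = {
    # Master switch for Adaptive Query Execution.
    "spark.sql.adaptive.enabled": "true",
    # Lets AQE merge small shuffle partitions toward the advisory size.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Lets AQE split skewed partitions produced by a rebalance.
    "spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled": "true",
    # Target shuffle partition size in bytes (128 MiB here, an assumed value).
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(128 * 1024 * 1024),
}


def to_submit_args(conf):
    """Hypothetical helper: render a conf dict as spark-submit --conf flags."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(AQE_WRITE_CONF)))
```

The same keys can be set on an existing session with spark.conf.set(key, value) before running the MERGE.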

- Anton

On Mon, Apr 14, 2025 at 1:50 PM namratha mk <nmk...@gmail.com> wrote:

> Hi Ed,
>
> In recent versions of Spark (>3.5), for both the hash and range
> distribution modes, we can control the partition size with the Spark property
> "spark.sql.adaptive.advisoryPartitionSizeInBytes". This helps with the
> small-files problem.
>
> Regards,
> Namratha
>
> On Mon, Apr 7, 2025 at 8:44 AM Ed Mancebo <edmanc...@gmail.com> wrote:
>
>> Hi all,
>>
>> First time posting here.
>>
>> I’m using MERGE INTO to upsert into a table with daily partitions.  More
>> recent days tend to have many more updates, which is causing skew in the
>> write stage when write.distribution-mode=hash (the most recent day of data
>> will get assigned to a single task, which takes much longer to finish than
>> older days).
>>
>> I tried write.distribution-mode=range instead, but this only helps a
>> little bit.  I think this does a good job of splitting up the most recent
>> days across multiple tasks, but probably clusters the very oldest/smallest
>> days on a single task, which is slow due to opening and closing too many
>> small files.
>>
>> I’m wondering if there’s a mode that works well for this use case that I
>> may have missed, or if not, is there any appetite for supporting one?  One
>> idea is to add an option for a user-specified column in the clustering in
>> SparkDistributionAndOrderingUtil.  This would allow the caller to provide
>> an additional column to split up large partitions while writing, without
>> changing the table partitioning.
>>
>> Thanks in advance -
>>
>> Ed
>>
>>
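
To make the skew described in the original question concrete, here is a small, self-contained simulation. It is only a sketch under stated assumptions: the data is synthetic, task_for is a CRC32-based stand-in rather than Spark's actual partitioner, and SPLITS is a hypothetical number of sub-buckets per day. It compares clustering by day alone with clustering by day plus an extra bucket derived from a record id, which is the kind of additional column proposed in the question.

```python
from collections import Counter
import zlib


def task_for(key, num_tasks):
    # Deterministic stand-in for a hash partitioner (not Spark's real hash).
    return zlib.crc32(repr(key).encode()) % num_tasks


def rows_per_task(rows, num_tasks, key_fn):
    # Count how many rows land on each task for a given clustering key.
    counts = Counter(task_for(key_fn(r), num_tasks) for r in rows)
    return [counts.get(t, 0) for t in range(num_tasks)]


# Synthetic rows of (day, record_id): the most recent day is much larger.
rows = [("2025-04-07", i) for i in range(10_000)]
for d in range(1, 7):
    rows += [(f"2025-04-0{d}", i) for i in range(100)]

NUM_TASKS = 8
SPLITS = 4  # hypothetical number of sub-buckets per day

# Clustering by day only: every row of a given day goes to one task.
by_day = rows_per_task(rows, NUM_TASKS, lambda r: r[0])

# Clustering by (day, bucket): each day is spread over up to SPLITS tasks.
by_day_and_bucket = rows_per_task(
    rows, NUM_TASKS, lambda r: (r[0], r[1] % SPLITS)
)

if __name__ == "__main__":
    print("max rows on one task, day only:    ", max(by_day))
    print("max rows on one task, day + bucket:", max(by_day_and_bucket))
```

With day-only clustering, the busiest task always receives the whole hot day. With the extra bucket, the hot day is cut into SPLITS pieces, although hash collisions can still place several pieces on the same task. The trade-off is that every task that receives rows for a day writes at least one file for that day, so the small days end up with more files.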
