Re: Why is sort required for Spark writing to partitioned table

Pucheng Yang Tue, 25 Apr 2023 11:05:57 -0700

Hi, to confirm:

In the doc,
https://iceberg.apache.org/docs/1.0.0/spark-writes/#writing-to-partitioned-tables,
it says "Explicit sort is necessary because Spark doesn’t allow Iceberg to
request a sort before writing as of Spark 3.0. SPARK-23889
<https://issues.apache.org/jira/browse/SPARK-23889> is filed to enable
Iceberg to require specific distribution & sort order to Spark."


I found that all the relevant JIRAs under SPARK-23889
<https://issues.apache.org/jira/browse/SPARK-23889> were resolved in
Spark 3.2.0. Does that mean an explicit sort is no longer needed from
Spark 3.2.0 onward?

Thanks

On Tue, Mar 7, 2023 at 8:10 PM Russell Spitzer <[email protected]>
wrote:

> This is no longer accurate, since we now have a "fan-out" writer for
> Spark. The original idea was that it is far more efficient to keep a
> single file handle open at a time and write to it than to open a new
> file handle every time the same Spark task finds a new partition to
> write to. The fanout writer performs the write by simply opening a handle
> each time the writer sees a new partition.
>
> That said, this is a required local sort for the default writer. For the
> best performance, in the sense of producing as few files as possible,
> write distribution mode "hash" forces a real shuffle but eliminates this
> issue by making sure each Spark task writes to a single partition, or a
> single set of partitions, in order. We need to update this document to
> cover distribution modes, especially since hash will be the new default
> soon and this information is basically for manual tuning only.
>
> If your data is already organized the way you want, setting the
> distribution mode to "none" avoids this shuffle. If you don't mind having
> multiple file handles open at the same time, you can set the fanout writer
> option. With "none" and the fanout writer you basically write in the
> fastest way possible, at the expense of memory at write time and possibly
> many files if your data isn't organized.
>
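
The file-handle trade-off described in the reply above can be illustrated with a small toy model. This is only a sketch, not Iceberg's writer code: `clustered_writer` and `fanout_writer` are hypothetical names, rows are plain `(partition, value)` tuples, and "files" are just counters. It assumes the default writer keeps one handle open and cannot reopen a partition it has already closed, which is why its input must arrive grouped by partition.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Row = Tuple[str, Any]  # (partition key, payload); hypothetical toy row shape


def clustered_writer(rows: Iterable[Row]) -> Tuple[int, int]:
    """Toy model of a writer that keeps one file handle open at a time.

    Returns (files_written, peak_open_handles). Raises ValueError if a
    partition shows up again after its file was closed, i.e. the input
    was not grouped (sorted) by partition.
    """
    closed: Set[str] = set()
    current = None
    files = 0
    for partition, _value in rows:
        if partition != current:
            if partition in closed:
                raise ValueError(
                    f"partition {partition!r} reappeared after its file was closed"
                )
            if current is not None:
                closed.add(current)  # close the previous partition's file
            current = partition      # open exactly one new file
            files += 1
    return files, (1 if files else 0)


def fanout_writer(rows: Iterable[Row]) -> Tuple[int, int]:
    """Toy model of a fanout writer: one handle per partition, all kept open."""
    open_handles: Dict[str, List[Any]] = {}
    for partition, value in rows:
        open_handles.setdefault(partition, []).append(value)
    # Every handle stays open until the task finishes, so the peak equals the total.
    return len(open_handles), len(open_handles)
```

With unsorted input such as `[("a", 1), ("b", 2), ("a", 3)]`, the toy clustered writer fails while the toy fanout writer succeeds with two handles open at once; sorting the rows first lets the clustered writer finish with a single open handle.
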
> On Tue, Mar 7, 2023 at 9:46 PM Manu Zhang <[email protected]> wrote:
>
>> Hi all,
>>
>> As per
>> https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables,
>> sort is required for Spark writing to a partitioned table. Does anyone know
>> the reason behind it? If this is to avoid creating too many small files,
>> isn't shuffle/repartition sufficient?
>>
>> Thanks,
>> Manu
>>
>>
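
For reference, the distribution-mode and fanout settings mentioned in the quoted reply could be applied roughly as follows. This is a hedged sketch: the helper `set_properties_sql` and the table name `db.events` are hypothetical, and the property names `write.distribution-mode` and `write.spark.fanout.enabled` should be checked against the Iceberg documentation for the version in use. The helper only builds the SQL text; running it needs an existing SparkSession, shown in the comments.

```python
from typing import Dict


def set_properties_sql(table: str, props: Dict[str, str]) -> str:
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in sorted(props.items()))
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


# Shuffle by partition so each task writes a single set of partitions.
HASH_DISTRIBUTION = {"write.distribution-mode": "hash"}

# Skip the shuffle and keep one handle open per partition seen by a task.
NO_SHUFFLE_FANOUT = {
    "write.distribution-mode": "none",
    "write.spark.fanout.enabled": "true",
}

# Usage with a live SparkSession named `spark` (not created here):
#   spark.sql(set_properties_sql("db.events", HASH_DISTRIBUTION))
#   spark.sql(set_properties_sql("db.events", NO_SHUFFLE_FANOUT))
```

The second combination matches the "fastest way possible" case in the reply: no shuffle and no sort, paid for with more memory at write time and possibly many files.
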
