Re: Why is sort required for Spark writing to partitioned table

Anton Okolnychyi Tue, 25 Apr 2023 11:47:01 -0700

We have implemented this natively in Spark, and explicit sorts are no longer 
required. Iceberg takes both the partitioning and the sort key of the table 
into account when it requests a distribution and ordering from Spark. This 
should be supported for both batch and micro-batch writes.
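
For illustration, here is a minimal PySpark-style sketch of the difference. 
The function names are made up for this example, and the table and column 
names a caller would pass in are hypothetical; only sortWithinPartitions, 
writeTo and append are real DataFrame calls. Treat it as a sketch of the 
idea, not as the documented recipe.

```python
def append_with_explicit_sort(df, table, *partition_cols):
    """Older pattern: sort rows inside each task by hand before the write."""
    df.sortWithinPartitions(*partition_cols).writeTo(table).append()


def append_plain(df, table):
    """Newer pattern: no explicit sort. Iceberg is expected to request the
    distribution and ordering it needs from Spark as part of the write."""
    df.writeTo(table).append()
```

With a real DataFrame this would be called as, say, 
append_plain(df, "catalog.db.events"), where the table name is a placeholder.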


- Anton

> On Apr 25, 2023, at 11:05 AM, Pucheng Yang <[email protected]> 
> wrote:
> 
> Hi to confirm,
> 
> In the doc, 
> https://iceberg.apache.org/docs/1.0.0/spark-writes/#writing-to-partitioned-tables,
> it says "Explicit sort is necessary because Spark doesn’t allow Iceberg to 
> request a sort before writing as of Spark 3.0. SPARK-23889 
> <https://issues.apache.org/jira/browse/SPARK-23889> is filed to enable 
> Iceberg to require specific distribution & sort order to Spark."
> 
> I found that all relevant JIRAs under SPARK-23889 
> <https://issues.apache.org/jira/browse/SPARK-23889> are resolved in 
> Spark 3.2.0. Does that mean an explicit sort is no longer needed from 
> Spark 3.2.0 onward?
> 
> Thanks
> 
> On Tue, Mar 7, 2023 at 8:10 PM Russell Spitzer <[email protected]> wrote:
> This is no longer accurate, since we now have a "fan-out" writer for 
> Spark. The original idea, though, is that it is far more efficient to open 
> a single file handle at a time and write to it than to open a new file 
> handle every time we find a new partition to write to within the same 
> Spark task. The fanout writer does the latter: it simply opens a new 
> handle each time the writer sees a new partition.
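
A toy Python model of the two writer behaviours described above may help. 
It is not Iceberg code: clustered_write and fanout_write are invented names, 
rows are (partition, value) pairs, and "peak handles" is just a count the 
model keeps.

```python
def clustered_write(rows):
    """Default-writer model: one open handle at a time.
    Input must arrive clustered by partition, otherwise the write fails."""
    files = {}        # partition -> values written to that partition's file
    closed = set()    # partitions whose file has already been closed
    current = None
    for partition, value in rows:
        if partition != current:
            if partition in closed:
                raise ValueError(f"rows not clustered: {partition!r} reappeared")
            if current is not None:
                closed.add(current)   # close the previous handle first
            current = partition
            files[partition] = []
        files[partition].append(value)
    return files, (1 if files else 0)  # never more than one handle open


def fanout_write(rows):
    """Fanout model: open a handle per new partition and keep them all open."""
    files = {}
    for partition, value in rows:
        files.setdefault(partition, []).append(value)
    return files, len(files)           # peak handles = distinct partitions seen
```

Feeding unsorted rows such as [("a", 1), ("b", 2), ("a", 3)] to 
clustered_write raises, which is why the local sort was required; 
fanout_write accepts them but holds two handles open at once.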
> 
> That said, this is a local sort required by the default writer. For the 
> best performance, in the sense of producing as few files as possible, write 
> distribution mode "hash" will force a real shuffle but eliminate this issue 
> by making sure each Spark task writes to a single partition, or a single 
> set of partitions, in order. We need to update this document to cover 
> distribution modes, especially since hash will soon be the new default and 
> this information is basically for manual tuning only.
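
To make the file-count argument concrete, here is a small Python model. It 
assumes one data file per (task, partition) pair and uses round-robin 
placement as a stand-in for unorganized input; the function names and the 
CRC-based hash are illustrative choices, not what Spark or Iceberg use 
internally.

```python
import zlib


def distribute_none(rows, num_tasks):
    """No shuffle: rows stay wherever they are (modelled as round-robin)."""
    tasks = [[] for _ in range(num_tasks)]
    for i, row in enumerate(rows):
        tasks[i % num_tasks].append(row)
    return tasks


def distribute_hash(rows, num_tasks):
    """Hash shuffle on the partition value, then order rows within each task."""
    tasks = [[] for _ in range(num_tasks)]
    for row in rows:
        partition = row[0]
        tasks[zlib.crc32(partition.encode()) % num_tasks].append(row)
    return [sorted(task) for task in tasks]


def files_written(tasks):
    """Each task writes one file per distinct partition it holds."""
    return sum(len({partition for partition, _ in task}) for task in tasks)
```

With three partitions spread evenly over four tasks, the round-robin layout 
produces twelve files in this model, while the hash layout produces three, 
because every partition lands in exactly one task.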
> 
> If your data is already organized the way you want, setting the 
> distribution mode to "none" will avoid this shuffle. If you don't mind 
> multiple file handles being open at the same time, you can set the fanout 
> writer option. With "none" and the fanout writer you basically write in the 
> fastest way possible, at the expense of memory at write time and possibly 
> many files if your data isn't organized.
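
As a rough sketch of how the two settings above could be applied, the 
helper below builds the SQL statements as strings. The helper name is made 
up, and the property names 'write.distribution-mode' and 
'write.spark.fanout.enabled' are what I understand the Iceberg table 
properties to be; please check them against the docs for your version.

```python
def fastest_write_statements(table):
    """Return ALTER TABLE statements for distribution mode 'none' plus the
    fanout writer. Property names are assumptions to verify, see above."""
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        "('write.distribution-mode' = 'none')",
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        "('write.spark.fanout.enabled' = 'true')",
    ]
```

Each returned string could then be passed to spark.sql(...) in a session 
that has the Iceberg catalog configured.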
> 
> On Tue, Mar 7, 2023 at 9:46 PM Manu Zhang <[email protected]> wrote:
> Hi all,
> 
> As per 
> https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables,
> a sort is required when Spark writes to a partitioned table. Does anyone know 
> the reason behind it? If this is to avoid creating too many small files, 
> isn't shuffle/repartition sufficient?
> 
> Thanks,
> Manu
>
