Saisai,

Iceberg's behavior matches Hive's and Spark's behavior when using dynamic
overwrite mode.
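For illustration, dynamic overwrite semantics can be sketched as a plain-Python simulation (the dict-based "table" and function names here are hypothetical, not any Spark or Iceberg API): only the partitions that appear in the new data are replaced, and all other partitions are left untouched.

```python
# Simulate dynamic partition overwrite: replace only the partitions
# that appear in the incoming data; leave every other partition alone.

def dynamic_overwrite(table, new_rows, partition_key):
    """table: dict mapping partition value -> list of row dicts.
    new_rows: list of row dicts, each containing partition_key."""
    # Group incoming rows by partition value.
    incoming = {}
    for row in new_rows:
        incoming.setdefault(row[partition_key], []).append(row)
    # Replace only the partitions present in the new data.
    for part, rows in incoming.items():
        table[part] = rows
    return table

table = {
    "2019-11-20": [{"day": "2019-11-20", "n": 1}],
    "2019-11-21": [{"day": "2019-11-21", "n": 2}],
}
dynamic_overwrite(table, [{"day": "2019-11-21", "n": 99}], "day")
# "2019-11-20" is untouched; "2019-11-21" was replaced
```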

Spark does not specify the correct behavior -- it varies by source. In
addition, it isn't possible for a v2 source in 2.4 to implement the static
overwrite mode that is Spark's default: the source is passed only rows,
not the static partition values.

This is fixed in 3.0, where Spark chooses the behavior and correctly
configures the source with either a dynamic overwrite or an overwrite
using an expression.
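The overwrite-by-expression case can likewise be sketched in plain Python (a hypothetical helper, not the actual DataSourceV2 API): rows matching the filter expression are deleted, then the new rows are appended.

```python
# Simulate overwrite using an expression: delete every existing row
# that matches the predicate, then append the new rows.

def overwrite_where(rows, new_rows, predicate):
    """rows, new_rows: lists of row dicts; predicate: row -> bool."""
    kept = [r for r in rows if not predicate(r)]
    return kept + new_rows

data = [{"day": "2019-11-20", "n": 1},
        {"day": "2019-11-21", "n": 2}]
result = overwrite_where(
    data,
    [{"day": "2019-11-21", "n": 99}],
    lambda r: r["day"] == "2019-11-21",
)
```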

On Thu, Nov 21, 2019 at 11:33 PM Saisai Shao <[email protected]> wrote:

> Hi Team,
>
> I found that Iceberg's "overwrite" is different from Spark's built-in
> sources like Parquet. The "overwrite" semantics in Iceberg seem more like
> "upsert": partitions that the new data does not contain are not deleted.
>
> I would like to know the purpose of this design choice. Also, if I
> want to achieve Spark Parquet's "overwrite" semantics, how would I
> do that?
>
> Warning
>
> *Spark does not define the behavior of DataFrame overwrite*. Like most
> sources, Iceberg will dynamically overwrite a partition when the dataframe
> contains rows in that partition. Unpartitioned tables are completely
> overwritten.
>
> Best regards,
> Saisai
>


-- 
Ryan Blue
Software Engineer
Netflix
