Thanks Ryan for your response. Let me spend more time looking into the Spark
3.0 overwrite behavior.
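To make sure I understand the difference, here is a small sketch of the two
semantics as I read them (plain Python modeling a partitioned table, not the
Spark API): static overwrite replaces the whole table (or the partitions
matched by a static partition clause), while dynamic overwrite replaces only
the partitions that appear in the new data.

```python
# Illustrative sketch of overwrite semantics, not Spark code.
# A "table" is modeled as {partition_value: [rows]}.

def static_overwrite(table, new_data):
    # Spark's default for file sources with no static partition filter:
    # the entire table is replaced by the new data.
    return dict(new_data)

def dynamic_overwrite(table, new_data):
    # Hive/Iceberg dynamic mode: only partitions present in the new data
    # are replaced; all other partitions are left untouched.
    result = dict(table)
    result.update(new_data)
    return result

table = {"2019-11-20": ["a"], "2019-11-21": ["b"]}
new = {"2019-11-21": ["c"]}

print(static_overwrite(table, new))   # only the 2019-11-21 partition remains
print(dynamic_overwrite(table, new))  # 2019-11-20 kept, 2019-11-21 replaced
```

If I understand correctly, for Spark's built-in file sources this choice is
controlled by spark.sql.sources.partitionOverwriteMode ("static" or
"dynamic"), with "static" being the default.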

Best Regards
Saisai

Ryan Blue <[email protected]> wrote on Saturday, November 23, 2019 at 1:08 AM:

> Saisai,
>
> Iceberg's behavior matches Hive's and Spark's behavior when using dynamic
> overwrite mode.
>
> Spark does not specify the correct behavior -- it varies by source. In
> addition, it isn't possible for a v2 source in 2.4 to implement the static
> overwrite mode that is Spark's default. The problem is that the source is
> not passed the static partition values, only rows.
>
> This is fixed in 3.0 because Spark will choose its behavior and correctly
> configure the source with a dynamic overwrite or an overwrite using an
> expression.
>
> On Thu, Nov 21, 2019 at 11:33 PM Saisai Shao <[email protected]>
> wrote:
>
>> Hi Team,
>>
>> I found that Iceberg's "overwrite" behavior is different from Spark's
>> built-in sources like Parquet. The "overwrite" semantics in Iceberg seem
>> more like "upsert": partitions absent from the new data are not deleted.
>>
>> I would like to know the purpose of this design choice. Also, if I want
>> Spark Parquet's "overwrite" semantics, how would I achieve that?
>>
>> Warning
>>
>> *Spark does not define the behavior of DataFrame overwrite*. Like most
>> sources, Iceberg will dynamically overwrite partitions when the dataframe
>> contains rows in a partition. Unpartitioned tables are completely
>> overwritten.
>>
>> Best regards,
>> Saisai
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
