Hi Satish,

Thanks for starting this. Your use cases sound very valuable to
support, so +1 from me.

IIUC, you are implementing a partition-level overwrite, where existing
file groups will be retained, but instead of merging, you will just reuse
the file names and write the incoming records into new file slices?
You probably already thought of this, but one thing to watch out for:
we should generate a new file slice for every file group in the partition;
otherwise, old data will remain visible to queries.
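To make that invariant concrete, here is a toy model (hypothetical names and
structures, not actual Hudi classes): readers see only the latest file slice of
each file group, so insert-overwrite must roll a new slice, possibly an empty
one, for every file group in the partition, not just the ones that receive new
records.

```java
import java.util.*;

public class OverwriteSliceDemo {
    // Toy model: a partition maps file-group id -> list of file slices (newest last).
    static Map<String, List<List<String>>> partition = new HashMap<>();

    // Readers see only the latest slice of each file group.
    static List<String> readPartition() {
        List<String> visible = new ArrayList<>();
        for (List<List<String>> slices : partition.values()) {
            visible.addAll(slices.get(slices.size() - 1));
        }
        return visible;
    }

    // Insert-overwrite: write incoming records into new slices, and crucially
    // roll an empty slice for every file group that received no new records,
    // so its old slice stops being the latest one.
    static void insertOverwrite(Map<String, List<String>> incoming) {
        for (Map.Entry<String, List<List<String>>> e : partition.entrySet()) {
            List<String> newRecords =
                incoming.getOrDefault(e.getKey(), Collections.emptyList());
            e.getValue().add(newRecords);
        }
    }

    public static void main(String[] args) {
        partition.put("fg-1", new ArrayList<>(List.of(List.of("a", "b"))));
        partition.put("fg-2", new ArrayList<>(List.of(List.of("c"))));

        // Overwrite brings records only for fg-1; fg-2 still gets a new (empty) slice.
        insertOverwrite(Map.of("fg-1", List.of("x", "y")));

        List<String> visible = readPartition();
        Collections.sort(visible);
        System.out.println(visible); // old record "c" from fg-2 is no longer visible
    }
}
```

If the loop skipped file groups with no incoming records, fg-2's old slice would
stay latest and "c" would leak into query results.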

If so, that makes sense to me. We can discuss further whether we should
extend the bulk_insert() API with additional flags instead of adding a new
insertOverwrite() API.
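For the API discussion, the two options might look roughly like this (all names
here are hypothetical, just to frame the trade-off, not committed code); either
way both shapes would funnel into the same write path with overwrite semantics:

```java
import java.util.*;

// Hypothetical sketch of the two API shapes under discussion.
public class ApiSketch {
    enum WriteMode { INSERT, OVERWRITE }

    // Option A: a dedicated entry point whose name states the intent.
    static List<String> insertOverwrite(List<String> records) {
        return write(records, WriteMode.OVERWRITE);
    }

    // Option B: extend bulk_insert with an extra flag instead of a new API.
    static List<String> bulkInsert(List<String> records, WriteMode mode) {
        return write(records, mode);
    }

    // Stand-in for the actual write path; just tags each record with the mode.
    static List<String> write(List<String> records, WriteMode mode) {
        List<String> statuses = new ArrayList<>();
        for (String r : records) statuses.add(mode + ":" + r);
        return statuses;
    }

    public static void main(String[] args) {
        System.out.println(insertOverwrite(List.of("r1")));
        System.out.println(bulkInsert(List.of("r1"), WriteMode.OVERWRITE));
    }
}
```

The flag approach keeps the surface API smaller; the dedicated method makes the
destructive semantics harder to trigger by accident.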

Others, thoughts?

Thanks
Vinoth

On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <satishko...@uber.com.invalid>
wrote:

> Hello
>
> I want to discuss adding a new high level API 'insertOverwrite' on
> HoodieWriteClient. This API can be used to
>
>    -
>
>    Overwrite specific partitions with new records
>    -
>
>       Example: partition has  'x' records. If insert overwrite is done with
>       'y' records on that partition, the partition will have just 'y'
> records (as
>       opposed to  'x union y' with upsert)
>       -
>
>    Overwrite entire table with new records
>    -
>
>       Overwrite all partitions in the table
>
> Usecases:
>
> - Tables where the majority of records change every cycle, so it is likely
> more efficient to write new data than to do upserts.
>
> - Operational tasks to fix a specific corrupted partition. We can do
> 'insert overwrite' on that partition with records from the source. This
> can be much faster than restore and replay for some data sources.
>
> The functionality will be similar to Hive's definition of 'insert overwrite'.
> But doing this in Hoodie will provide better isolation between writers and
> readers. I can share possible implementation choices and some nuances if
> the community thinks this is a useful feature to add.
>
>
> Appreciate any feedback.
>
>
> Thanks
>
> Satish
>
