Hello

I want to discuss adding a new high-level API, 'insertOverwrite', on
HoodieWriteClient. This API can be used to:

   - Overwrite specific partitions with new records
      - Example: partition has 'x' records. If insert overwrite is done
        with 'y' records on that partition, the partition will have just
        'y' records (as opposed to 'x union y' with upsert)
   - Overwrite entire table with new records
      - Overwrite all partitions in the table
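
To make the intended semantics concrete, here is a minimal sketch in plain
Java. It does not use any Hudi classes: the table is modeled as an in-memory
map from partition path to records, and the names InsertOverwriteSketch,
upsert, insertOverwrite and insertOverwriteTable are hypothetical,
chosen only for illustration. It is not a proposed signature for
HoodieWriteClient. Dropping partitions that are absent from the input during
a whole-table overwrite is an assumption of the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

public class InsertOverwriteSketch {

    // Hypothetical model: partition path -> (record key -> record value).
    public static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> table = new TreeMap<>();
        Map<String, String> p1 = new TreeMap<>();
        p1.put("k1", "a");
        p1.put("k2", "b");
        Map<String, String> p2 = new TreeMap<>();
        p2.put("k9", "z");
        table.put("2020/01/01", p1);
        table.put("2020/01/02", p2);
        return table;
    }

    // Upsert: incoming records are merged with existing ones ('x union y').
    public static void upsert(Map<String, Map<String, String>> table,
                              Map<String, Map<String, String>> incoming) {
        incoming.forEach((partition, records) ->
            table.computeIfAbsent(partition, p -> new TreeMap<>()).putAll(records));
    }

    // Insert overwrite: every partition present in the input is replaced
    // (just 'y'); partitions not present in the input are left untouched.
    public static void insertOverwrite(Map<String, Map<String, String>> table,
                                       Map<String, Map<String, String>> incoming) {
        incoming.forEach((partition, records) ->
            table.put(partition, new TreeMap<>(records)));
    }

    // Whole-table overwrite. Assumption of this sketch: partitions that are
    // absent from the input are dropped.
    public static void insertOverwriteTable(Map<String, Map<String, String>> table,
                                            Map<String, Map<String, String>> incoming) {
        table.clear();
        insertOverwrite(table, incoming);
    }

    public static void main(String[] args) {
        Map<String, String> y = new TreeMap<>();
        y.put("k2", "B");
        y.put("k3", "c");
        Map<String, Map<String, String>> incoming = new TreeMap<>();
        incoming.put("2020/01/01", y);

        Map<String, Map<String, String>> upserted = sampleTable();
        upsert(upserted, incoming);
        System.out.println(upserted.get("2020/01/01").keySet()); // [k1, k2, k3]

        Map<String, Map<String, String>> overwritten = sampleTable();
        insertOverwrite(overwritten, incoming);
        System.out.println(overwritten.get("2020/01/01").keySet()); // [k2, k3]
        System.out.println(overwritten.get("2020/01/02").keySet()); // [k9]

        Map<String, Map<String, String>> replaced = sampleTable();
        insertOverwriteTable(replaced, incoming);
        System.out.println(replaced.keySet()); // [2020/01/01]
    }
}
```

The only difference between the upsert and the partition-level overwrite is
whether the existing records of a touched partition survive; untouched
partitions behave the same in both.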

Use cases:

- Tables where the majority of records change every cycle, so it is likely
more efficient to write new data than to do upserts.

- Operational tasks to fix a specific corrupted partition. We can do
'insert overwrite' on that partition with records from the source. This
can be much faster than restore and replay for some data sources.

The functionality will be similar to Hive's definition of 'insert overwrite',
but doing this in Hoodie will provide better isolation between writers and
readers. I can share possible implementation choices and some nuances if
the community thinks this is a useful feature to add.


Appreciate any feedback.


Thanks

Satish
