Satish,
Thanks for the proposal. I think an RFC would be useful here. Let me know your 
thoughts. It would be good to nail down other details, such as whether and how 
to deal with external index management with this API.
Thanks,
Balaji.V
    On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan 
<[email protected]> wrote:  
 
 
+1 from me. This is a really cool feature. 
Yes, a new file slice (empty parquet) is indeed generated for every file group 
in a partition. 
To let the cleaner eventually remove these "empty" file slices in a safe way 
(and avoid too many of them lying around), we can encode some MAGIC in the 
write-token component so that Hudi readers skip these files. They can then be 
safely removed. 
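
A minimal sketch of the reader-side skip, for illustration only. It assumes 
base file names of the form <fileId>_<writeToken>_<instantTime>.parquet, and 
the marker value EMPTY_SLICE_TOKEN and all function names are hypothetical, 
not existing Hudi code:

```python
# Hypothetical marker value; a real one would have to be agreed on in the RFC.
EMPTY_SLICE_TOKEN = "EMPTY"


def parse_base_file_name(name):
    """Split '<fileId>_<writeToken>_<instantTime>.<ext>' into its three parts.

    The naming layout is an assumption made for this sketch.
    """
    stem, _, _ext = name.rpartition(".")
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time


def is_empty_marker(name):
    """True if the write-token component carries the magic marker."""
    return parse_base_file_name(name)[1] == EMPTY_SLICE_TOKEN


def readable_files(names):
    """Files a reader would actually open, skipping empty marker slices."""
    return [n for n in names if not is_empty_marker(n)]
```

With this, a reader listing "fg1_1-0-1_002.parquet" and "fg2_EMPTY_002.parquet" 
would open only the first file.
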
For metadata management, I think it would be useful to distinguish between this 
API and the other insert APIs. At the very least, we would need a different 
operation type, which could be achieved with the same API (with flags).
Balaji.V

    On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar 
<[email protected]> wrote:  
 
 Hi Satish,

Thanks for starting this. Your use cases do sound very valuable to
support, so +1 from me.

IIUC, you are implementing a partition level overwrite, where existing
filegroups will be retained, but instead of merging, you will just reuse
the file names and write the incoming records into new file slices?
You probably already thought of this, but one thing to watch out for is that
we should generate a new file slice for every file group in a partition.
Otherwise, old data will be visible to queries.

If so, that makes sense to me. We can discuss further whether we can
extend the bulk_insert() API with additional flags instead of adding a new
insertOverwrite() API.
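
The point above about writing a new file slice for every file group can be
illustrated with a toy model. This is only a sketch under simplifying
assumptions: readers see the latest slice of each file group, and the
Partition class and both functions are hypothetical names, not Hudi APIs.

```python
from collections import defaultdict


class Partition:
    """Toy model: file group id -> list of (instant_time, records) slices."""

    def __init__(self):
        self.file_groups = defaultdict(list)

    def write_slice(self, file_id, instant_time, records):
        self.file_groups[file_id].append((instant_time, list(records)))

    def read(self):
        # A query sees only the latest slice of each file group.
        visible = []
        for slices in self.file_groups.values():
            _, records = max(slices, key=lambda s: s[0])
            visible.extend(records)
        return sorted(visible)


def insert_overwrite(partition, instant_time, records_by_file_id):
    """Write a new slice for EVERY file group, empty where nothing is routed."""
    for file_id in set(partition.file_groups) | set(records_by_file_id):
        partition.write_slice(
            file_id, instant_time, records_by_file_id.get(file_id, [])
        )


def partial_overwrite(partition, instant_time, records_by_file_id):
    """Incorrect variant: only touches file groups that receive records."""
    for file_id, records in records_by_file_id.items():
        partition.write_slice(file_id, instant_time, records)
```

If a partition has file groups fg1 and fg2 and the incoming records all land
in fg1, insert_overwrite leaves only the new records visible, whereas
partial_overwrite still exposes the old contents of fg2.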

Others, thoughts?

Thanks
Vinoth

On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <[email protected]>
wrote:

> Hello
>
> I want to discuss adding a new high level API 'insertOverwrite' on
> HoodieWriteClient. This API can be used to
>
>    - Overwrite specific partitions with new records
>
>      Example: a partition has 'x' records. If insert overwrite is done with
>      'y' records on that partition, the partition will have just 'y' records
>      (as opposed to 'x union y' with upsert).
>
>    - Overwrite the entire table with new records
>
>      Overwrite all partitions in the table
>
> Use cases:
>
> - Tables where the majority of records change every cycle, so it is likely
> more efficient to write new data instead of doing upserts.
>
> -  Operational tasks to fix a specific corrupted partition. We can do
> 'insert overwrite'  on that partition with records from the source. This
> can be much faster than restore and replay for some data sources.
>
> The functionality will be similar to the Hive definition of 'insert overwrite'.
> But doing this in Hoodie will provide better isolation between writers and
> readers. I can share possible implementation choices and some nuances if
> the community thinks this is a useful feature to add.
>
>
> Appreciate any feedback.
>
>
> Thanks
>
> Satish
>
    
