> A new file slice (empty parquet) is indeed generated for every file group in a partition.

We could just reuse the existing file groups, right? That is probably a bit hacky...
> we can encode some MAGIC in the write-token component for Hudi readers to skip these files so that they can be safely removed.

This kind of MAGIC worries me :) .. if it comes to that, I suggest we get a version of metadata management along the lines of RFC-15/timeline server going before implementing this.

On Thu, Apr 16, 2020 at 10:55 AM vbal...@apache.org <vbal...@apache.org> wrote:

> Satish,
> Thanks for the proposal. I think an RFC would be useful here. Let me know
> your thoughts. It would be good to nail down other details, like
> whether/how to deal with external index management with this API.
> Thanks,
> Balaji.V
>
> On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> <v.bal...@ymail.com.invalid> wrote:
>
> +1 from me. This is a really cool feature.
> Yes, a new file slice (empty parquet) is indeed generated for every file
> group in a partition.
> Regarding cleaning these "empty" file slices eventually via the cleaner
> (to avoid cases where too many of them are lying around) in a safe way:
> we can encode some MAGIC in the write-token component for Hudi readers to
> skip these files so that they can be safely removed.
> For metadata management, I think it would be useful to distinguish between
> this API and other insert APIs. At the very least, we would need a
> different operation type, which can be achieved with the same API (with
> flags).
> Balaji.V
>
> On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> <vin...@apache.org> wrote:
>
> Hi Satish,
>
> Thanks for starting this. Your use cases do sound very valuable to
> support, so +1 from me.
>
> IIUC, you are implementing a partition-level overwrite, where existing
> file groups will be retained, but instead of merging, you will just reuse
> the file names and write the incoming records into new file slices?
> You probably already thought of this, but one thing to watch out for:
> we should generate a new file slice for every file group in a partition.
> Otherwise, old data will be visible to queries.
>
> If so, that makes sense to me. We can discuss more whether we can extend
> the bulk_insert() API with additional flags instead of adding a new
> insertOverwrite() API.
>
> Others, thoughts?
>
> Thanks
> Vinoth
>
> On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha
> <satishko...@uber.com.invalid> wrote:
>
> > Hello
> >
> > I want to discuss adding a new high-level API 'insertOverwrite' on
> > HoodieWriteClient. This API can be used to:
> >
> > - Overwrite specific partitions with new records.
> >   Example: a partition has 'x' records. If insert overwrite is done with
> >   'y' records on that partition, the partition will have just the 'y'
> >   records (as opposed to 'x union y' with upsert).
> > - Overwrite the entire table with new records, i.e. overwrite all
> >   partitions in the table.
> >
> > Use cases:
> >
> > - Tables where the majority of records change every cycle, so it is
> >   likely more efficient to write new data than to do upserts.
> > - Operational tasks to fix a specific corrupted partition. We can do
> >   'insert overwrite' on that partition with records from the source,
> >   which can be much faster than restore and replay for some data
> >   sources.
> >
> > The functionality will be similar to the Hive definition of 'insert
> > overwrite', but doing this in Hoodie will provide better isolation
> > between writers and readers. I can share possible implementation choices
> > and some nuances if the community thinks this is a useful feature to
> > add.
> >
> > Appreciate any feedback.
> >
> > Thanks
> >
> > Satish
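[Editor's note] The partition-level semantics Satish proposes ('y' records replace 'x', rather than merging into 'x union y') can be illustrated with a toy model. This is not Hudi code; plain Python dicts stand in for keyed records, and the function names are illustrative only:

```python
# Toy model contrasting upsert vs. insertOverwrite semantics on one partition.
# Keys model Hudi record keys; values model the record payloads.

def upsert(partition, incoming):
    """Merge incoming records into the partition by key ('x union y')."""
    merged = dict(partition)
    merged.update(incoming)  # incoming wins on key collisions
    return merged

def insert_overwrite(partition, incoming):
    """Replace the partition's contents entirely with the incoming records."""
    return dict(incoming)

existing = {"k1": "a", "k2": "b", "k3": "c"}   # partition holds 'x' records
incoming = {"k2": "B", "k4": "d"}              # 'y' records arrive

print(upsert(existing, incoming))
# {'k1': 'a', 'k2': 'B', 'k3': 'c', 'k4': 'd'}  -> 'x union y'
print(insert_overwrite(existing, incoming))
# {'k2': 'B', 'k4': 'd'}                        -> just 'y'
```

The operational use case follows directly: to repair a corrupted partition, overwriting with clean source records avoids replaying history.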
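[Editor's note] Vinoth's caveat about generating a new file slice for every file group can also be sketched. This is a simplified model of reader behavior, not Hudi internals: a reader resolves each file group to its latest file slice, so any group that does not receive a new (possibly empty) slice keeps exposing its old data.

```python
# Toy model: each file group is a list of file slices (newest last),
# and each slice is a list of records. Readers see the latest slice
# of every file group in the partition.

def read_partition(file_groups):
    """Return the visible records: the latest slice of each file group."""
    visible = []
    for slices in file_groups.values():
        visible.extend(slices[-1])  # latest slice wins
    return visible

# A partition with two file groups, each holding one old slice.
file_groups = {
    "fg1": [["old1", "old2"]],
    "fg2": [["old3"]],
}

# An overwrite that writes a new slice only to fg1 leaks fg2's old data:
file_groups["fg1"].append(["new1"])
print(sorted(read_partition(file_groups)))  # ['new1', 'old3']  <- stale data visible

# Writing an empty slice to fg2 as well hides the stale records:
file_groups["fg2"].append([])
print(sorted(read_partition(file_groups)))  # ['new1']
```

This is why the thread discusses both the empty parquet files and how the cleaner might eventually remove them safely.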