+1, thanks for starting this effort Satish! -Nishith
On Fri, Apr 17, 2020 at 2:26 PM Vinoth Chandar <vin...@apache.org> wrote:
> Thanks Satish!

On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha <satishko...@uber.com.invalid> wrote:
> Thanks for the interesting discussion. I will start an RFC as suggested and
> discuss the points brought up in this thread.

On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan <v.bal...@ymail.com.invalid> wrote:
> > A new file slice (empty parquet) is indeed generated for every file group
> > in a partition.
> >> we could just reuse the existing file groups right? probably is bit hacky...
> Sorry for the confusion. I meant to say the empty file slice is only for
> file-groups which do not have any incoming records assigned. This is for
> the case when we have fewer incoming records than would fill all existing
> file-groups. Existing file groups will be reused.
> Agree on the MAGIC part.
> Balaji.V

On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar <vin...@apache.org> wrote:
> > A new file slice (empty parquet) is indeed generated for every file group
> > in a partition.
> We could just reuse the existing file groups, right? That is probably a bit
> hacky...
>
> > we can encode some MAGIC in the write-token component for Hudi readers to
> > skip these files so that they can be safely removed.
> This kind of MAGIC worries me :) .. If it comes to that, I suggest we get a
> version of metadata management along the lines of RFC-15/timeline server
> going before implementing this.

On Thu, Apr 16, 2020 at 10:55 AM vbal...@apache.org <vbal...@apache.org> wrote:
> Satish,
> Thanks for the proposal. I think an RFC would be useful here. Let me know
> your thoughts. It would be good to nail down other details, like whether/how
> to deal with external index management with this API.
> Thanks,
> Balaji.V

On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan <v.bal...@ymail.com.invalid> wrote:
> +1 from me. This is a really cool feature.
> Yes, a new file slice (empty parquet) is indeed generated for every file
> group in a partition.
> Regarding cleaning these "empty" file slices eventually by the cleaner (to
> avoid cases where too many of them are lying around) in a safe way, we can
> encode some MAGIC in the write-token component for Hudi readers to skip
> these files so that they can be safely removed.
> For metadata management, I think it would be useful to distinguish between
> this API and other insert APIs. At the very least, we would need a
> different operation type, which can be achieved with the same API (with
> flags).
> Balaji.V

On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar <vin...@apache.org> wrote:
> Hi Satish,
>
> Thanks for starting this. Your use-cases do sound very valuable to support,
> so +1 from me.
>
> IIUC, you are implementing a partition-level overwrite, where existing
> file groups will be retained, but instead of merging, you will just reuse
> the file names and write the incoming records into new file slices?
> You probably already thought of this, but one thing to watch out for is:
> we should generate a new file slice for every file group in a partition.
> Otherwise, old data will be visible to queries.
>
> If so, that makes sense to me. We can discuss more on whether we can
> extend the bulk_insert() API with additional flags instead of a new
> insertOverwrite() API.
>
> Others, thoughts?
>
> Thanks
> Vinoth

On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <satishko...@uber.com.invalid> wrote:
> Hello
>
> I want to discuss adding a new high-level API 'insertOverwrite' on
> HoodieWriteClient. This API can be used to:
>
> - Overwrite specific partitions with new records
>   - Example: a partition has 'x' records. If insert overwrite is done
>     with 'y' records on that partition, the partition will have just 'y'
>     records (as opposed to 'x union y' with upsert)
> - Overwrite the entire table with new records
>   - Overwrite all partitions in the table
>
> Use cases:
>
> - Tables where the majority of records change every cycle, so it is
>   likely more efficient to write new data instead of doing upserts.
> - Operational tasks to fix a specific corrupted partition. We can do
>   'insert overwrite' on that partition with records from the source. This
>   can be much faster than restore and replay for some data sources.
>
> The functionality will be similar to the Hive definition of 'insert
> overwrite'. But doing this in Hoodie will provide better isolation
> between writers and readers. I can share possible implementation choices
> and some nuances if the community thinks this is a useful feature to add.
>
> Appreciate any feedback.
>
> Thanks
> Satish
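[Editor's note] The record-level contrast Satish draws between upsert and the proposed insertOverwrite can be sketched with a small simulation. This is illustrative Python, not Hudi code; the function names and the dict-based partition model are hypothetical, chosen only to show the 'x union y' vs 'just y' outcome described in the thread.

```python
# Illustrative sketch (not Hudi code): a partition is modeled as a dict
# mapping record key -> value, and the two write modes are compared.

def upsert(partition, incoming):
    """Merge incoming records into the partition: result is 'x union y',
    with incoming records winning on key collisions."""
    merged = dict(partition)
    merged.update(incoming)
    return merged

def insert_overwrite(partition, incoming):
    """Replace the partition's contents entirely: result is 'just y'."""
    return dict(incoming)

existing = {"k1": "a", "k2": "b", "k3": "c"}   # partition has 'x' records
incoming = {"k2": "B", "k4": "d"}              # new batch 'y'

assert upsert(existing, incoming) == {"k1": "a", "k2": "B", "k3": "c", "k4": "d"}
assert insert_overwrite(existing, incoming) == {"k2": "B", "k4": "d"}
```

The second assertion is the operational-fix use case above: after an insert overwrite, only the incoming batch remains in the partition, regardless of what was there before.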
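[Editor's note] The file-slice invariant Vinoth and Balaji converge on — every existing file group must receive a new file slice on an insert-overwrite commit, an empty one if no records were assigned to it, so readers of the latest slices never see old data — can also be sketched. This is a toy model, not Hudi internals; the function and the slice representation are hypothetical.

```python
# Illustrative sketch (not Hudi internals) of the invariant discussed above:
# an insert-overwrite commit writes a NEW file slice for EVERY file group in
# the partition, empty slices included, so stale slices are masked.

def insert_overwrite_slices(file_groups, assignments, commit_time):
    """file_groups: ids of the partition's existing file groups.
    assignments: file-group id -> list of incoming records assigned to it.
    Returns the new file slice written at commit_time for each group."""
    slices = {}
    for fg in file_groups:
        records = assignments.get(fg, [])
        # Groups with no incoming records still get an (empty) slice;
        # otherwise their previous slice would stay visible to queries.
        slices[fg] = {"commit": commit_time, "records": records}
    return slices

groups = ["fg-1", "fg-2", "fg-3"]
new_batch = {"fg-1": ["r1", "r2"]}   # fewer records than file groups

slices = insert_overwrite_slices(groups, new_batch, "20200417120000")
assert set(slices) == set(groups)          # every group has a fresh slice
assert slices["fg-2"]["records"] == []     # empty slice masks old data
```

The empty slices here are exactly the ones Balaji proposes to tag (via a write-token MAGIC or, per Vinoth, via RFC-15-style metadata) so the cleaner can eventually remove them safely.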