Thanks for the interesting discussion. I will start an RFC as suggested and discuss the points brought up in this thread.

On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan <[email protected]> wrote:

> >A new file slice (empty parquet) is indeed generated for every file
> >group in a partition.
> >> we could just reuse the existing file groups right? probably is a
> >> bit hacky...
> Sorry for the confusion. I meant to say the empty file slice is only
> for file-groups that do not have any incoming records assigned. This
> is for the case when we have fewer incoming records to fit into all
> existing file-groups. Existing file groups will be reused.
> Agree, on the magic part.
> Balaji.V
>
> On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar
> <[email protected]> wrote:
>
> >A new file slice (empty parquet) is indeed generated for every file
> >group in a partition.
> we could just reuse the existing file groups right? probably is a bit
> hacky...
>
> >we can encode some MAGIC in the write-token component for Hudi
> >readers to skip these files so that they can be safely removed.
> This kind of MAGIC worries me :) .. if it comes to that, I suggest,
> let's get a version of metadata management along lines of
> RFC-15/timeline server going before implementing this.
>
> On Thu, Apr 16, 2020 at 10:55 AM [email protected] <[email protected]>
> wrote:
>
> > Satish,
> > Thanks for the proposal. I think an RFC would be useful here. Let me
> > know your thoughts. It would be good to nail other details like
> > whether/how to deal with external index management with this API.
> > Thanks,
> > Balaji.V
> >
> > On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> > <[email protected]> wrote:
> >
> > +1 from me. This is a really cool feature.
> > Yes, a new file slice (empty parquet) is indeed generated for every
> > file group in a partition.
> > Regarding cleaning these "empty" file slices eventually by cleaner
> > (to avoid cases where there are too many of them lying around) in a
> > safe way, we can encode some MAGIC in the write-token component for
> > Hudi readers to skip these files so that they can be safely removed.
> > For metadata management, I think it would be useful to distinguish
> > between this API and other insert APIs. At the very least, we would
> > need a different operation type which can be achieved with same API
> > (with flags).
> > Balaji.V
> >
> > On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> > <[email protected]> wrote:
> >
> > Hi Satish,
> >
> > Thanks for starting this.. Your use-cases do sound very valuable to
> > support. So +1 from me.
> >
> > IIUC, you are implementing a partition level overwrite, where
> > existing filegroups will be retained, but instead of merging, you
> > will just reuse the file names and write the incoming records into
> > new file slices?
> > You probably already thought of this, but one thing to watch out for
> > is: we should generate a new file slice for every file group in a
> > partition. Otherwise, old data will be visible to queries.
> >
> > If so, that makes sense to me. We can discuss more on whether we can
> > extend the bulk_insert() API with additional flags instead of a new
> > insertOverwrite() API..
> >
> > Others, thoughts?
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <[email protected]>
> > wrote:
> >
> > > Hello
> > >
> > > I want to discuss adding a new high level API 'insertOverwrite' on
> > > HoodieWriteClient. This API can be used to
> > >
> > > - Overwrite specific partitions with new records
> > >   - Example: partition has 'x' records. If insert overwrite is
> > >     done with 'y' records on that partition, the partition will
> > >     have just 'y' records (as opposed to 'x union y' with upsert)
> > > - Overwrite entire table with new records
> > >   - Overwrite all partitions in the table
> > >
> > > Usecases:
> > >
> > > - Tables where the majority of records change every cycle. So it
> > >   is likely efficient to write new data instead of doing upserts.
> > >
> > > - Operational tasks to fix a specific corrupted partition. We can
> > >   do 'insert overwrite' on that partition with records from the
> > >   source. This can be much faster than restore and replay for some
> > >   data sources.
> > >
> > > The functionality will be similar to the Hive definition of
> > > 'insert overwrite'. But, doing this in Hoodie will provide better
> > > isolation between writer and readers. I can share possible
> > > implementation choices and some nuances if the community thinks
> > > this is a useful feature to add.
> > >
> > > Appreciate any feedback.
> > >
> > > Thanks
> > >
> > > Satish
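
To make the semantics discussed in the thread concrete, here is a minimal sketch of a toy in-memory model. It is not Hudi code and uses no Hudi API: `FileGroup`, `read_partition`, `upsert`, `insert_overwrite`, and the round-robin record assignment are all hypothetical names and choices made for illustration. It only assumes what the thread states: readers see the latest file slice of each file group, and an overwrite must write a new (possibly empty) slice for every file group so that old data is no longer visible.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy file group: an ordered list of file slices, newest last."""
    file_id: str
    slices: list = field(default_factory=list)  # each slice maps key -> value

    def latest(self):
        # Readers only see the newest file slice of a file group.
        return self.slices[-1] if self.slices else {}


def read_partition(file_groups):
    """Return what a query sees: the latest slice of every file group."""
    visible = {}
    for fg in file_groups:
        visible.update(fg.latest())
    return visible


def upsert(file_groups, records):
    """Merge incoming records into existing data ('x union y')."""
    routed = {fg.file_id: {} for fg in file_groups}
    for key, value in records.items():
        # Route updates to the group that holds the key; new keys go to
        # the first group (an arbitrary choice for this sketch).
        owner = next((fg for fg in file_groups if key in fg.latest()),
                     file_groups[0])
        routed[owner.file_id][key] = value
    for fg in file_groups:
        fg.slices.append({**fg.latest(), **routed[fg.file_id]})


def insert_overwrite(file_groups, records):
    """Replace partition contents: every file group gets a new slice."""
    items = list(records.items())
    for i, fg in enumerate(file_groups):
        # Round-robin assignment (hypothetical). Groups that receive no
        # records still get an empty slice, which hides their old data.
        fg.slices.append(dict(items[i::len(file_groups)]))


def make_partition():
    """Partition with 'x' = three records spread over three file groups."""
    return [
        FileGroup("fg-0", [{"a": 1}]),
        FileGroup("fg-1", [{"b": 2}]),
        FileGroup("fg-2", [{"c": 3}]),
    ]
```

With two incoming records and three file groups, `upsert` leaves the union of old and new records visible, while `insert_overwrite` leaves only the incoming records visible and gives the third file group an empty slice. If that empty slice were skipped, record `c` would still be returned by `read_partition`, which is the visibility problem Vinoth points out.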
