Thanks Satish!

On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha <[email protected]> wrote:
> Thanks for the interesting discussion. I will start an RFC as suggested and
> discuss the points brought up in this thread.
>
> On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan
> <[email protected]> wrote:
>
> > >A new file slice (empty parquet) is indeed generated for every file
> > >group in a partition.
> > >> we could just reuse the existing file groups right? probably is a bit
> > >> hacky...
> >
> > Sorry for the confusion. I meant to say the empty file slice is only for
> > file-groups which do not have any incoming records assigned. This is for
> > the case when we have fewer incoming records to fit into all existing
> > file-groups. Existing file groups will be reused.
> >
> > Agree, on the magic part.
> >
> > Balaji.V
> >
> > On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar
> > <[email protected]> wrote:
> >
> > >A new file slice (empty parquet) is indeed generated for every file
> > >group in a partition.
> > we could just reuse the existing file groups right? probably is a bit
> > hacky...
> >
> > >we can encode some MAGIC in the write-token component for Hudi readers
> > >to skip these files so that they can be safely removed.
> > This kind of MAGIC worries me :) .. if it comes to that, I suggest, let's
> > get a version of metadata management along lines of RFC-15/timeline
> > server going before implementing this.
> >
> > On Thu, Apr 16, 2020 at 10:55 AM [email protected] <[email protected]>
> > wrote:
> >
> > > Satish,
> > > Thanks for the proposal. I think an RFC would be useful here. Let me
> > > know your thoughts. It would be good to nail other details like
> > > whether/how to deal with external index management with this API.
> > > Thanks,
> > > Balaji.V
> > >
> > > On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > > +1 from me. This is a really cool feature.
> > > Yes, a new file slice (empty parquet) is indeed generated for every
> > > file group in a partition.
> > > Regarding cleaning these "empty" file slices eventually by cleaner (to
> > > avoid cases where there are too many of them lying around) in a safe
> > > way, we can encode some MAGIC in the write-token component for Hudi
> > > readers to skip these files so that they can be safely removed.
> > > For metadata management, I think it would be useful to distinguish
> > > between this API and other insert APIs. At the very least, we would
> > > need a different operation type which can be achieved with same API
> > > (with flags).
> > > Balaji.V
> > >
> > > On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> > > <[email protected]> wrote:
> > >
> > > Hi Satish,
> > >
> > > Thanks for starting this.. Your use-cases do sound very valuable to
> > > support. So +1 from me.
> > >
> > > IIUC, you are implementing a partition level overwrite, where existing
> > > filegroups will be retained, but instead of merging, you will just
> > > reuse the file names and write the incoming records into new file
> > > slices?
> > > You probably already thought of this, but one thing to watch out for
> > > is: we should generate a new file slice for every file group in a
> > > partition.. Otherwise, old data will be visible to queries.
> > >
> > > if so, that makes sense to me. We can discuss more on whether we can
> > > extend the bulk_insert() API with additional flags instead of a new
> > > insertOverwrite() API..
> > >
> > > Others, thoughts?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha
> > > <[email protected]> wrote:
> > >
> > > > Hello
> > > >
> > > > I want to discuss adding a new high level API 'insertOverwrite' on
> > > > HoodieWriteClient. This API can be used to
> > > >
> > > > - Overwrite specific partitions with new records
> > > >   - Example: partition has 'x' records. If insert overwrite is done
> > > >     with 'y' records on that partition, the partition will have just
> > > >     'y' records (as opposed to 'x union y' with upsert)
> > > > - Overwrite entire table with new records
> > > >   - Overwrite all partitions in the table
> > > >
> > > > Use cases:
> > > >
> > > > - Tables where the majority of records change every cycle. So it is
> > > >   likely efficient to write new data instead of doing upserts.
> > > > - Operational tasks to fix a specific corrupted partition. We can do
> > > >   'insert overwrite' on that partition with records from the source.
> > > >   This can be much faster than restore and replay for some data
> > > >   sources.
> > > >
> > > > The functionality will be similar to hive definition of 'insert
> > > > overwrite'. But, doing this in Hoodie will provide better isolation
> > > > between writer and readers. I can share possible implementation
> > > > choices and some nuances if the community thinks this is a useful
> > > > feature to add.
> > > >
> > > > Appreciate any feedback.
> > > >
> > > > Thanks
> > > >
> > > > Satish
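
For readers following the file-slice discussion in the thread, here is a minimal Python sketch of a toy model. It is not Hudi code and does not use any Hudi API: the names Partition, read_partition and insert_overwrite, the write_empty_slices flag, and the round-robin assignment of records to file groups are all assumptions made up for illustration. It only tries to show why a new (possibly empty) file slice is needed in every file group so that old data is not visible to queries.

```python
from typing import Dict, List

# Toy model (assumption, not Hudi's real layout): a partition maps a
# file-group id to its list of file slices, oldest first. Each file slice
# is simply a list of records.
Partition = Dict[str, List[List[str]]]


def read_partition(partition: Partition) -> List[str]:
    """Return what a query would see: the latest slice of each file group."""
    records: List[str] = []
    for group_id in sorted(partition):
        records.extend(partition[group_id][-1])
    return records


def insert_overwrite(
    partition: Partition,
    incoming: List[str],
    write_empty_slices: bool = True,
) -> None:
    """Write incoming records as new file slices on the existing file groups.

    Records are spread round-robin over the existing groups (a hypothetical
    assignment policy). When write_empty_slices is True, groups that receive
    no records still get a new, empty slice.
    """
    group_ids = sorted(partition)
    if not group_ids:
        raise ValueError("toy model expects at least one existing file group")

    # Decide which records go to which existing file group.
    new_slices: Dict[str, List[str]] = {gid: [] for gid in group_ids}
    for i, record in enumerate(incoming):
        new_slices[group_ids[i % len(group_ids)]].append(record)

    # Append the new slices; optionally skip groups that got no records.
    for gid in group_ids:
        if new_slices[gid] or write_empty_slices:
            partition[gid].append(new_slices[gid])
```

In this toy model, a partition with three file groups holding x1, x2 and x3 that is overwritten with two records y1 and y2 reads back as just y1 and y2 when empty slices are written. With write_empty_slices=False, the third file group keeps its old slice as the latest one, so x3 would still show up in reads, which is the situation Vinoth warns about.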
