Thanks, Satish! Will review!

On Fri, May 8, 2020 at 4:38 PM Satish Kotha <[email protected]> wrote:
> Hello everyone,
>
> I started an RFC here:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API
>
> Appreciate any feedback.
>
> Thanks
> Satish
>
> On Tue, Apr 21, 2020 at 9:34 AM nishith agarwal <[email protected]> wrote:
>
> > +1, thanks for starting this effort Satish!
> >
> > -Nishith
> >
> > On Fri, Apr 17, 2020 at 2:26 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > Thanks Satish!
> > >
> > > On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha <[email protected]> wrote:
> > >
> > > > Thanks for the interesting discussion. I will start an RFC as suggested
> > > > and discuss the points brought up in this thread.
> > > >
> > > > On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan
> > > > <[email protected]> wrote:
> > > >
> > > > > >A new file slice (empty parquet) is indeed generated for every file
> > > > > group in a partition.
> > > > > >> we could just reuse the existing file groups right? probably is a
> > > > > bit hacky...
> > > > > Sorry for the confusion. I meant to say the empty file slice is only
> > > > > for file groups that do not have any incoming records assigned. This
> > > > > is for the case when we have fewer incoming records to fit into all
> > > > > existing file groups. Existing file groups will be reused.
> > > > > Agree on the magic part.
> > > > > Balaji.V
> > > > >
> > > > > On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar
> > > > > <[email protected]> wrote:
> > > > >
> > > > > >A new file slice (empty parquet) is indeed generated for every file
> > > > > group in a partition.
> > > > > we could just reuse the existing file groups right? probably is a bit
> > > > > hacky...
> > > > >
> > > > > >we can encode some MAGIC in the write-token component for Hudi
> > > > > readers to skip these files so that they can be safely removed.
> > > > > This kind of MAGIC worries me :) ..
> > > > > if it comes to that, I suggest, let's get a version of metadata
> > > > > management along the lines of RFC-15/timeline server going before
> > > > > implementing this.
> > > > >
> > > > > On Thu, Apr 16, 2020 at 10:55 AM [email protected] <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Satish,
> > > > > > Thanks for the proposal. I think an RFC would be useful here. Let me
> > > > > > know your thoughts. It would be good to nail other details like
> > > > > > whether/how to deal with external index management with this API.
> > > > > > Thanks,
> > > > > > Balaji.V
> > > > > >
> > > > > > On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > +1 from me. This is a really cool feature.
> > > > > > Yes, a new file slice (empty parquet) is indeed generated for every
> > > > > > file group in a partition.
> > > > > > Regarding cleaning these "empty" file slices eventually by the
> > > > > > cleaner (to avoid cases where there are too many of them lying
> > > > > > around) in a safe way, we can encode some MAGIC in the write-token
> > > > > > component for Hudi readers to skip these files so that they can be
> > > > > > safely removed.
> > > > > > For metadata management, I think it would be useful to distinguish
> > > > > > between this API and other insert APIs. At the very least, we would
> > > > > > need a different operation type, which can be achieved with the same
> > > > > > API (with flags).
> > > > > > Balaji.V
> > > > > >
> > > > > > On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Hi Satish,
> > > > > >
> > > > > > Thanks for starting this.. Your use-cases do sound very valuable to
> > > > > > support. So +1 from me.
> > > > > >
> > > > > > IIUC, you are implementing a partition level overwrite, where
> > > > > > existing filegroups will be retained, but instead of merging, you
> > > > > > will just reuse the file names and write the incoming records into
> > > > > > new file slices?
> > > > > > You probably already thought of this, but one thing to watch out for
> > > > > > is: we should generate a new file slice for every file group in a
> > > > > > partition. Otherwise, old data will be visible to queries.
> > > > > >
> > > > > > If so, that makes sense to me. We can discuss more on whether we can
> > > > > > extend the bulk_insert() API with additional flags instead of a new
> > > > > > insertOverwrite() API.
> > > > > >
> > > > > > Others, thoughts?
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > I want to discuss adding a new high level API 'insertOverwrite' on
> > > > > > > HoodieWriteClient. This API can be used to
> > > > > > >
> > > > > > > - Overwrite specific partitions with new records
> > > > > > >   - Example: partition has 'x' records. If insert overwrite is
> > > > > > >     done with 'y' records on that partition, the partition will
> > > > > > >     have just 'y' records (as opposed to 'x union y' with upsert)
> > > > > > > - Overwrite entire table with new records
> > > > > > >   - Overwrite all partitions in the table
> > > > > > >
> > > > > > > Use cases:
> > > > > > >
> > > > > > > - Tables where the majority of records change every cycle. So it is
> > > > > > >   likely efficient to write new data instead of doing upserts.
> > > > > > >
> > > > > > > - Operational tasks to fix a specific corrupted partition. We can
> > > > > > >   do 'insert overwrite' on that partition with records from the
> > > > > > >   source. This can be much faster than restore and replay for some
> > > > > > >   data sources.
> > > > > > >
> > > > > > > The functionality will be similar to the Hive definition of 'insert
> > > > > > > overwrite'. But doing this in Hoodie will provide better isolation
> > > > > > > between writer and readers. I can share possible implementation
> > > > > > > choices and some nuances if the community thinks this is a useful
> > > > > > > feature to add.
> > > > > > >
> > > > > > > Appreciate any feedback.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Satish
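
For illustration only, here is a minimal sketch of the semantics described in the original proposal: overwriting a partition that holds 'x' records with 'y' records leaves just 'y', whereas upsert leaves 'x union y'. This is not Hudi code. The table is modelled as a plain in-memory dict, and the function names `upsert`, `insert_overwrite` and `insert_overwrite_table` are hypothetical stand-ins, not the HoodieWriteClient API.

```python
from copy import deepcopy
from typing import Dict

# Hypothetical in-memory model (an assumption, not Hudi's storage layout):
# table = {partition_path: {record_key: value}}
Table = Dict[str, Dict[str, str]]


def upsert(table: Table, incoming: Table) -> Table:
    """Merge incoming records into existing partitions ('x union y')."""
    result = deepcopy(table)
    for partition, records in incoming.items():
        result.setdefault(partition, {}).update(records)
    return result


def insert_overwrite(table: Table, incoming: Table) -> Table:
    """Replace only the partitions present in the incoming batch (just 'y')."""
    result = deepcopy(table)
    for partition, records in incoming.items():
        result[partition] = dict(records)
    return result


def insert_overwrite_table(table: Table, incoming: Table) -> Table:
    """Replace every partition: the table holds only the incoming records."""
    return {partition: dict(records) for partition, records in incoming.items()}


# Partition 2020/04/15 has 'x' records k1, k2; the incoming batch has 'y' records.
table = {
    "2020/04/15": {"k1": "x1", "k2": "x2"},
    "2020/04/16": {"k3": "x3"},
}
incoming = {"2020/04/15": {"k2": "y2", "k9": "y9"}}

merged = upsert(table, incoming)
overwritten = insert_overwrite(table, incoming)
replaced = insert_overwrite_table(table, incoming)
```

In this model, `merged` keeps `k1` from the old data, `overwritten` drops it while leaving partition `2020/04/16` untouched, and `replaced` contains only the incoming partition.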
