Thanks, Satish!  Will review!

On Fri, May 8, 2020 at 4:38 PM Satish Kotha <[email protected]>
wrote:

> Hello everyone,
>
> I started an RFC here:
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API
>
> Appreciate any feedback.
>
> Thanks
> Satish
>
> On Tue, Apr 21, 2020 at 9:34 AM nishith agarwal <[email protected]>
> wrote:
>
> > +1, thanks for starting this effort Satish!
> >
> > -Nishith
> >
> > On Fri, Apr 17, 2020 at 2:26 PM Vinoth Chandar <[email protected]>
> > wrote:
> >
> > > Thanks Satish!
> > >
> > > On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha <[email protected]>
> > > wrote:
> > >
> > > > Thanks for the interesting discussion. I will start an RFC as
> > > > suggested and discuss the points brought up in this thread.
> > > >
> > > >
> > > > On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan
> > > > <[email protected]> wrote:
> > > >
> > > > >
> > > > > >A new file slice (empty parquet) is indeed generated for every
> > > > > file group in a partition.
> > > > > >> we could just reuse the existing file groups right? probably
> > > > > a bit hacky...
> > > > > Sorry for the confusion. I meant to say that the empty file slice
> > > > > is only for file groups that do not have any incoming records
> > > > > assigned. This is for the case when we have too few incoming
> > > > > records to fill all existing file groups. Existing file groups
> > > > > will be reused.
> > > > > Agreed on the magic part.
> > > > > Balaji.V
> > > > >
> > > > > On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar
> > > > > <[email protected]> wrote:
> > > > >
> > > > > >A new file slice (empty parquet) is indeed generated for every
> > > > > file group in a partition.
> > > > > we could just reuse the existing file groups right? probably a
> > > > > bit hacky...
> > > > >
> > > > > >we can encode some MAGIC in the write-token component for Hudi
> > > > > readers to skip these files so that they can be safely removed.
> > > > > This kind of MAGIC worries me :) .. if it comes to that, I
> > > > > suggest we get a version of metadata management along the lines
> > > > > of RFC-15/timeline server going before implementing this.
> > > > >
> > > > > On Thu, Apr 16, 2020 at 10:55 AM [email protected]
> > > > > <[email protected]> wrote:
> > > > >
> > > > > >  Satish,
> > > > > > Thanks for the proposal. I think an RFC would be useful here.
> > > > > > Let me know your thoughts. It would be good to nail down other
> > > > > > details, like whether/how to deal with external index
> > > > > > management with this API.
> > > > > > Thanks,
> > > > > > Balaji.V
> > > > > >
> > > > > > On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji
> > > > > > Varadarajan <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > +1 from me. This is a really cool feature.
> > > > > > Yes, a new file slice (empty parquet) is indeed generated for
> > > > > > every file group in a partition.
> > > > > > Regarding cleaning these "empty" file slices eventually by the
> > > > > > cleaner (to avoid cases where there are too many of them lying
> > > > > > around) in a safe way, we can encode some MAGIC in the
> > > > > > write-token component for Hudi readers to skip these files so
> > > > > > that they can be safely removed.
> > > > > > For metadata management, I think it would be useful to
> > > > > > distinguish between this API and other insert APIs. At the very
> > > > > > least, we would need a different operation type, which can be
> > > > > > achieved with the same API (with flags).
> > > > > > Balaji.V
> > > > > >
> > > > > > On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >  Hi Satish,
> > > > > >
> > > > > > Thanks for starting this.. Your use-cases do sound very
> > > > > > valuable to support. So +1 from me.
> > > > > >
> > > > > > IIUC, you are implementing a partition-level overwrite, where
> > > > > > existing filegroups will be retained, but instead of merging,
> > > > > > you will just reuse the file names and write the incoming
> > > > > > records into new file slices?
> > > > > > You probably already thought of this, but one thing to watch
> > > > > > out for is: we should generate a new file slice for every file
> > > > > > group in a partition. Otherwise, old data will be visible to
> > > > > > queries.
> > > > > >
> > > > > > If so, that makes sense to me. We can discuss more on whether
> > > > > > we can extend the bulk_insert() API with additional flags
> > > > > > instead of a new insertOverwrite() API..
> > > > > >
> > > > > > Others, thoughts?
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > I want to discuss adding a new high-level API
> > > > > > > 'insertOverwrite' on HoodieWriteClient. This API can be used
> > > > > > > to
> > > > > > >
> > > > > > >    - Overwrite specific partitions with new records
> > > > > > >       - Example: a partition has 'x' records. If insert
> > > > > > >         overwrite is done with 'y' records on that partition,
> > > > > > >         the partition will have just 'y' records (as opposed
> > > > > > >         to 'x union y' with upsert)
> > > > > > >    - Overwrite the entire table with new records
> > > > > > >       - Overwrite all partitions in the table
> > > > > > >
> > > > > > > Use cases:
> > > > > > >
> > > > > > > - Tables where the majority of records change every cycle, so
> > > > > > >   it is likely more efficient to write new data instead of
> > > > > > >   doing upserts.
> > > > > > >
> > > > > > > - Operational tasks to fix a specific corrupted partition. We
> > > > > > >   can do 'insert overwrite' on that partition with records
> > > > > > >   from the source. This can be much faster than restore and
> > > > > > >   replay for some data sources.
> > > > > > > The functionality will be similar to the Hive definition of
> > > > > > > 'insert overwrite'. But doing this in Hoodie will provide
> > > > > > > better isolation between writers and readers. I can share
> > > > > > > possible implementation choices and some nuances if the
> > > > > > > community thinks this is a useful feature to add.
> > > > > > >
> > > > > > >
> > > > > > > Appreciate any feedback.
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Satish
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
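
For readers of this thread, the point about generating a new file slice for
every file group can be illustrated with a toy model. This is only a sketch
under stated assumptions, not Hudi code: the Partition class, the
insert_overwrite function, the write_empty_slices flag and the round-robin
record assignment are all hypothetical names and choices made up for
illustration. The only behavior taken from the thread is that readers see the
latest file slice of each file group, so a file group that receives no new
slice keeps exposing old data.

```python
class Partition:
    """Toy partition: each file group holds a list of file slices, oldest first.

    Hypothetical model for illustration only; not the Hudi API.
    """

    def __init__(self, num_file_groups):
        self.file_groups = {f"fg-{i}": [] for i in range(num_file_groups)}

    def write_slices(self, assignments):
        # Append one new file slice per file group named in `assignments`.
        for fg_id, records in assignments.items():
            self.file_groups[fg_id].append(list(records))

    def read(self):
        # A query sees only the latest file slice of every file group.
        visible = []
        for slices in self.file_groups.values():
            if slices:
                visible.extend(slices[-1])
        return sorted(visible)


def insert_overwrite(partition, records, write_empty_slices=True):
    """Overwrite a partition by reusing its existing file groups.

    Records are spread round-robin over the file groups (an assumption).
    When `write_empty_slices` is True, file groups that received no records
    still get a new, empty slice so their old data stops being visible.
    """
    fg_ids = list(partition.file_groups)
    assignments = {}
    for i, record in enumerate(records):
        assignments.setdefault(fg_ids[i % len(fg_ids)], []).append(record)
    if write_empty_slices:
        for fg_id in fg_ids:
            assignments.setdefault(fg_id, [])
    partition.write_slices(assignments)


def seeded_partition():
    # Three file groups holding the 'x' records a, b and c.
    p = Partition(3)
    p.write_slices({"fg-0": ["a"], "fg-1": ["b"], "fg-2": ["c"]})
    return p


stale = seeded_partition()
insert_overwrite(stale, ["y1"], write_empty_slices=False)
print(stale.read())  # old records b and c are still visible

clean = seeded_partition()
insert_overwrite(clean, ["y1"], write_empty_slices=True)
print(clean.read())  # only the new record is visible
```

In this model the empty slices are exactly the "empty parquet" files discussed
above: they exist only for file groups with no incoming records, which is also
why the thread goes on to ask how the cleaner can remove them safely.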
