Re: [DISCUSS] Insert Overwrite with snapshot isolation

Vinoth Chandar Fri, 17 Apr 2020 14:27:23 -0700

Thanks Satish!

On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha <[email protected]>
wrote:


> Thanks for the interesting discussion. I will start an RFC as suggested
> and discuss the points brought up in this thread.
>
>
> On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan
> <[email protected]> wrote:
>
> >
> > >A new file slice (empty parquet) is indeed generated for every file
> > >group in a partition.
> > >> we could just reuse the existing file groups right? probably is bit
> > >> hacky...
> > Sorry for the confusion. I meant to say the empty file slice is only for
> > file groups that do not have any incoming records assigned. This is the
> > case when there are too few incoming records to fill all existing
> > file groups. Existing file groups will be reused.
> > Agree on the magic part.
> > Balaji.V
> >
> > On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar
> > <[email protected]> wrote:
> >
> >  >A new file slice (empty parquet) is indeed generated for every file
> >  >group in a partition.
> > We could just reuse the existing file groups, right? Probably a bit
> > hacky...
> >
> > >we can encode some MAGIC in the write-token component for Hudi readers
> > >to skip these files so that they can be safely removed.
> > This kind of MAGIC worries me :) .. if it comes to that, I suggest we
> > get a version of metadata management along the lines of RFC-15/timeline
> > server going before implementing this.
> >
> > On Thu, Apr 16, 2020 at 10:55 AM [email protected] <[email protected]>
> > wrote:
> >
> > >  Satish,
> > > Thanks for the proposal. I think an RFC would be useful here. Let me
> > > know your thoughts. It would be good to nail down other details, like
> > > whether/how to deal with external index management with this API.
> > > Thanks,
> > > Balaji.V
> > >    On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > >
> > > +1 from me. This is a really cool feature.
> > > Yes, a new file slice (empty parquet) is indeed generated for every
> > > file group in a partition.
> > > Regarding eventually cleaning up these "empty" file slices with the
> > > cleaner in a safe way (to avoid too many of them lying around), we can
> > > encode some MAGIC in the write-token component for Hudi readers to skip
> > > these files so that they can be safely removed.
> > > For metadata management, I think it would be useful to distinguish
> > > between this API and other insert APIs. At the very least, we would
> > > need a different operation type, which can be achieved with the same
> > > API (with flags).
> > > Balaji.V
> > >
> > >    On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar <
> > > [email protected]> wrote:
> > >
> > >  Hi Satish,
> > >
> > > Thanks for starting this.. Your use cases do sound very valuable to
> > > support. So +1 from me.
> > >
> > > IIUC, you are implementing a partition-level overwrite, where existing
> > > file groups will be retained, but instead of merging, you will just
> > > reuse the file names and write the incoming records into new file
> > > slices?
> > > You probably already thought of this, but one thing to watch out for
> > > is: we should generate a new file slice for every file group in a
> > > partition. Otherwise, old data will be visible to queries.
> > >
> > > If so, that makes sense to me. We can discuss more on whether we can
> > > extend the bulk_insert() API with additional flags instead of a new
> > > insertOverwrite() API..
> > >
> > > Others, thoughts?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <[email protected]>
> > > wrote:
> > >
> > > > Hello
> > > >
> > > > I want to discuss adding a new high level API 'insertOverwrite' on
> > > > HoodieWriteClient. This API can be used to
> > > >
> > > >    - Overwrite specific partitions with new records
> > > >      - Example: a partition has 'x' records. If insert overwrite is
> > > >        done with 'y' records on that partition, the partition will
> > > >        have just 'y' records (as opposed to 'x union y' with upsert)
> > > >    - Overwrite the entire table with new records
> > > >      - Overwrite all partitions in the table
> > > >
> > > > Use cases:
> > > >
> > > > - Tables where the majority of records change every cycle, so it is
> > > > likely more efficient to write new data instead of doing upserts.
> > > >
> > > > - Operational tasks to fix a specific corrupted partition. We can do
> > > > 'insert overwrite' on that partition with records from the source.
> > > > This can be much faster than restore and replay for some data sources.
> > > >
> > > > The functionality will be similar to the Hive definition of 'insert
> > > > overwrite'. But doing this in Hoodie will provide better isolation
> > > > between writers and readers. I can share possible implementation
> > > > choices and some nuances if the community thinks this is a useful
> > > > feature to add.
> > > >
> > > >
> > > > Appreciate any feedback.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Satish
> > > >
> > >
> >
>
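
The file-slice behaviour discussed in this thread can be illustrated with a
small toy model in Python. This is only a sketch under assumptions, not Hudi
code: FileGroup, insert_overwrite, read, sample_partition, max_per_group and
the write_empty_slices flag are hypothetical names invented for illustration,
records are plain strings, and commit times are integers. It models the idea
that each file group keeps a list of file slices, that readers see the newest
slice as of their commit time, and that file groups receiving no incoming
records get an empty slice so their old data is hidden.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    file_id: str
    # (commit_time, records) pairs, appended in increasing commit order.
    slices: list = field(default_factory=list)

    def latest_as_of(self, commit_time):
        """Records in the newest slice written at or before commit_time."""
        visible = [recs for ts, recs in self.slices if ts <= commit_time]
        return visible[-1] if visible else []


def insert_overwrite(partition, commit_time, incoming,
                     max_per_group=2, write_empty_slices=True):
    """Toy insert overwrite: reuse existing file groups for the new records.

    max_per_group stands in for a file-size limit (hypothetical).
    """
    chunks = [incoming[i:i + max_per_group]
              for i in range(0, len(incoming), max_per_group)]
    # Create extra file groups if the incoming data does not fit.
    while len(partition) < len(chunks):
        partition.append(FileGroup(f"fg-{len(partition)}"))
    for idx, group in enumerate(partition):
        if idx < len(chunks):
            group.slices.append((commit_time, chunks[idx]))
        elif write_empty_slices:
            # No incoming records for this group: an empty slice hides old data.
            group.slices.append((commit_time, []))


def read(partition, as_of):
    """Snapshot read: union of the latest slice of every file group."""
    out = []
    for group in partition:
        out.extend(group.latest_as_of(as_of))
    return out


def sample_partition():
    """A partition with five 'x' records spread over three file groups."""
    return [
        FileGroup("fg-0", [(1, ["x1", "x2"])]),
        FileGroup("fg-1", [(1, ["x3", "x4"])]),
        FileGroup("fg-2", [(1, ["x5"])]),
    ]
```

In this model, overwriting sample_partition() at commit 2 with three 'y'
records leaves only those 'y' records visible to a reader at commit 2, while
a reader pinned to commit 1 still sees all five 'x' records. Passing
write_empty_slices=False reproduces the problem raised in the thread: the
third file group gets no new slice, so its old record stays visible to
queries.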
