Re: [DISCUSS] Insert Overwrite with snapshot isolation

Satish Kotha Fri, 17 Apr 2020 11:33:14 -0700

Thanks for the interesting discussion. I will start an RFC as suggested and
discuss the points brought up in this thread.



On Thu, Apr 16, 2020 at 11:44 AM Balaji Varadarajan
<[email protected]> wrote:

>
> >A new file slice (empty parquet) is indeed generated for every file group
> in a partition.
> >> We could just reuse the existing file groups, right? Probably a bit
> hacky...
> Sorry for the confusion. I meant to say the empty file slice is only for
> file-groups which do not have any incoming records assigned. This covers
> the case where there are too few incoming records to populate every
> existing file-group. Existing file groups will be reused.
> Agree on the magic part.
> Balaji.V
>
> On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar <
> [email protected]> wrote:
>
>  >A new file slice (empty parquet) is indeed generated for every file group
> in a partition.
> We could just reuse the existing file groups, right? Probably a bit
> hacky...
>
> >we can encode some MAGIC in the write-token component for Hudi readers to
> skip these files so that they can be safely removed.
> This kind of MAGIC worries me :) If it comes to that, I suggest we get a
> version of metadata management along the lines of RFC-15/timeline server
> going before implementing this.
>
> On Thu, Apr 16, 2020 at 10:55 AM [email protected] <[email protected]>
> wrote:
>
> > > Satish,
> > > Thanks for the proposal. I think an RFC would be useful here. Let me know
> > > your thoughts. It would be good to nail down other details, like
> > > whether/how to deal with external index management with this API.
> > > Thanks,
> > > Balaji.V
> >    On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> > <[email protected]> wrote:
> >
> >
> > +1 from me. This is a really cool feature.
> > > Yes, a new file slice (empty parquet) is indeed generated for every file
> > > group in a partition.
> > > Regarding eventually cleaning up these "empty" file slices with the
> > > cleaner in a safe way (to avoid having too many of them lying around),
> > > we can encode some MAGIC in the write-token component for Hudi readers to
> > > skip these files so that they can be safely removed.
> > > For metadata management, I think it would be useful to distinguish
> > > between this API and other insert APIs. At the very least, we would need
> > > a different operation type, which can be achieved with the same API
> > > (with flags).
> > Balaji.V
> >
> >    On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar <
> > [email protected]> wrote:
> >
> >  Hi Satish,
> >
> > > Thanks for starting this. Your use cases do sound very valuable to
> > > support. So +1 from me.
> >
> > IIUC, you are implementing a partition level overwrite, where existing
> > filegroups will be retained, but instead of merging, you will just reuse
> > the file names and write the incoming records into new file slices?
> > > You have probably already thought of this, but one thing to watch out
> > > for: we should generate a new file slice for every file group in a
> > > partition. Otherwise, old data will be visible to queries.
> > >
> > > If so, that makes sense to me. We can discuss more on whether we can
> > > extend the bulk_insert() API with additional flags instead of adding a
> > > new insertOverwrite() API.
> >
> > Others, thoughts?
> >
> > Thanks
> > Vinoth
> >
> > > On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha <[email protected]>
> > > wrote:
> >
> > > Hello
> > >
> > > I want to discuss adding a new high level API 'insertOverwrite' on
> > > HoodieWriteClient. This API can be used to
> > >
> > >    -
> > >
> > >    Overwrite specific partitions with new records
> > >    -
> > >
> > >      Example: partition has  'x' records. If insert overwrite is done
> > with
> > >      'y' records on that partition, the partition will have just 'y'
> > > records (as
> > >      opposed to  'x union y' with upsert)
> > >      -
> > >
> > >    Overwrite entire table with new records
> > >    -
> > >
> > >      Overwrite all partitions in the table
> > >
> > > > Use cases:
> > > >
> > > > - Tables where the majority of records change every cycle, so it is
> > > >   likely more efficient to write new data than to do upserts.
> > > >
> > > > - Operational tasks to fix a specific corrupted partition. We can do
> > > >   'insert overwrite' on that partition with records from the source.
> > > >   This can be much faster than restore and replay for some data
> > > >   sources.
> > > >
> > > > The functionality will be similar to the Hive definition of 'insert
> > > > overwrite'. But doing this in Hoodie will provide better isolation
> > > > between writers and readers. I can share possible implementation
> > > > choices and some nuances if the community thinks this is a useful
> > > > feature to add.
> > >
> > >
> > > Appreciate any feedback.
> > >
> > >
> > > Thanks
> > >
> > > Satish
> > >
> >
>
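
For readers skimming the thread, the file-slice point discussed above can be
illustrated with a small toy model. This is only a sketch under stated
assumptions, not Hudi code: FileGroup, read_partition, insert_overwrite and
the write_empty_slices flag are hypothetical names invented for illustration,
records are plain strings, and incoming records are simply spread round-robin
over the existing file groups. It shows why file groups that receive no
incoming records still need a new (empty) file slice: a reader that picks the
latest slice of each file group would otherwise still see the old records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileGroup:
    """Toy file group: an ordered list of file slices, newest last."""
    file_id: str
    slices: List[List[str]] = field(default_factory=list)

    def latest(self) -> List[str]:
        # Readers in this model only ever look at the newest slice.
        return self.slices[-1] if self.slices else []


def read_partition(groups: List[FileGroup]) -> List[str]:
    """Snapshot read: concatenate the latest slice of every file group."""
    records: List[str] = []
    for group in groups:
        records.extend(group.latest())
    return records


def insert_overwrite(groups: List[FileGroup], incoming: List[str],
                     write_empty_slices: bool = True) -> None:
    """Write incoming records as new slices on the existing file groups.

    Assumption: records are assigned round-robin. With write_empty_slices
    set, groups that got no records receive an empty slice that hides
    their old data.
    """
    if not groups:
        raise ValueError("toy model needs at least one existing file group")
    assigned = {group.file_id: [] for group in groups}
    for i, record in enumerate(incoming):
        assigned[groups[i % len(groups)].file_id].append(record)
    for group in groups:
        new_slice = assigned[group.file_id]
        if new_slice or write_empty_slices:
            group.slices.append(new_slice)


def make_partition() -> List[FileGroup]:
    """A partition holding 'x' records spread over three file groups."""
    return [
        FileGroup("fg-1", [["x1", "x2"]]),
        FileGroup("fg-2", [["x3"]]),
        FileGroup("fg-3", [["x4"]]),
    ]


# Fewer incoming records than file groups: one 'y' record, three groups.
safe = make_partition()
insert_overwrite(safe, ["y1"])
print(read_partition(safe))

# Same write, but skipping the empty slices leaves old 'x' records visible.
leaky = make_partition()
insert_overwrite(leaky, ["y1"], write_empty_slices=False)
print(read_partition(leaky))
```

In this model the first read returns only the 'y' record, while the second
also returns the stale 'x' records from the two file groups that were not
given a new slice. How the empty slices would later be cleaned up is left out
of the sketch, since that is the open question in the thread.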
