> A new file slice (empty parquet) is indeed generated for every file group in a partition.

We could just reuse the existing file groups, right? That is probably a bit hacky...
> we can encode some MAGIC in the write-token component for Hudi readers to skip these files so that they can be safely removed.

This kind of MAGIC worries me :) .. if it comes to that, I suggest we get a version of metadata management along the lines of RFC-15/timeline server going before implementing this.

On Thu, Apr 16, 2020 at 10:55 AM vbal...@apache.org <vbal...@apache.org> wrote:

> Satish,
> Thanks for the proposal. I think an RFC would be useful here. Let me know
> your thoughts. It would be good to nail down other details, like
> whether/how to deal with external index management with this API.
> Thanks,
> Balaji.V
>
> On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
> <v.bal...@ymail.com.invalid> wrote:
>
> +1 from me. This is a really cool feature.
> Yes, a new file slice (empty parquet) is indeed generated for every file
> group in a partition.
> Regarding cleaning these "empty" file slices eventually via the cleaner
> (to avoid cases where too many of them are lying around) in a safe way:
> we can encode some MAGIC in the write-token component for Hudi readers to
> skip these files so that they can be safely removed.
> For metadata management, I think it would be useful to distinguish between
> this API and other insert APIs. At the very least, we would need a
> different operation type, which can be achieved with the same API (with
> flags).
> Balaji.V
>
> On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar
> <vin...@apache.org> wrote:
>
> Hi Satish,
>
> Thanks for starting this. Your use cases do sound very valuable to
> support, so +1 from me.
>
> IIUC, you are implementing a partition-level overwrite, where existing
> file groups will be retained, but instead of merging, you will just reuse
> the file names and write the incoming records into new file slices?
> You probably already thought of this, but one thing to watch out for:
> we should generate a new file slice for every file group in a partition.
> Otherwise, old data will be visible to queries.
>
> If so, that makes sense to me. We can discuss more whether we can extend
> the bulk_insert() API with additional flags instead of adding a new
> insertOverwrite() API.
>
> Others, thoughts?
>
> Thanks
> Vinoth
>
> On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha
> <satishko...@uber.com.invalid> wrote:
>
> > Hello
> >
> > I want to discuss adding a new high-level API 'insertOverwrite' on
> > HoodieWriteClient. This API can be used to:
> >
> > - Overwrite specific partitions with new records.
> >   Example: a partition has 'x' records. If insert overwrite is done with
> >   'y' records on that partition, the partition will have just the 'y'
> >   records (as opposed to 'x union y' with upsert).
> > - Overwrite the entire table with new records, i.e. overwrite all
> >   partitions in the table.
> >
> > Use cases:
> >
> > - Tables where the majority of records change every cycle, so it is
> >   likely more efficient to write new data than to do upserts.
> > - Operational tasks to fix a specific corrupted partition. We can do
> >   'insert overwrite' on that partition with records from the source,
> >   which can be much faster than restore and replay for some data
> >   sources.
> >
> > The functionality will be similar to the Hive definition of 'insert
> > overwrite', but doing this in Hoodie will provide better isolation
> > between writers and readers. I can share possible implementation choices
> > and some nuances if the community thinks this is a useful feature to
> > add.
> >
> > Appreciate any feedback.
> >
> > Thanks
> >
> > Satish
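[Editor's note] The partition-level semantics Satish proposes ('y' records replace 'x', rather than merging into 'x union y') can be illustrated with a toy model. This is not Hudi code; plain Python dicts stand in for keyed records, and the function names are illustrative only:

```python
# Toy model contrasting upsert vs. insertOverwrite semantics on one partition.
# Keys model Hudi record keys; values model the record payloads.

def upsert(partition, incoming):
    """Merge incoming records into the partition by key ('x union y')."""
    merged = dict(partition)
    merged.update(incoming)  # incoming wins on key collisions
    return merged

def insert_overwrite(partition, incoming):
    """Replace the partition's contents entirely with the incoming records."""
    return dict(incoming)

existing = {"k1": "a", "k2": "b", "k3": "c"}   # partition holds 'x' records
incoming = {"k2": "B", "k4": "d"}              # 'y' records arrive

print(upsert(existing, incoming))
# {'k1': 'a', 'k2': 'B', 'k3': 'c', 'k4': 'd'}  -> 'x union y'
print(insert_overwrite(existing, incoming))
# {'k2': 'B', 'k4': 'd'}                        -> just 'y'
```

The operational use case follows directly: to repair a corrupted partition, overwriting with clean source records avoids replaying history.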
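[Editor's note] Vinoth's caveat about generating a new file slice for every file group can also be sketched. This is a simplified model of reader behavior, not Hudi internals: a reader resolves each file group to its latest file slice, so any group that does not receive a new (possibly empty) slice keeps exposing its old data.

```python
# Toy model: each file group is a list of file slices (newest last),
# and each slice is a list of records. Readers see the latest slice
# of every file group in the partition.

def read_partition(file_groups):
    """Return the visible records: the latest slice of each file group."""
    visible = []
    for slices in file_groups.values():
        visible.extend(slices[-1])  # latest slice wins
    return visible

# A partition with two file groups, each holding one old slice.
file_groups = {
    "fg1": [["old1", "old2"]],
    "fg2": [["old3"]],
}

# An overwrite that writes a new slice only to fg1 leaks fg2's old data:
file_groups["fg1"].append(["new1"])
print(sorted(read_partition(file_groups)))  # ['new1', 'old3']  <- stale data visible

# Writing an empty slice to fg2 as well hides the stale records:
file_groups["fg2"].append([])
print(sorted(read_partition(file_groups)))  # ['new1']
```

This is why the thread discusses both the empty parquet files and how the cleaner might eventually remove them safely.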