Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Sivabalan Sat, 22 Aug 2020 06:09:59 -0700

Aah, yes. That’s right.

On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <[email protected]> wrote:


> All of the remaining meta fields compress very very nicely. They have
>
> almost no overhead.
>
>
>
> On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <[email protected]>
>
> wrote:
>
>
>
> > @sivabalan the current plan is to only add this for _hoodie_record_key. But
> > I'm hoping to make the implementation general enough to add other columns
> > as well going forward :)
>
> >
>
> > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]> wrote:
>
> >
>
> > > +1 for virtual record keys. Do you also propose to generalize this for
> > > partition path?
>
> > >
>
> > >
>
> > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <[email protected]>
> > > wrote:
>
> > >
>
> > > > This is a good option to have. :)
>
> > > >
>
> > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <[email protected]>
> > > > wrote:
>
> > > >
>
> > > > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > > > Anyways, it's good to provide this option to the user.
> > > > > So +1 for RFC/further discussion.
>
> > > > >
>
> > > > > To level set, I want to also share some of the benefits of having an
> > > > > explicit key column.
> > > > > a) If you build your data lake using a bunch of Hudi tables, you now have a
> > > > > standardized data model.
> > > > > b) Even if your key generator changes, it does not affect the existing
> > > > > data's keys, and updates will be matched correctly.
>
> > > > >
>
> > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
>
> > > > > <[email protected]> wrote:
>
> > > > >
>
> > > > > > +1. This should be good to have as an option. If everybody agrees, please
> > > > > > go ahead with RFC and we can discuss details there.
> > > > > > Balaji.V
> > > > > >
> > > > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > > > > > <[email protected]> wrote:
>
> > > > > >
>
> > > > > >  Hi everyone!
>
> > > > > >
>
> > > > > > I was hoping to discuss adding support for making `_hoodie_record_key` a
> > > > > > virtual column :)
>
> > > > > >
>
> > > > > > Context:
>
> > > > > > Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
> > > > > > file. In our production systems at Uber however, _hoodie_record_key
> > > > > > contains data that can be found in a different column (or set of columns).
> > > > > > This means that we are storing duplicated data.
>
> > > > > >
>
> > > > > > Proposal:
>
> > > > > > In the interest of improving storage efficiency, we could add confs /
> > > > > > abstract classes that can construct the _hoodie_record_key given other
> > > > > > columns. That way we do not have to store duplicated data on DFS.
>
> > > > > >
>
> > > > > > Any thoughts on this?
>
> > > > > >
>
> > > > > > Best,
>
> > > > > > Modi
>
> > > > > >
>
> > > > >
>
> > > >
>
> > >
>
> > >
>
> > > --
>
> > > Regards,
>
> > > -Sivabalan
>
> > >
>
> >
>
--
Regards,
-Sivabalan
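
For illustration, the proposal quoted above (confs / abstract classes that
construct `_hoodie_record_key` from other columns) could be sketched roughly
as follows. This is only a minimal Python sketch under stated assumptions, not
Hudi code: `VirtualKeyGenerator`, `ColumnsKeyGenerator`,
`strip_key_for_storage`, `with_virtual_key`, the `:` separator, and the sample
column names are all hypothetical and chosen just for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Mapping, Sequence

RECORD_KEY_FIELD = "_hoodie_record_key"


class VirtualKeyGenerator(ABC):
    """Hypothetical interface: derives the record key from other columns."""

    @abstractmethod
    def record_key(self, row: Mapping[str, Any]) -> str:
        ...


class ColumnsKeyGenerator(VirtualKeyGenerator):
    """Hypothetical generator: joins the configured columns into one key."""

    def __init__(self, key_columns: Sequence[str], separator: str = ":"):
        if not key_columns:
            raise ValueError("at least one key column is required")
        self.key_columns = list(key_columns)
        self.separator = separator

    def record_key(self, row: Mapping[str, Any]) -> str:
        # Concatenate the string form of each configured column, in order.
        return self.separator.join(str(row[c]) for c in self.key_columns)


def strip_key_for_storage(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Drop the materialized key before writing, so it is not stored twice."""
    return {k: v for k, v in row.items() if k != RECORD_KEY_FIELD}


def with_virtual_key(row: Mapping[str, Any],
                     generator: VirtualKeyGenerator) -> Dict[str, Any]:
    """Re-attach the key at read time by computing it from stored columns."""
    return {RECORD_KEY_FIELD: generator.record_key(row), **row}
```

In this sketch a row would be written without the key column and the key
recomputed on read, e.g. `with_virtual_key(stored_row,
ColumnsKeyGenerator(["city", "trip_id"]))`. It also makes the tradeoff raised
earlier in the thread visible: because the key is derived rather than stored,
changing the generator configuration changes the keys computed for existing
data.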
