Thanks for the proposal. I am a bit concerning about this feature. 
_hoodie_record_key is the primary key of the Hudi table and we need this field 
for indexing, sorting, merging e.t.c. We are using _hoodie_record_key very 
frequently so I can see the overhead on both write side and read side. I am not 
sure if this is worth it. Make _hoodie_partition_path virtual make more sense 
to me.

On the implementation side, if we support this as an option, we need special 
handles in many places. For example, the LogScanner, _hoodie_record_key is the 
key of the lookup Hashmap and this Map could be spilled to the disk as well. 
There should be significant amount of work from my perspective. 

-1 for making _hoodie_record_key virtual.
+1 for _hoodie_partition_path

Best Regards,
Gary Li
 

On 8/22/20, 9:09 PM, "Sivabalan" <[email protected]> wrote:

    Aah, yes. That’s right.

    On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <[email protected]> wrote:

    > All of the remaining meta fields compress very very nicely. They have
    >
    > almost no overhead.
    >
    >
    >
    > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <[email protected]>
    >
    > wrote:
    >
    >
    >
    > > @sivabalan the current plan is to only add this for hoodie_record_key.
    > But
    >
    > > I'm hoping to make the implementation general enough to add other 
columns
    >
    > > as well going forward :)
    >
    > >
    >
    > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]> wrote:
    >
    > >
    >
    > > > +1 for virtual record keys. Do you also propose to generalize this for
    >
    > > > partition path as well ?
    >
    > > >
    >
    > > >
    >
    > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <
    > [email protected]>
    >
    > > > wrote:
    >
    > > >
    >
    > > > > This is a good option to have. :)
    >
    > > > >
    >
    > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <[email protected]>
    >
    > > > wrote:
    >
    > > > >
    >
    > > > > > IIRC _hoodie_record_key was supposed to this standardized key
    > field.
    >
    > > :)
    >
    > > > > > Anyways, it's good to provide this option to the user.
    >
    > > > > > So +1 for. RFC/further discussion.
    >
    > > > > >
    >
    > > > > > To level set, I want to also share some of the benefits of having
    > an
    >
    > > > > > explicit key column.
    >
    > > > > > a) if you build your data lake using a bunch of hudi tables, now
    > you
    >
    > > > > have a
    >
    > > > > > standardized data model
    >
    > > > > > b) Even if your key generator changes, it does not affect the
    >
    > > existing
    >
    > > > > > data's keys. and updates will be matched correctly.
    >
    > > > > >
    >
    > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
    >
    > > > > > <[email protected]> wrote:
    >
    > > > > >
    >
    > > > > > >  +1. This should be good to have as an option. If everybody
    > agrees,
    >
    > > > > > please
    >
    > > > > > > go ahead with RFC and we can discuss details there.
    >
    > > > > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT,
    > Abhishek
    >
    > > > Modi
    >
    > > > > > > <[email protected]> wrote:
    >
    > > > > > >
    >
    > > > > > >  Hi everyone!
    >
    > > > > > >
    >
    > > > > > > I was hoping to discuss adding support for making
    >
    > > > `_hoodie_record_key`
    >
    > > > > a
    >
    > > > > > > virtual column :)
    >
    > > > > > >
    >
    > > > > > > Context:
    >
    > > > > > > Currently, _hoodie_record_key is written to DFS, as a column in
    > the
    >
    > > > > > Parquet
    >
    > > > > > > file. In our production systems at Uber however,
    > _hoodie_record_key
    >
    > > > > > > contains data that can be found in a different column (or set of
    >
    > > > > > columns).
    >
    > > > > > > This means that we are storing duplicated data.
    >
    > > > > > >
    >
    > > > > > > Proposal:
    >
    > > > > > > In the interest of improving storage efficiency, we could add
    >
    > > confs /
    >
    > > > > > > abstract classes that can construct the _hoodie_record_key given
    >
    > > > other
    >
    > > > > > > columns. That way we do not have to store duplicated data on 
DFS.
    >
    > > > > > >
    >
    > > > > > > Any thoughts on this?
    >
    > > > > > >
    >
    > > > > > > Best,
    >
    > > > > > > Modi
    >
    > > > > > >
    >
    > > > > >
    >
    > > > >
    >
    > > >
    >
    > > >
    >
    > > > --
    >
    > > > Regards,
    >
    > > > -Sivabalan
    >
    > > >
    >
    > >
    >
    > --
    Regards,
    -Sivabalan

Reply via email to