IIRC _hoodie_record_key was supposed to be this standardized key field. :)
Anyway, it's good to provide this option to the user.
So +1 for an RFC/further discussion.

To level set, I want to also share some of the benefits of having an
explicit key column.
a) If you build your data lake using a bunch of Hudi tables, you now have a
standardized data model.
b) Even if your key generator changes, it does not affect the existing
data's keys, and updates will be matched correctly.
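
To make (b) concrete, here is a rough sketch in plain Python. It is not
Hudi's API: the key generator functions, the helper names, and the sample
columns (uuid, region, fare) are all made up for illustration. It only shows
the difference between a key that is stored with the record at write time and
a key that is recomputed from other columns at read time.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
KeyGenerator = Callable[[Record], str]


def simple_key(record: Record) -> str:
    """Hypothetical key generator: the key is a single column."""
    return str(record["uuid"])


def composite_key(record: Record) -> str:
    """Hypothetical replacement key generator: the key spans two columns."""
    return f"{record['region']}:{record['uuid']}"


def write_with_stored_key(records: List[Record], keygen: KeyGenerator) -> List[Record]:
    """Explicit key column: the key is computed once and stored with the row."""
    return [{**r, "_hoodie_record_key": keygen(r)} for r in records]


def virtual_key(record: Record, keygen: KeyGenerator) -> str:
    """Virtual key column: nothing is stored, the key is recomputed on demand."""
    return keygen(record)


rows = [{"uuid": "a1", "region": "us", "fare": 10.0}]

# Rows written while simple_key was the configured generator.
stored = write_with_stored_key(rows, simple_key)

# Later the configured generator changes to composite_key.
# The stored key of the existing row is unchanged ...
stored_key_after_change = stored[0]["_hoodie_record_key"]
# ... while a virtual key for the same row now follows the new generator.
virtual_key_after_change = virtual_key(rows[0], composite_key)
```

With a stored key, an update for that row still matches on the key written
originally. With a virtual key, the key of existing data depends on whichever
generator is configured at read time, so an RFC would need to say how
generator changes are handled.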

On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
<v.bal...@ymail.com.invalid> wrote:

>  +1. This should be good to have as an option. If everybody agrees, please
> go ahead with RFC and we can discuss details there.
> Balaji.V
>
> On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> <m...@uber.com.invalid> wrote:
>
>  Hi everyone!
>
> I was hoping to discuss adding support for making `_hoodie_record_key` a
> virtual column :)
>
> Context:
> Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
> file. In our production systems at Uber however, _hoodie_record_key
> contains data that can be found in a different column (or set of columns).
> This means that we are storing duplicated data.
>
> Proposal:
> In the interest of improving storage efficiency, we could add confs /
> abstract classes that can construct the _hoodie_record_key given other
> columns. That way we do not have to store duplicated data on DFS.
>
> Any thoughts on this?
>
> Best,
> Modi
>
