XinyaoTian commented on PR #6382: URL: https://github.com/apache/hudi/pull/6382#issuecomment-1246305177
The feature we implemented works as follows. Here is a simple but useful example that shows directly what this RFC does.

Suppose a table is configured with multiple event-time fields, for example `hoodie.payload.combine.fields=a_ts,b_ts`, rather than the single field Hudi currently supports, `hoodie.payload.combine.field=ts`.

The table contains one record:

```sql
spark-sql> select * from test_db.hudi_payload_test_03;
20220622111029695 20220622111029695_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-67-4219_20220622111029695.parquet 1 a_101 101 b_101 101 0 DD
Time taken: 0.858 seconds, Fetched 1 row(s)
```

We upsert this record with a larger value in the `b_ts` field, while every field related to `a` is null:

```sql
INSERT INTO test_db.hudi_payload_test_03
SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_105_New_record' AS b_info, 105 AS b_ts, 0 AS ts, 'DD' AS pt;
```

The result should look like this: only the columns related to `b` are updated, and the `a` columns remain unchanged.

```sql
spark-sql> select * from test_db.hudi_payload_test_03;
20220622111939468 20220622111939468_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-30-2209_20220622111939468.parquet 1 a_101 101 b_105_New_record 105 0 DD
Time taken: 0.496 seconds, Fetched 1 row(s)
```

If we upsert a smaller value in the `b_ts` field, nothing happens: neither the null fields nor the fields containing values change the stored record.
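The per-group rule illustrated by these examples can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the payload implementation from the PR: `FIELD_GROUPS`, `merge_by_event_time`, and the idea that `a_info` is governed by `a_ts` (and `b_info` by `b_ts`) are assumptions inferred from the example, and the handling of equal event times is a guess, since the comment does not state it.

```python
from typing import Any, Dict, List

# Hypothetical mapping from each event-time field to the columns it governs.
# The comment only names the event-time fields (a_ts, b_ts); grouping
# a_info under a_ts and b_info under b_ts is inferred from the example.
FIELD_GROUPS: Dict[str, List[str]] = {
    "a_ts": ["a_info", "a_ts"],
    "b_ts": ["b_info", "b_ts"],
}


def merge_by_event_time(stored: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `stored` in which each field group is updated independently."""
    merged = dict(stored)
    for ts_field, columns in FIELD_GROUPS.items():
        new_ts = incoming.get(ts_field)
        old_ts = stored.get(ts_field)
        # A null event time is treated as "no update for this group".
        if new_ts is None:
            continue
        # Assumption: only a strictly larger event time wins (ties keep the stored values).
        if old_ts is None or new_ts > old_ts:
            for column in columns:
                merged[column] = incoming.get(column)
    return merged


# The stored row from the example (Hudi metadata columns omitted).
stored = {"public_id": 1, "a_info": "a_101", "a_ts": 101,
          "b_info": "b_101", "b_ts": 101, "ts": 0, "pt": "DD"}

# Upsert with a larger b_ts and null a-columns: only the b group changes.
newer_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_105_New_record", "b_ts": 105, "ts": 0, "pt": "DD"}
after_first = merge_by_event_time(stored, newer_b)

# Upsert with a smaller b_ts: the stored row is left untouched.
older_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_99_Some_Record", "b_ts": 99, "ts": 0, "pt": "DD"}
after_second = merge_by_event_time(after_first, older_b)
```

Columns outside any group (`public_id`, `ts`, `pt`) are simply carried over from the stored row in this sketch; how the real payload treats them is not covered by the comment.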
For example, this upsert carries `b_ts = 99`, which is smaller than the stored value of 105:

```sql
INSERT INTO test_db.hudi_payload_test_03
SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_99_Some_Record' AS b_info, 99 AS b_ts, 0 AS ts, 'DD' AS pt;
```

The record is unchanged:

```sql
spark-sql> select * from test_db.hudi_payload_test_03;
20220622112743351 20220622112743351_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-69-4422_20220622112743351.parquet 1 a_101 101 b_105_New_record 105 0 DD
Time taken: 0.501 seconds, Fetched 1 row(s)
```

With this feature, data developers can combine several tables into one table (and keep everything up to date through streaming ingestion), then use only that table for further work such as ML algorithms, AI training, or big-screen visualization. This makes such pipelines much faster and simpler to build.

Hope my example is useful for understanding the feature provided by our RFC :)

@yihua @prasannarajaperumal @alexeykudinkin

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
