[GitHub] [hudi] XinyaoTian commented on pull request #6382: [HUDI-4612][RFC-59] RFC-59 Materials (RFC Proposal) Submission: "Multiple event_time Fields Latest Verification in a Single Table"

GitBox Tue, 13 Sep 2022 23:32:58 -0700


XinyaoTian commented on PR #6382:
URL: https://github.com/apache/hudi/pull/6382#issuecomment-1246305177


   The feature we implemented is shown below. Here is a simple but practical 
example that illustrates directly what this RFC does.
   
   Suppose a table is configured with multiple event-time fields, for example 
`hoodie.payload.combine.fields=a_ts,b_ts`, rather than the single field that 
Hudi currently supports through `hoodie.payload.combine.field=ts`. 
   
   We query the table and see that it has one record:
   
   ```sql
   spark-sql> select * from test_db.hudi_payload_test_03;
   20220622111029695       20220622111029695_0_1   public_id:1     pt=DD   
214f6985-fee5-4091-a65d-d52e9eb20634-0_0-67-4219_20220622111029695.parquet      
1       a_101   101     b_101   101     0       DD
   Time taken: 0.858 seconds, Fetched 1 row(s)
   ```
   
   We upsert this record with a larger value in the `b_ts` field, while every 
field related to `a` is null:
   ```sql
   INSERT INTO test_db.hudi_payload_test_03
   SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_105_New_record' AS 
b_info, 105 AS b_ts, 0 AS ts, 'DD' AS pt;
   ```
   
   The result should look like this: only the columns related to `b` have been 
updated, and the `a` columns remain unchanged.
   ```sql
   spark-sql> select * from test_db.hudi_payload_test_03;
   20220622111939468       20220622111939468_0_1   public_id:1     pt=DD   
214f6985-fee5-4091-a65d-d52e9eb20634-0_0-30-2209_20220622111939468.parquet      
1       a_101   101     b_105_New_record     105     0       DD
   Time taken: 0.496 seconds, Fetched 1 row(s)
   ```
   
   If we then upsert a smaller value in the `b_ts` field (99 < 105), nothing 
changes: neither the null fields nor the fields that carry values overwrite 
the stored record.
   
   ```sql
   INSERT INTO test_db.hudi_payload_test_03
   SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_99_Some_Record' AS 
b_info, 99 AS b_ts, 0 AS ts, 'DD' AS pt;
   ```
   
   ```sql
   spark-sql> select * from test_db.hudi_payload_test_03;
   20220622112743351       20220622112743351_0_1   public_id:1     pt=DD   
214f6985-fee5-4091-a65d-d52e9eb20634-0_0-69-4422_20220622112743351.parquet      
1       a_101   101     b_105_New_record     105     0       DD
   Time taken: 0.501 seconds, Fetched 1 row(s)
   ```
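
   To make the rule behind the three queries above explicit, here is a minimal 
Python sketch of the per-group comparison. It is only an illustration of the 
semantics, not the payload code from the RFC: `FIELD_GROUPS`, 
`merge_by_event_time`, the mapping of columns to event-time fields, and the 
strict `>` on ties are all assumptions made for this example.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]

# Hypothetical grouping: each event-time field guards the columns listed with it.
# How the real feature maps columns to event-time fields is not shown in this
# comment, so this mapping is an assumption for illustration only.
FIELD_GROUPS: Dict[str, List[str]] = {
    "a_ts": ["a_info"],
    "b_ts": ["b_info"],
}


def merge_by_event_time(
    current: Record,
    incoming: Record,
    groups: Dict[str, List[str]] = FIELD_GROUPS,
) -> Record:
    """Return a copy of `current` where each group with a newer incoming
    event time has been replaced by the incoming values."""
    merged = dict(current)
    for ts_field, columns in groups.items():
        new_ts = incoming.get(ts_field)
        old_ts = current.get(ts_field)
        # A null incoming event time means "no update for this group".
        if new_ts is None:
            continue
        # Assumption: ties keep the stored values (strict comparison).
        if old_ts is None or new_ts > old_ts:
            merged[ts_field] = new_ts
            for col in columns:
                merged[col] = incoming.get(col)
    return merged


# The stored record from the first query.
row = {"public_id": 1, "a_info": "a_101", "a_ts": 101,
       "b_info": "b_101", "b_ts": 101, "ts": 0, "pt": "DD"}

# First upsert: newer b_ts, all a-related fields null.
newer_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_105_New_record", "b_ts": 105, "ts": 0, "pt": "DD"}

# Second upsert: older b_ts, all a-related fields null.
older_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_99_Some_Record", "b_ts": 99, "ts": 0, "pt": "DD"}

after_first = merge_by_event_time(row, newer_b)
after_second = merge_by_event_time(after_first, older_b)
print(after_first)
print(after_second)
```

   In this sketch `after_first` carries `b_105_New_record` / `105` next to the 
untouched `a_101` / `101`, and `after_second` is identical to `after_first`, 
which matches the two query results shown above.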
   
   By using this feature, data developers can combine several tables into one 
table (and keep everything up to date through streaming ingestion), and then 
use only this table for further work such as ML algorithms, AI training, or 
big-screen visualization. This makes such downstream work faster and simpler, 
because it reads a single up-to-date table.
   
   Hope my example is useful for understanding the feature provided by our RFC 
:) @yihua @prasannarajaperumal @alexeykudinkin 


