[ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
---------------------------------
    Description: 
For the sake of consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51-based CDC feature, which is 
a Debezium-style change log (currently supported for CoW in Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging mode we choose)|No (the Debezium-style output is not what Flink needs, e.g.)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge on "CDC" as the path going forward, with the 
following changes incorporated to support existing users/usage of changelog. 
The CDC format is more generalized in the database world. It offers advantages 
like not requiring further downstream processing to, say, stitch together +U 
and -U to update a downstream table. For example, if a field that changed is a 
key in a downstream table, we need both the +U and -U images to compute the 
updates. 
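To make the +U/-U point concrete, here is a minimal, hypothetical Java sketch (not Hudi code; all names are illustrative) of a downstream table keyed on a mutable field. A count per category stays correct only because the -U retracts from the old key before the +U adds to the new key:

```java
import java.util.HashMap;
import java.util.Map;

public class DownstreamKeyUpdate {
    // Apply one changelog row to a count-per-category table.
    // "-U" retracts the old row image; "I" / "+U" add the new one.
    static void apply(Map<String, Integer> countsByCategory, String op, String category) {
        countsByCategory.merge(category, op.startsWith("-") ? -1 : 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        apply(counts, "I", "books");   // insert a record with category=books
        // The record's category (the downstream key) changes to "toys":
        apply(counts, "-U", "books");  // retract the old image from the old key
        apply(counts, "+U", "toys");   // apply the new image to the new key
        System.out.println(counts);    // books count drops to 0, toys becomes 1
    }
}
```

With only the after-image of the update, the stale row under the old key could not be retracted.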

 

(A) Introduce a new "changelog" output mode for CDC queries, which generates 
the I, +U, -U, D format that changelog needs (this can be constructed easily by 
processing the output of a CDC query as follows):
 * when before is `null`, emit I
 * when after is `null`, emit D
 * when both are non-null, emit two records, -U and +U
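The mapping above can be sketched as follows (a hypothetical Java sketch of the transformation, not actual Hudi APIs; the record and class names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CdcToChangelog {
    // Minimal stand-in for a Debezium-style CDC record: nullable before/after images.
    record CdcRecord(Object before, Object after) {}
    // One changelog row: an op code (I, -U, +U, D) plus the row image it carries.
    record ChangelogRow(String op, Object row) {}

    static List<ChangelogRow> toChangelog(CdcRecord r) {
        List<ChangelogRow> out = new ArrayList<>();
        if (r.before() == null) {
            out.add(new ChangelogRow("I", r.after()));    // no before image: insert
        } else if (r.after() == null) {
            out.add(new ChangelogRow("D", r.before()));   // no after image: delete
        } else {
            out.add(new ChangelogRow("-U", r.before()));  // retract the old image
            out.add(new ChangelogRow("+U", r.after()));   // emit the new image
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toChangelog(new CdcRecord(null, "row-v1")));     // single I row
        System.out.println(toChangelog(new CdcRecord("row-v1", "row-v2"))); // -U then +U
        System.out.println(toChangelog(new CdcRecord("row-v2", null)));     // single D row
    }
}
```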

(B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
publishing to the _hoodie_operation field.
 # This means anyone querying this field using a snapshot query will break.
 # We will bring this back in 1.1 etc., based on user feedback, as a hidden 
field in the FlinkCatalog.

(C) To support backwards compatibility, we fall back to reading 
`_hoodie_operation` in 0.x tables. 

For CDC reads, we first use the CDC log if it is available for that file slice. 
If not, and the base file schema already has {{_hoodie_operation}}, we fall 
back to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. 
Throw an error for other modes. 
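The fallback in (C) amounts to a small source-selection rule per file slice. A hypothetical Java sketch (illustrative names, not Hudi's actual classes; the logging-mode names are assumed to follow RFC-51's supplemental logging modes):

```java
public class CdcReadPlanner {
    enum CdcSource { CDC_LOG_FILE, BASE_FILE_OPERATION_FIELD }
    enum LoggingMode { OP_KEY_ONLY, DATA_BEFORE, DATA_BEFORE_AFTER }

    // Pick where to read change data from, for one file slice.
    static CdcSource resolve(boolean hasCdcLog, boolean baseHasOperationField, LoggingMode mode) {
        if (hasCdcLog) {
            return CdcSource.CDC_LOG_FILE;  // the .cdc log wins when present
        }
        if (baseHasOperationField && mode == LoggingMode.OP_KEY_ONLY) {
            // 0.x backwards-compat path: read _hoodie_operation from the base file
            return CdcSource.BASE_FILE_OPERATION_FIELD;
        }
        throw new IllegalStateException(
            "CDC read not supported without a .cdc log in mode=" + mode);
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, false, LoggingMode.DATA_BEFORE_AFTER)); // CDC_LOG_FILE
        System.out.println(resolve(false, true, LoggingMode.OP_KEY_ONLY));       // BASE_FILE_OPERATION_FIELD
    }
}
```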



(D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that 
have `_hoodie_operation` published. 

This is already completed for Spark, so the others should be easy to do. 

 

(E) We need to complete a review of the CDC schema.

{{ts}} - should it be the completion time or the instant time?

 

 

  was:
For the sake of consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51-based CDC feature, which is 
a Debezium-style change log (currently supported for CoW in Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging mode we choose)|No (the Debezium-style output is not what Flink needs, e.g.)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge on "CDC" as the path going forward, with the 
following changes to incorporate.

> Consolidate the CDC Formats (changelog format, RFC-51)
> ------------------------------------------------------
>
>                 Key: HUDI-7538
>                 URL: https://issues.apache.org/jira/browse/HUDI-7538
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: storage-management
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
