[
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-7538:
---------------------------------
Description:
For the sake of consistency, we need to consolidate the changelog mode
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a
Debezium-style change log (currently supported on CoW for Spark/Flink)
|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (depending on the logging mode chosen)|No (the Debezium-style output is not what Flink needs, e.g.)|
|Changelog|Yes|low|low|Yes|
This proposal is to converge on "CDC" as the path going forward, with the
following changes incorporated to support existing users/usage of changelog.
The CDC format is the more general one in the database world. It offers
advantages like not requiring further downstream processing to, say, stitch
together +U and -U to update a downstream table. For example, when a field that
changed is a key in a downstream table, we need both +U and -U to compute the
updates.
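To make that concrete, here is a minimal sketch (plain Java with made-up names, not Hudi code) of a downstream per-key count that stays correct only because it sees both the retraction under the old key and the update under the new key:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a downstream table counting rows per key. It is correct
// only if it receives both -U (carrying the old key) and +U (carrying the
// new key) when the key field of a row changes.
public class DownstreamCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    public void apply(String op, String key) {
        switch (op) {
            case "I":
            case "+U":
                counts.merge(key, 1, Integer::sum);   // add under the (new) key
                break;
            case "D":
            case "-U":
                counts.merge(key, -1, Integer::sum);  // retract under the (old) key
                break;
        }
    }

    public int get(String key) {
        return counts.getOrDefault(key, 0);
    }
}
```

If a row's key moves from "SF" to "NY" and only the +U were emitted, the count under "SF" would never be decremented.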
(A) Introduce a new "changelog" output mode for CDC queries, which generates the
I, +U, -U, D format that changelog consumers need (this can be constructed easily
by post-processing the output of a CDC query as follows):
* when before is `null`, emit I
* when after is `null`, emit D
* when both are non-null, emit two records +U and -U
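The three rules above can be sketched as follows (illustrative Java; the op-code strings and the before/after parameters are assumptions, not Hudi's actual CDC classes). The sketch uses the conventional Flink ordering of emitting -U before +U:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the CDC -> changelog conversion described above.
// "before" and "after" stand in for the before/after images carried by a
// CDC record; the real Hudi types and field names may differ.
public class ChangelogConverter {

    /** Returns the changelog op codes to emit for one CDC record. */
    public static List<String> toChangelogOps(Object before, Object after) {
        List<String> ops = new ArrayList<>();
        if (before == null) {
            ops.add("I");      // no prior image: this is an insert
        } else if (after == null) {
            ops.add("D");      // no new image: this is a delete
        } else {
            ops.add("-U");     // retract the old row image...
            ops.add("+U");     // ...then emit the updated row image
        }
        return ops;
    }
}
```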
(B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop
publishing to the _hoodie_operation field.
# This means anyone querying this field using a snapshot query will break.
# We will bring this back in 1.1 etc., based on user feedback, as a hidden field
in the FlinkCatalog.
(C) To support backwards compatibility, we fall back to reading
`_hoodie_operation` in 0.x tables.
For CDC reads, we first use the CDC log if it is available for that file slice.
If not, and the base file schema already has {{_hoodie_operation}}, we fall back
to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. Throw an
error for other modes.
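The fallback could be resolved roughly as below (a sketch only; the class, method, and flag names are illustrative stand-ins, not Hudi's actual API):

```java
// Illustrative sketch of the CDC read-path fallback described in (C).
// hasCdcLog / baseFileHasOperationField are hypothetical inputs a reader
// would derive from the file slice and the base file schema.
public class CdcReadPlanner {

    public enum Source { CDC_LOG, OPERATION_FIELD }

    public static Source resolve(boolean hasCdcLog,
                                 boolean baseFileHasOperationField,
                                 String loggingMode) {
        if (hasCdcLog) {
            return Source.CDC_LOG;            // prefer the .cdc log for this file slice
        }
        if (baseFileHasOperationField && "OP_KEY_ONLY".equals(loggingMode)) {
            return Source.OPERATION_FIELD;    // 0.x table: read _hoodie_operation from base file
        }
        throw new IllegalStateException(
            "No CDC log for file slice, and mode " + loggingMode + " cannot fall back");
    }
}
```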
(D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that
have `_hoodie_operation` published.
This is already completed for Spark, so the others should be easy to do.
(E) We need to complete a review of the CDC schema: should `ts` be the
completion time or the instant time?
was:
For the sake of consistency, we need to consolidate the changelog mode
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a
Debezium-style change log (currently supported on CoW for Spark/Flink)
|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (depending on the logging mode chosen)|No (the Debezium-style output is not what Flink needs, e.g.)|
|Changelog|Yes|low|low|Yes|
This proposal is to converge on "CDC" as the path going forward, with the
following changes to incorporate.
> Consolidate the CDC Formats (changelog format, RFC-51)
> ------------------------------------------------------
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
> Issue Type: Improvement
> Components: storage-management
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of consistency, we need to consolidate the changelog mode
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is
> a Debezium-style change log (currently supported on CoW for Spark/Flink)
>
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (depending on the logging mode chosen)|No (the Debezium-style output is not what Flink needs, e.g.)|
> |Changelog|Yes|low|low|Yes|
> This proposal is to converge on "CDC" as the path going forward, with the
> following changes incorporated to support existing users/usage of changelog.
> The CDC format is the more general one in the database world. It offers
> advantages like not requiring further downstream processing to, say, stitch
> together +U and -U to update a downstream table. For example, when a field
> that changed is a key in a downstream table, we need both +U and -U to
> compute the updates.
>
> (A) Introduce a new "changelog" output mode for CDC queries, which generates
> the I, +U, -U, D format that changelog consumers need (this can be constructed
> easily by post-processing the output of a CDC query as follows):
> * when before is `null`, emit I
> * when after is `null`, emit D
> * when both are non-null, emit two records +U and -U
> (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop
> publishing to the _hoodie_operation field.
> # This means anyone querying this field using a snapshot query will break.
> # We will bring this back in 1.1 etc., based on user feedback, as a hidden
> field in the FlinkCatalog.
> (C) To support backwards compatibility, we fall back to reading
> `_hoodie_operation` in 0.x tables.
> For CDC reads, we first use the CDC log if it is available for that file
> slice. If not, and the base file schema already has {{_hoodie_operation}}, we
> fall back to reading {{_hoodie_operation}} from the base file if
> mode=OP_KEY_ONLY. Throw an error for other modes.
> (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that
> have `_hoodie_operation` published.
> This is already completed for Spark, so the others should be easy to do.
>
> (E) We need to complete a review of the CDC schema: should `ts` be the
> completion time or the instant time?
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)