[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-7538:
---------------------------------
Description:
For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style changelog (currently supported for CoW for Spark/Flink).

|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge on "CDC" as the path going forward, with the following changes incorporated to support existing users/usage of changelog. The CDC format is the more generalized one in the database world. It offers advantages like not requiring further downstream processing to, say, stitch together +U and -U to update a downstream table. For example, if a field that changed is a key in a downstream table, we need both +U and -U to compute the updates.

(A) Introduce a new "changelog" output mode for CDC queries, which generates the I, +U, -U, D format that changelog needs. This can be constructed easily by processing the output of a CDC query as follows:
 * when before is `null`, emit I
 * when after is `null`, emit D
 * when both are non-null, emit two records, +U and -U

(B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop publishing to the _hoodie_operation field.
 # This means anyone querying this field using a snapshot query will break.
 # We will bring this back in 1.1 etc., based on user feedback, as a hidden field in the FlinkCatalog.

(C) To support backwards compatibility, we fall back to reading `_hoodie_operation` in 0.X tables. For CDC reads, we first use the CDC log if it is available for that file slice. If not, and the base file schema already has {{_hoodie_operation}}, we fall back to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. Throw an error for other modes.
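The rules in (A) above can be sketched as a small Python function (a standalone illustration, not a Hudi API; the `before`/`after` field names follow the RFC-51 CDC payload shape, and `cdc_to_changelog` is a hypothetical name):

```python
# Sketch of the proposed (A) "changelog" output mode: map one CDC record
# (Debezium-style before/after images) to changelog-tagged rows.
def cdc_to_changelog(record):
    before = record.get("before")
    after = record.get("after")
    if before is None:
        return [("I", after)]             # no before-image: insert
    if after is None:
        return [("D", before)]            # no after-image: delete
    # both images present: retract the old row (-U), then emit the new one (+U)
    return [("-U", before), ("+U", after)]

# Example: an update to row id=1
rows = cdc_to_changelog({"before": {"id": 1, "v": "a"},
                         "after": {"id": 1, "v": "b"}})
```

The -U/+U ordering here (retraction first) matches the usual changelog-stream convention; the proposal text does not mandate an order.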
(D) Snapshot queries from Spark, Presto, Trino etc. all work with tables that have `_hoodie_operation` published. This is already completed for Spark, so the others should be easy to do.

(E) We need to complete a review of the CDC schema: should ts be the completion time or the instant time?

was:
For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style changelog (currently supported for CoW for Spark/Flink).

|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge on "CDC" as the path going forward, with the following changes to incorporate.

> Consolidate the CDC Formats (changelog format, RFC-51)
> ------------------------------------------------------
>
>                 Key: HUDI-7538
>                 URL: https://issues.apache.org/jira/browse/HUDI-7538
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: storage-management
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)