[
https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-7229:
---------------------------------
Description:
OLTP workloads on upstream databases often update, delete, or insert different
columns in the table on each operation. Currently, Hudi can only support
partial updates in cases where the same columns are mutated in a given
write to Hudi (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements).
Here, we explore what it takes to support a smarter storage format that encodes
only the changed columns into the log, along with the different implementations.
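
To make the idea concrete, here is a minimal sketch. It is not Hudi's actual log format or merger API: it only assumes that each log entry carries the columns changed by one operation and that a reader folds the entries onto the base record in order. The names PartialUpdate and merge_partial_updates, the integer ordering field, and the use of plain dictionaries are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class PartialUpdate:
    """Hypothetical log entry: holds only the columns changed by one operation."""
    ordering: int              # assumed ordering value, e.g. a commit sequence number
    changed: Dict[str, Any]    # column name -> new value (unchanged columns are absent)
    is_delete: bool = False


def merge_partial_updates(base: Optional[Dict[str, Any]],
                          log: List[PartialUpdate]) -> Optional[Dict[str, Any]]:
    """Fold partial updates onto the base record, lowest ordering value first."""
    record = dict(base) if base is not None else None
    for entry in sorted(log, key=lambda e: e.ordering):
        if entry.is_delete:
            record = None                      # a delete drops the whole record
        elif record is None:
            record = dict(entry.changed)       # an insert, or a re-insert after a delete
        else:
            record.update(entry.changed)       # overwrite only the changed columns
    return record


# Two operations touch different columns of the same record.
base = {"id": 1, "name": "a", "price": 10, "qty": 5}
log = [PartialUpdate(2, {"qty": 4}), PartialUpdate(1, {"price": 12})]
merged = merge_partial_updates(base, log)
```

In this sketch the two log entries together store two column values instead of two full four-column records, which is the storage saving the goals below refer to.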
h2. Goals
# Enable partial update functionality for all existing and potential future
CDC workloads without significant modification or duplication.
# Achieve performance parity with current full-record updates or partial
updates across the same set of columns.
# Reduce storage costs by storing only the changed columns.
# Reduce computation costs by scanning and processing less data.
# Do not affect the scalability of the existing ingestion system.
The number of files generated for partial updates should not increase
dramatically.
was:DMS, Debezium, etc.
> Enable partial updates for CDC work payload
> -------------------------------------------
>
> Key: HUDI-7229
> URL: https://issues.apache.org/jira/browse/HUDI-7229
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Lin Liu
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.1.0
>
>
>