hudi-bot opened a new issue, #16354:
URL: https://github.com/apache/hudi/issues/16354

   OLTP workloads on upstream databases often update, delete, or insert different 
columns of a table on each operation. Currently, Hudi supports partial updates 
only when the same columns are mutated in a given write to Hudi (e.g., Spark SQL 
ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes 
to support a smarter storage format that encodes only the changed columns into 
the log, along with the different possible implementations.
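
   To make the idea concrete, here is a minimal sketch, not Hudi's actual log format or API. It assumes each log entry carries only the columns changed by one operation, and that a reader folds the entries over the base record in commit order. The names `merge_partial_updates`, `base`, and `log_entries` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def merge_partial_updates(base: Row, log_entries: Iterable[Row]) -> Row:
    """Fold partial updates over a base row, oldest entry first.

    Each entry holds only the columns that one operation changed
    (hypothetical layout, not Hudi's on-disk format).
    """
    merged = dict(base)  # copy so the base row is left untouched
    for entry in log_entries:
        merged.update(entry)  # later entries win, column by column
    return merged


# Two operations that touch different column sets of the same record.
base = {"id": 1, "name": "a", "price": 10.0, "qty": 5}
log_entries = [{"price": 12.5}, {"qty": 4, "name": "b"}]
merged = merge_partial_updates(base, log_entries)
```

   In this sketch an absent key means "column unchanged", while a key mapped to `None` means "column set to null". A real format would need to keep that distinction as well.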
   ## Goals
   1. Enable partial update functionality for all existing and potential future 
CDC workloads without major modification or duplication.
   2. Achieve performance parity with current full-record updates or partial 
updates across the same set of columns.
   3. Reduce storage costs by storing only the changed columns.
   4. Reduce computation costs by scanning and processing less data.
   5. Do not affect the scalability of the existing ingestion system. The number 
of files generated for partial updates should not increase dramatically.
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-7229
   - Type: Task
   - Epic: https://issues.apache.org/jira/browse/HUDI-6242
   - Fix version(s):
     - 1.2.0
   
   
   ---
   
   
   ## Comments
   
   03/May/24 02:13;vinoth;Punting this to 1.1 
   
   
   1. [1.1] Implement support on top of data blocks.
      1. We need to pass the changed-column information and the operation all the 
way to the write handles, using a field in HoodieRecord.
      2. ...
   2. [1.1] Implement support on top of CDC data blocks.
      1. We can track similar bitmaps for CDC data blocks as well.
      2. We need to extend the new file group reader to also merge base and CDC 
blocks (not just base and data blocks).
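
   The changed-column bitmaps mentioned in this comment could look roughly like the following. This is only an illustration under assumptions: one bit per schema position, with the changed values stored in schema order. `SCHEMA`, `encode_partial`, and `decode_partial` are hypothetical names, and none of this reflects how HoodieRecord or the log blocks actually carry the information.

```python
from typing import Any, Dict, List, Tuple

SCHEMA: List[str] = ["id", "name", "price", "qty"]  # hypothetical table schema


def encode_partial(changes: Dict[str, Any], schema: List[str]) -> Tuple[int, List[Any]]:
    """Return (bitmap, values); bit i is set when schema[i] was changed."""
    unknown = set(changes) - set(schema)
    if unknown:
        raise KeyError(f"columns not in schema: {sorted(unknown)}")
    bitmap = 0
    values: List[Any] = []
    for i, col in enumerate(schema):
        if col in changes:
            bitmap |= 1 << i
            values.append(changes[col])  # values kept in schema order
    return bitmap, values


def decode_partial(bitmap: int, values: List[Any], schema: List[str]) -> Dict[str, Any]:
    """Rebuild the {column: value} map from a bitmap and its packed values."""
    it = iter(values)
    return {col: next(it) for i, col in enumerate(schema) if bitmap & (1 << i)}


bitmap, values = encode_partial({"qty": 4, "name": "b"}, SCHEMA)
decoded = decode_partial(bitmap, values, SCHEMA)
```

   A reader merging base and log blocks can then test the bitmap to decide which base columns to overwrite, without storing column names in every record.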


