alexeykudinkin commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r980381575
##########
rfc/rfc-51/rfc-51.md:
##########
@@ -148,20 +155,46 @@ hudi_cdc_table/
 Under a partition directory, the `.log` file with `CDCBlock` above will keep the changing data we have to materialize.
-There is an option to control what data is written to `CDCBlock`, that is `hoodie.table.cdc.supplemental.logging`. See the description of this config above.
+#### Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in case of Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing.
+
+Write-on-compaction can always persist CDC info and achieve standardization of implementation logic across engines,
+however, some delays are added to the CDC query results. Based on the business requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.
-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the corresponding CDC data is also persisted.

Review Comment:
   @xushiyan can you elaborate on "on-the-fly inference"? I don't think I saw it mentioned anywhere in the RFC.

   I still don't fully understand how we're going to tackle the issue of CDC records being deferred until the compactor runs. What if the compactor isn't even set up to run for the table?
