prasannarajaperumal commented on PR #5885: URL: https://github.com/apache/hudi/pull/5885#issuecomment-1186159978
Hey @YannByron,

Thanks for this PR and the well-written [RFC-51](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md). Overall I agree with the high-level direction, and I will do the code review soon. I have a question before that: should we introduce a new concept (CDC) here on Hudi tables? I think this should be a sub-mode of Incremental Query.

For illustration, suppose we have something like the following modes for incremental query (change stream):

- LATEST_STATE_INSERT_DELETE_KEYS (entire row state for all inserted keys and empty delete keys?)
- LATEST_STATE_ONLY_INSERT_KEYS
- MIN_STATE_CHANGE_INSERT_DELETE_KEYS (only the columns changed; consolidate multiple inserts and deletes, or remove data inserted and deleted within the time range)
- ALL_STATE_CHANGES_INSERT_DELETE_KEYS (include every single change made to the key)

I think read-schema changes for the CDC-style incremental queries could be a challenge.

The reasons I think of converging the incremental queries with RFC-51 are:

- It removes the limitation of tracking deletes across compaction boundaries for incremental queries.
- I think it just makes sense for us to track the data we track when "cdc.supplemental.logging=false" by default for all Hudi tables. Having this data stored efficiently for point lookups will help with record merging as well, I suppose?

@YannByron What do you think? (cc @vinothchandar)

Cheers
Prasanna
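
To make the four proposed modes a bit more concrete, here is a rough Python sketch of one possible reading of them. It is not Hudi code: the `Mode` enum, the `incremental_read` function, the `before` snapshot, and the `(key, row)` change format (with `None` standing for a delete) are all hypothetical and exist only for illustration. The exact semantics of each mode are an assumption based on the short descriptions above.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical names taken from the proposal; not an existing Hudi API.
    LATEST_STATE_INSERT_DELETE_KEYS = 1
    LATEST_STATE_ONLY_INSERT_KEYS = 2
    MIN_STATE_CHANGE_INSERT_DELETE_KEYS = 3
    ALL_STATE_CHANGES_INSERT_DELETE_KEYS = 4


def incremental_read(before, changes, mode):
    """Return the change stream for one time range under the given mode.

    before:  dict key -> row (dict of column -> value) at the start of the range.
    changes: ordered list of (key, row); row is None when the key was deleted.
    """
    if mode is Mode.ALL_STATE_CHANGES_INSERT_DELETE_KEYS:
        # Every single change, in commit order.
        return list(changes)

    # Collapse the range to the last change seen for each key.
    latest = {}
    for key, row in changes:
        latest[key] = row

    if mode is Mode.LATEST_STATE_INSERT_DELETE_KEYS:
        # Full latest row for live keys, None (empty) for deleted keys.
        return list(latest.items())

    if mode is Mode.LATEST_STATE_ONLY_INSERT_KEYS:
        # Only keys that still exist at the end of the range.
        return [(k, r) for k, r in latest.items() if r is not None]

    # MIN_STATE_CHANGE_INSERT_DELETE_KEYS (assumed semantics):
    # emit only changed columns, and drop keys that were both
    # inserted and deleted inside the range.
    out = []
    for key, row in latest.items():
        old = before.get(key)
        if row is None:
            if old is not None:
                out.append((key, None))
        else:
            old = old or {}
            diff = {c: v for c, v in row.items() if c not in old or old[c] != v}
            if diff:
                out.append((key, diff))
    return out


before = {"a": {"x": 1, "y": 2}, "b": {"x": 5, "y": 6}}
changes = [
    ("a", {"x": 1, "y": 3}),
    ("c", {"x": 7, "y": 8}),
    ("c", None),
    ("b", None),
    ("a", {"x": 1, "y": 4}),
]
minimal = incremental_read(before, changes, Mode.MIN_STATE_CHANGE_INSERT_DELETE_KEYS)
```

In this sketch, `minimal` holds only the changed column for key `a` and a delete marker for key `b`, while key `c` (inserted and deleted within the range) disappears. Note that the minimal mode needs the prior state of each key, which is the kind of data the point-lookup storage mentioned above would have to provide.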
