hughfdjackson commented on issue #1979: URL: https://github.com/apache/hudi/issues/1979#issuecomment-680685136
Hi @bvaradar -

> In general getting incremental read to discard duplicates is not possible for MOR table types as we defer the merging of records to compaction.

That's interesting - as your comment suggests, I've only looked at CoW tables in any depth. I look forward to delving into MoR's design in a bit more detail so I can get my head around what the implications of such a feature would be there + understand your comment better.

> I was thinking about alternate ways to achieve your use-case for COW table by using an application level boolean flag. Let me know if this makes sense:
>
> 1. Introduce additional boolean column "changed". Default Value is false.
> 2. Have your own implementation of HoodieRecordPayload plugged-in.
> 3. (a) In HoodieRecordPayload.getInsertValue(), return an avro record with changed = true. This function is called first time when the new record is inserted.
>    (b) In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine, there is no material change, set changed = false else set it to true.
>
> In your incremental query, add the filter changed = true to filter out those without material changes ?

That does make sense, although I think a boolean column may lead to missing changes if the incremental read spans two or more commits to the same row: for example, a real change in one commit followed by a no-op upsert in a later one would leave changed = false on the row the incremental read returns. I'm spiking a variation on that suggestion with my team, wherein:

1. Introduce a 'last_updated_timestamp' column, defaulting to null (i.e. the update was in this commit).
2. Have your own implementation of HoodieRecordPayload plugged in.
3. a. In HoodieRecordPayload.getInsertValue(), return an avro record with last_updated_timestamp = null.*
   b. In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine there is no material change, set last_updated_timestamp to that of the old record (if it exists) _or_ to the old record's commit_time.
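To make steps 3a and 3b concrete, here is a rough sketch of the merge rule in plain Python. It models records as dicts and is not the Hudi API: the function names only mirror the HoodieRecordPayload methods, `materially_equal` is a hypothetical helper, and I'm assuming the old record exposes its commit time via the `_hoodie_commit_time` metadata column.

```python
# Plain-Python model of the merge rule; records are dicts, not Avro records.
COMMIT_TIME = "_hoodie_commit_time"      # assumed: commit time stored on the old record
LAST_UPDATED = "last_updated_timestamp"  # the new application-level column
BOOKKEEPING = {COMMIT_TIME, LAST_UPDATED}


def get_insert_value(incoming: dict) -> dict:
    """Step 3a: a first-time insert carries last_updated_timestamp = None."""
    return {**incoming, LAST_UPDATED: None}


def materially_equal(old: dict, new: dict) -> bool:
    """Hypothetical helper: compare business columns only."""
    def business(record: dict) -> dict:
        return {k: v for k, v in record.items() if k not in BOOKKEEPING}
    return business(old) == business(new)


def combine_and_get_update_value(old: dict, incoming: dict) -> dict:
    """Step 3b: carry the old timestamp forward when nothing really changed."""
    if materially_equal(old, incoming):
        carried = old.get(LAST_UPDATED)
        if carried is None:
            # The old record was last changed by the commit that wrote it.
            carried = old[COMMIT_TIME]
        return {**incoming, LAST_UPDATED: carried}
    # A real change: None means "updated in the commit being written".
    return {**incoming, LAST_UPDATED: None}
```

The point of carrying the timestamp forward is that a chain of no-op upserts keeps pointing at the commit that last made a real change, which a single boolean can't express.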
In the incremental query, we keep rows where `last_updated_timestamp` is `null` (which indicates that one of the commits within the timeline last updated the record) or falls within the beginInstant and endInstant bounds.

We've not tested it extensively, but it looks like a promising workaround so far.

---

\* It'd be 'cleaner' to set this equal to the commit time of the write, but unfortunately that's not available in our HoodieRecordPayload class. The 'null means insert' + special-case handling in HoodieRecordPayload.combineAndGetUpdateValue() is a workaround for that.
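For illustration, the filter applied to the rows an incremental read returns could look roughly like this in plain Python. The function name and the row dicts are hypothetical, and I'm assuming instants are fixed-width timestamp strings that compare lexicographically, with the begin bound exclusive and the end bound inclusive; adjust if your bounds behave differently.

```python
from typing import Optional


def changed_in_window(last_updated: Optional[str],
                      begin_instant: str,
                      end_instant: str) -> bool:
    """Keep a row if it was materially changed inside the queried window."""
    if last_updated is None:
        # None means the commit that wrote this version made a real change,
        # and that commit is inside the window because the row was returned.
        return True
    # Assumed bounds: begin exclusive, end inclusive.
    return begin_instant < last_updated <= end_instant


# Hypothetical rows as returned by an incremental read over the window.
rows = [
    {"id": 1, "last_updated_timestamp": None},              # changed by its latest commit
    {"id": 2, "last_updated_timestamp": "20200826100000"},  # last real change inside window
    {"id": 3, "last_updated_timestamp": "20200825090000"},  # only no-op upserts in window
]
begin, end = "20200826000000", "20200826235959"
kept = [r["id"] for r in rows
        if changed_in_window(r["last_updated_timestamp"], begin, end)]
```

With these sample values, rows 1 and 2 are kept and row 3 is dropped.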
