hughfdjackson commented on issue #1979:
URL: https://github.com/apache/hudi/issues/1979#issuecomment-680685136


   Hi @bvaradar -  
   
   > In general getting incremental read to discard duplicates is not possible 
for MOR table types as we defer the merging of records to compaction.
   
   That's interesting - as your comment suggests, I've only looked at CoW 
tables in any depth.  I look forward to delving into MoR's design in a bit more 
detail so I can get my head around what the implications of such a feature 
would be there + understand your comment better. 
   
   > I was thinking about alternate ways to achieve your use-case for COW table 
by using an application level boolean flag. Let me know if this makes sense:
   > 
   >     Introduce additional boolean column "changed". Default Value is false.
   >     Have your own implementation of HoodieRecordPayload plugged-in.
   >     3a In HoodieRecordPayload.getInsertValue(), return an avro record with 
changed = true. This function is called first time when the new record is 
inserted.
   >     3(b) In HoodieRecordPayload.combineAndGetUpdateValue(), if you 
determine, there is no material change, set changed = false else set it to true.
   > 
   > In your incremental query, add the filter changed = true to filter out 
those without material changes ?
   
   That does make sense, although I think a boolean column may lead to missing 
changes if the incremental read spans two or more commits to the same row.  I'm 
spiking a variation on that suggestion with my team, wherein: 
   
   1. Introduce a `last_updated_timestamp` column, defaulting to null (i.e. the 
update was in this commit)
   2. Have your own implementation of HoodieRecordPayload plugged-in.
   3. a. In HoodieRecordPayload.getInsertValue(), return an avro record with 
last_updated_timestamp = null.*
   3. b. In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine 
there is no material change, set last_updated_timestamp to that of the old 
record (if it is non-null) _or_ otherwise to the old record's commit_time. 
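
   To make the steps above concrete, here is a rough plain-Python model of the merge logic. It is only a sketch and does not use the real Hudi API: records are dicts, the function names mirror the HoodieRecordPayload methods only by analogy, `write` is a hypothetical stand-in for Hudi stamping `_hoodie_commit_time`, and resetting the column to null on a material change is my assumption about the intended behaviour.

```python
from typing import Optional

COMMIT_TIME = "_hoodie_commit_time"       # Hudi metadata column
LAST_UPDATED = "last_updated_timestamp"   # application-level column


def get_insert_value(incoming: dict) -> dict:
    """Step 3a: a first-time insert carries a null last_updated_timestamp."""
    return {**incoming, LAST_UPDATED: None}


def is_material_change(stored: dict, incoming: dict) -> bool:
    """Hypothetical change check: compare every field except the bookkeeping ones."""
    keys = (stored.keys() | incoming.keys()) - {COMMIT_TIME, LAST_UPDATED}
    return any(stored.get(k) != incoming.get(k) for k in keys)


def combine_and_get_update_value(stored: dict, incoming: dict) -> dict:
    """Step 3b: carry the last real update time forward on a no-op rewrite."""
    merged = dict(incoming)
    if is_material_change(stored, incoming):
        # Assumption: a real change is marked with null, like an insert.
        merged[LAST_UPDATED] = None
    else:
        previous = stored.get(LAST_UPDATED)
        merged[LAST_UPDATED] = previous if previous is not None else stored[COMMIT_TIME]
    return merged


def write(stored: Optional[dict], incoming: dict, commit_time: str) -> dict:
    """Hypothetical stand-in for the writer, which stamps the commit time."""
    if stored is None:
        record = get_insert_value(incoming)
    else:
        record = combine_and_get_update_value(stored, incoming)
    record[COMMIT_TIME] = commit_time
    return record


v1 = write(None, {"id": 1, "name": "a"}, "001")  # insert
v2 = write(v1, {"id": 1, "name": "a"}, "002")    # rewrite, nothing changed
v3 = write(v2, {"id": 1, "name": "a"}, "003")    # rewrite, nothing changed
v4 = write(v3, {"id": 1, "name": "b"}, "004")    # material change
```

   In this model `v2` and `v3` both carry `last_updated_timestamp = "001"` (the commit of the original insert), while `v1` and `v4` carry null.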
   
   In the incremental query, we're filtering for `null` (which indicates that 
one of the commits within the timeline last updated the record) or for 
`last_updated_timestamp` within the beginInstant and endInstant bounds. 
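
   As an illustration of that filter, and of why a plain boolean can miss a change, here is a small Python sketch. It assumes the incremental read returns only the latest version of each row, that instants compare as strings, and that beginInstant is exclusive while endInstant is inclusive; `materially_changed` and the sample row are made up for the example.

```python
from typing import Optional


def materially_changed(last_updated: Optional[str],
                       begin_instant: str,
                       end_instant: str) -> bool:
    """Keep a row if it was really updated inside (begin_instant, end_instant]."""
    if last_updated is None:
        # Null means the commit that wrote this version also changed it.
        return True
    return begin_instant < last_updated <= end_instant


# Latest version of a row that was inserted at "001", materially changed at
# "002", and rewritten without any change at "003".
row = {
    "_hoodie_commit_time": "003",
    "changed": False,                  # boolean-flag variant
    "last_updated_timestamp": "002",   # timestamp variant
}

# Incremental read over (001, 003]: the change made at "002" should be seen.
kept_by_boolean = row["changed"]
kept_by_timestamp = materially_changed(row["last_updated_timestamp"], "001", "003")

# Incremental read over (002, 003]: only the no-op rewrite falls in range.
kept_in_later_window = materially_changed(row["last_updated_timestamp"], "002", "003")
```

   Here the boolean variant drops the row even though it changed at "002", whereas the timestamp variant keeps it for the first window and drops it for the second.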
   
   We've not tested it extensively, but it looks like a promising workaround so 
far. 
   
   ---
   
   \* It'd be 'cleaner' to set this equal to the commit time of the write, but 
in our HoodieRecordPayload class, that's not available unfortunately.  The 
'null means insert' + special case handling in 
HoodieRecordPayload.combineAndGetUpdateValue() is a work-around for that.



