[ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2299:
---------------------------------
    Sprint: Hudi-Sprint-Jan-3

> The log format DELETE block lose the info orderingVal
> -----------------------------------------------------
>
>                 Key: HUDI-2299
>                 URL: https://issues.apache.org/jira/browse/HUDI-2299
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Common Core
>            Reporter: Danny Chen
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> The append handle now always writes the data block first and then the delete
> block, and the delete block keeps only the hoodie keys. When reading, the
> scanner reads the DELETE block without any ordering value information. Thus,
> if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message ends up deleted! This is a critical bug for streaming
> writes, and we should fix it as soon as possible.
> _*Here is the discussion on slack*_:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug in the MOR log format: if there are
> out-of-order DELETEs in the streaming messages, the event time of the DELETEs
> is totally ignored.
> 12:44
> I guess this should be a blocker for 0.9 because it affects the correctness of
> the data set.
> vc  12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I tried to fix it but could not figure out a good way: if the DELETE block
> records the orderingVal, the format breaks compatibility.
> vc  1:05 PM
> We can version the format. That's doable. Should we precombine before even
> logging the deletes?
> Danny Chan  1:11 PM
> Yes, we should
> vc  1:26 PM
> I think that's how it's working today. Deletes don't have an ordering val per
> se, right?
> 1:28
> Delete block at t1 :
>   delete key k
> Data block at t2 :
>   ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For the kind of functionality you need, we need to do soft deletes, i.e.
> updates with an ordering value, instead of hard deletes
> 1:28
> makes sense?
> Danny Chan  1:32 PM
> we can but that’s not the perfect solution, especially if the dataset comes 
> from a CDC source, for example the MySQL binlog. There is no extra flag in 
> schema for soft delete though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE: even if we do a
> soft DELETE, the event time (orderingVal) is still important for consumers
> for versioning.
> vc  1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> lets do a 0.9.1 after this ?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan  1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc  1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event 
> time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion: you can just define a new
> version for the delete block alone, e.g.
> 2:00
> and add more information
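
The difference between arrival-time and event-time semantics discussed in the quoted thread can be illustrated with a small, self-contained Python sketch. This is not Hudi code: `LogRecord`, `merge_arrival_order`, and `merge_event_order` are hypothetical names invented for this example, and the merge rules are a simplified assumption about what an ordering-aware reader might do. The first function ignores the ordering value of deletes, as the issue describes; the second keeps, per key, the record with the highest ordering value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LogRecord:
    """One record replayed from a log (hypothetical model, not a Hudi class)."""
    key: Any
    ordering_val: int
    is_delete: bool
    payload: Optional[dict] = None


def merge_arrival_order(records: List[LogRecord]) -> Dict[Any, dict]:
    """Apply records in read order and ignore ordering_val: a later DELETE always wins."""
    state: Dict[Any, dict] = {}
    for rec in records:
        if rec.is_delete:
            state.pop(rec.key, None)
        else:
            state[rec.key] = rec.payload
    return state


def merge_event_order(records: List[LogRecord]) -> Dict[Any, dict]:
    """Keep, per key, the record with the highest ordering_val (ties go to the later record)."""
    latest: Dict[Any, LogRecord] = {}
    for rec in records:
        current = latest.get(rec.key)
        if current is None or rec.ordering_val >= current.ordering_val:
            latest[rec.key] = rec
    # A key whose winning record is a DELETE is dropped from the result.
    return {k: r.payload for k, r in latest.items() if not r.is_delete}


# The two records from the issue description.
insert = LogRecord(key=0, ordering_val=2, is_delete=False, payload={"id": 0, "ts": 2})
delete = LogRecord(key=0, ordering_val=1, is_delete=True)

print(merge_arrival_order([insert, delete]))  # the insert is lost
print(merge_event_order([insert, delete]))    # the insert survives
```

With the two records from the issue, the arrival-order merge returns an empty result while the event-order merge keeps `{id: 0, ts: 2}`, which is the behavior the reporter expects for out-of-order deletes.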

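The suggestion in the thread to version the delete block so that it can carry more information can be sketched as follows. This is only an illustration under stated assumptions: the version constants, the function names, and the JSON encoding are all hypothetical and do not reflect Hudi's actual on-disk format or the `HoodieDeleteBlockVersion` API. The point is that a reader can dispatch on a version field, so old keys-only blocks stay readable while new blocks also carry an ordering value.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical version numbers, chosen only for this sketch.
KEYS_ONLY_VERSION = 1      # block stores record keys only
WITH_ORDERING_VERSION = 2  # block stores (key, ordering value) pairs


def encode_delete_block(entries: List[Tuple[str, int]], version: int) -> bytes:
    """Serialize delete entries; version 1 drops the ordering value, version 2 keeps it."""
    if version == KEYS_ONLY_VERSION:
        body = [key for key, _ in entries]
    elif version == WITH_ORDERING_VERSION:
        body = [[key, ordering_val] for key, ordering_val in entries]
    else:
        raise ValueError(f"unknown delete block version: {version}")
    return json.dumps({"version": version, "deletes": body}).encode("utf-8")


def decode_delete_block(data: bytes) -> List[Tuple[str, Optional[int]]]:
    """Read either version; keys-only blocks yield None as the ordering value."""
    block = json.loads(data.decode("utf-8"))
    version = block["version"]
    if version == KEYS_ONLY_VERSION:
        return [(key, None) for key in block["deletes"]]
    if version == WITH_ORDERING_VERSION:
        return [(key, ordering_val) for key, ordering_val in block["deletes"]]
    raise ValueError(f"unknown delete block version: {version}")


old_block = encode_delete_block([("k", 1)], KEYS_ONLY_VERSION)
new_block = encode_delete_block([("k", 1)], WITH_ORDERING_VERSION)
print(decode_delete_block(old_block))
print(decode_delete_block(new_block))
```

A reader built this way would still have to decide how to treat a `None` ordering value from old blocks, for example by falling back to commit-time ordering; that policy is left open here.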


--
This message was sent by Atlassian Jira
(v8.20.1#820001)
