[
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-2299:
---------------------------------
Sprint: Hudi-Sprint-Jan-3
> The log format DELETE block lose the info orderingVal
> -----------------------------------------------------
>
> Key: HUDI-2299
> URL: https://issues.apache.org/jira/browse/HUDI-2299
> Project: Apache Hudi
> Issue Type: Bug
> Components: Common Core
> Reporter: Danny Chen
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0
>
>
> The append handle now always writes the data block first and then the delete
> block, and the delete block only keeps the hoodie keys. When reading, the
> scanner just reads the DELETE block without any ordering value information.
> Thus, if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> Finally, the insert message is deleted! This is a critical bug for
> streaming writes; we should fix it as soon as possible.
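To make the failure concrete, here is a minimal sketch in Python. It is not Hudi code: `apply_blocks` and the tuple-based block layout are hypothetical names chosen for illustration. It replays a data block followed by a delete block for the two records above, first with a key-only delete as described, then with a delete that carries an ordering value.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical block model: ("data", [(key, ordering_val), ...]) or
# ("delete", [(key, ordering_val_or_None), ...]). Not Hudi's actual format.
Block = Tuple[str, List[Tuple[int, Optional[int]]]]


def apply_blocks(blocks: List[Block]) -> Dict[int, int]:
    """Replay blocks in order and return the surviving {key: ordering_val}."""
    state: Dict[int, int] = {}
    for kind, entries in blocks:
        for key, ordering_val in entries:
            if kind == "data":
                # Keep the record with the highest ordering value.
                if key not in state or ordering_val >= state[key]:
                    state[key] = ordering_val
            elif key in state:
                # A key-only delete (ordering_val is None) always wins;
                # an ordering-aware delete only wins if it is not older.
                if ordering_val is None or ordering_val >= state[key]:
                    del state[key]
    return state


insert = ("data", [(0, 2)])                # insert: {id: 0, ts: 2}
key_only_delete = ("delete", [(0, None)])  # delete block without ts
ordered_delete = ("delete", [(0, 1)])      # delete: {id: 0, ts: 1}

buggy = apply_blocks([insert, key_only_delete])
fixed = apply_blocks([insert, ordered_delete])
print(buggy)  # {}
print(fixed)  # {0: 2}
```

The sketch only covers a delete that is read after the data for the same key; a delete read first would need a tombstone, which is left out here.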
> _*Here is the discussion on Slack*_:
> Danny Chan 12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug in the MOR log format: if there are
> out-of-order DELETEs in the streaming messages, the event time of the DELETEs
> is totally ignored.
> 12:44
> I guess this should be a blocker for 0.9 because it affects the correctness of
> the data set.
> vc 12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan 12:46 PM
> I tried to fix it but could not figure out a good way: if the DELETE block
> records the orderingVal, the format breaks compatibility.
> vc 1:05 PM
> We can version the format. That's doable. Should we precombine before even
> logging the deletes?
> Danny Chan 1:11 PM
> Yes, we should
> vc 1:26 PM
> I think that's how it's working today. Deletes don't have an ordering val per
> se, right?
> 1:28
> Delete block at t1 :
> delete key k
> Data block at t2 :
> ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For the kind of functionality you need, we need to do soft deletes, i.e.
> updates with an ordering value, instead of hard deletes
> 1:28
> makes sense?
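The soft-delete idea can be sketched as follows, as an illustration only. A delete becomes an ordinary update that carries an ordering value plus a tombstone flag (`is_deleted` is a hypothetical field name, and `precombine` and `visible` are hypothetical helpers), so merging by ordering value treats it like any other record.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    key: int
    ordering_val: int
    is_deleted: bool = False  # hypothetical soft-delete flag


def precombine(records: Iterable[Record]) -> Dict[int, Record]:
    """Keep, per key, the record with the highest ordering value."""
    latest: Dict[int, Record] = {}
    for rec in records:
        current = latest.get(rec.key)
        if current is None or rec.ordering_val >= current.ordering_val:
            latest[rec.key] = rec
    return latest


def visible(latest: Dict[int, Record]) -> List[Record]:
    """Hide keys whose winning version is a soft delete."""
    return [r for r in latest.values() if not r.is_deleted]


# A late delete (ts 1) loses to the insert (ts 2); a newer delete (ts 3) wins.
late_delete = [Record(0, 2), Record(0, 1, is_deleted=True)]
real_delete = [Record(0, 2), Record(0, 3, is_deleted=True)]
print(visible(precombine(late_delete)))
print(visible(precombine(real_delete)))
```

As noted later in the thread, this assumes the source schema can carry such a flag, which a CDC source may not provide.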
> Danny Chan 1:32 PM
> we can but that’s not the perfect solution, especially if the dataset comes
> from a CDC source, for example the MySQL binlog. There is no extra flag in
> schema for soft delete though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE: even if we do a
> soft DELETE, the event time (orderingVal) is still important for consumers
> for versioning.
> vc 1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> let's do a 0.9.1 after this?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan 1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc 1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event
> time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion; you can just define a new
> version for the delete block alone, e.g.
> 2:00
> and add more information
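The versioning suggestion could look roughly like the sketch below. It is a hypothetical JSON layout written in Python for brevity, not Hudi's actual delete block encoding or the `HoodieDeleteBlockVersion` API. Version 1 stores keys only, version 2 stores key and ordering value pairs, and the reader dispatches on the version so that older blocks stay readable.

```python
import json
from typing import List, Optional, Tuple

# (record key, ordering value or None when the block did not store one)
DeleteEntry = Tuple[str, Optional[int]]


def write_delete_block(entries: List[DeleteEntry], version: int) -> bytes:
    """Serialize a delete block in the requested (hypothetical) version."""
    if version == 1:
        payload = [key for key, _ in entries]  # keys only
    elif version == 2:
        payload = [[key, ordering_val] for key, ordering_val in entries]
    else:
        raise ValueError(f"unsupported delete block version: {version}")
    return json.dumps({"version": version, "payload": payload}).encode("utf-8")


def read_delete_block(raw: bytes) -> List[DeleteEntry]:
    """Decode either version; version 1 yields None ordering values."""
    block = json.loads(raw.decode("utf-8"))
    version = block["version"]
    if version == 1:
        return [(key, None) for key in block["payload"]]
    if version == 2:
        return [(key, ordering_val) for key, ordering_val in block["payload"]]
    raise ValueError(f"unsupported delete block version: {version}")


old_block = write_delete_block([("k", 1)], version=1)
new_block = write_delete_block([("k", 1)], version=2)
print(read_delete_block(old_block))  # [('k', None)]
print(read_delete_block(new_block))  # [('k', 1)]
```

A reader that gets `None` for the ordering value can fall back to the existing key-only behavior, which is one way to keep old log files compatible.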
--
This message was sent by Atlassian Jira
(v8.20.1#820001)