[ 
https://issues.apache.org/jira/browse/HUDI-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-8248:
-----------------------------------------

    Assignee: sivabalan narayanan

> Fix LogRecord reader to account for rollback blocks with higher timestamps
> --------------------------------------------------------------------------
>
>                 Key: HUDI-8248
>                 URL: https://issues.apache.org/jira/browse/HUDI-8248
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: reader-core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> With LogRecordReader, we also configure a maxInstant time to read up to.
> Rollback blocks can sometimes have timestamps higher than the configured
> maxInstant, which might lead to data inconsistencies.
>  
> Let's go through an illustration:
> Say we have t1.dc and t2.dc, and t2.dc crashed midway.
> The current layout is:
> {{base file(t1), lf1(partially committed data w/ t2 as instant time)}}
>  
> Then, say, we start t5.dc. Just as t5.dc starts, Hudi detects the pending
> commit and triggers a rollback. This rollback gets an instant time of t6
> (t6.rb). Note that the rollback's commit time is greater than t5, the
> currently ongoing delta commit.
> So, once the rollback completes, this is the layout:
> {{base file, lf1(from t2.dc partially failed), lf3 (rollback command block 
> with t6).}}
>  
> And once t5.dc completes, this is how the layout looks:
> {{base file, lf1(from t2.dc partially failed), lf3 (rollback command block 
> with t6), lf4 (from t5)}}
>  
> At this point, when we trigger a snapshot read or tagLocation with a global
> index, maxInstant is set to the last entry in the commits timeline, which is
> t5. So, while LogRecordReader is processing the log blocks, when it reaches
> lf3 it detects that the timestamp t6 > t5 (i.e. the max instant time) and
> bails out of the for loop. So, in essence, it may not even read lf4 in the
> above scenario.
>  
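The scenario above can be reproduced with a small toy model. This is only a
sketch under stated assumptions, not Hudi's LogRecordReader: the LogBlock
class, both scan functions and the integer instants (t2 -> 2, t5 -> 5,
t6 -> 6) are hypothetical names made up for illustration. The second function
shows one conceivable way to keep honoring rollback blocks that are newer than
maxInstant; it is not necessarily the fix adopted for this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogBlock:
    """Hypothetical stand-in for a log block; not a Hudi class."""
    name: str                     # label from the illustration, e.g. "lf1"
    instant: int                  # instant time of the block (t2 -> 2)
    kind: str                     # "data" or "rollback"
    target: Optional[int] = None  # rollback blocks: instant being rolled back


# Layout from the illustration, in log order (the base file is left out).
BLOCKS = [
    LogBlock("lf1", 2, "data"),                 # partial data from failed t2.dc
    LogBlock("lf3", 6, "rollback", target=2),   # t6.rb rolling back t2
    LogBlock("lf4", 5, "data"),                 # data from t5.dc
]


def scan_with_break(blocks, max_instant):
    """Models the described behaviour: stop at the first block > max_instant."""
    read = []
    for block in blocks:
        if block.instant > max_instant:
            break  # bails out of the loop; later blocks are never visited
        if block.kind == "rollback":
            read = [b for b in read if b.instant != block.target]
        else:
            read.append(block)
    return [b.name for b in read]


def scan_honoring_rollbacks(blocks, max_instant):
    """Assumed alternative: always apply rollbacks, skip only newer data."""
    read = []
    for block in blocks:
        if block.kind == "rollback":
            read = [b for b in read if b.instant != block.target]
        elif block.instant <= max_instant:
            read.append(block)
    return [b.name for b in read]


print(scan_with_break(BLOCKS, max_instant=5))          # ['lf1']
print(scan_honoring_rollbacks(BLOCKS, max_instant=5))  # ['lf4']
```

In this model the first scan returns the partially written lf1 and never
reaches lf4, which is the inconsistency described above. The second scan drops
lf1 and keeps lf4.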



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
