sivabalan narayanan created HUDI-8248:
-----------------------------------------

             Summary: Fix LogRecord reader to account for rollback blocks with 
higher timestamps
                 Key: HUDI-8248
                 URL: https://issues.apache.org/jira/browse/HUDI-8248
             Project: Apache Hudi
          Issue Type: Improvement
          Components: reader-core
            Reporter: sivabalan narayanan


With LogRecordReader, we also configure a maxInstant time up to which to read. 
Sometimes rollback blocks can have higher timestamps than the configured 
maxInstant, which might lead to data inconsistencies.

 

Let's go through an illustration:

Say we have t1.dc and t2.dc, and t2.dc crashed midway.
The current layout is:
{{base file(t1), lf1(partially committed data w/ t2 as instant time)}}
 
Then, say, we start t5.dc. Just as t5.dc starts, Hudi detects the pending 
commit and triggers a rollback. This rollback gets an instant time of t6 
(t6.rb). Note that the rollback's instant time is greater than t5, the 
currently ongoing delta commit.
So, once the rollback completes, this is the layout:
{{base file, lf1(from t2.dc partially failed), lf3 (rollback command block with 
t6).}}
 
And once t5.dc completes, this is how the layout looks:
{{base file, lf1(from t2.dc partially failed), lf3 (rollback command block with 
t6), lf4 (from t5)}}
 
At this point, when we trigger a snapshot read or tagLocation with a global 
index, maxInstant is set to the last entry in the commits timeline, which is 
t5. While processing the log blocks, LogRecordReader reaches lf3, detects that 
its timestamp t6 > t5 (i.e. the max instant time), and bails out of the for 
loop. So, in essence, it may not even read lf4 in the above scenario.
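
The scenario above can be reproduced with a small toy model. This is only a 
sketch and not Hudi code: LogBlock, scan_with_break and 
scan_honoring_rollbacks are hypothetical names, instants are modelled as plain 
integers (t2 -> 2), and the sketch ignores any other filtering the real reader 
may apply (for example, timeline checks on uncommitted instants). The second 
function is one possible way to handle the case, assumed here for 
illustration, not necessarily the fix this issue will adopt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogBlock:
    """Hypothetical stand-in for a log block; not a Hudi class."""
    name: str
    instant: int                  # instant time of the block, e.g. t2 -> 2
    kind: str = "data"            # "data" or "rollback"
    target: Optional[int] = None  # instant rolled back (rollback blocks only)


def scan_with_break(blocks: List[LogBlock], max_instant: int) -> List[str]:
    """Toy version of the behaviour described in the issue: stop scanning
    at the first block whose instant is greater than max_instant."""
    surviving = []
    for block in blocks:
        if block.instant > max_instant:
            break  # the rollback block at t6 ends the scan here
        if block.kind == "data":
            surviving.append(block.name)
    return surviving


def scan_honoring_rollbacks(blocks: List[LogBlock], max_instant: int) -> List[str]:
    """Assumed alternative: always honor rollback blocks, and skip only
    data blocks that are newer than max_instant."""
    rolled_back = {b.target for b in blocks if b.kind == "rollback"}
    return [
        b.name
        for b in blocks
        if b.kind == "data"
        and b.instant <= max_instant
        and b.instant not in rolled_back
    ]


# Layout from the illustration: lf1 (t2, failed), lf3 (rollback at t6), lf4 (t5).
layout = [
    LogBlock("lf1", 2),
    LogBlock("lf3", 6, kind="rollback", target=2),
    LogBlock("lf4", 5),
]

print(scan_with_break(layout, max_instant=5))          # ['lf1']
print(scan_honoring_rollbacks(layout, max_instant=5))  # ['lf4']
```

In this toy model, the early break both skips the rollback of t2 and never 
reaches lf4, while the second variant drops lf1 and keeps lf4.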

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
