ZiyueGuan created HUDI-2400:
-------------------------------

             Summary: Allow timeline server correctly sync when concurrent 
write to timeline
                 Key: HUDI-2400
                 URL: https://issues.apache.org/jira/browse/HUDI-2400
             Project: Apache Hudi
          Issue Type: Sub-task
          Components: Compaction
            Reporter: ZiyueGuan


First, assume HUDI-1847 is available, so an ingestion Spark job and a 
compaction job can run at the same time.
Assume also that each HoodieTimeline object carries a timestamp that 
represents the time at which it was loaded from HDFS.
Consider the following case:
 1. The ingestion job schedules compaction inline. The timeline is now: 
1.deltaCommit.Completed, 2.Compaction.Requested (TimeStamp: 1L)
 2. Ingestion then moves on. The ingestion job now has 
1.deltaCommit.Completed, 2.Compaction.Requested, 3.deltaCommit.Inflight 
(TimeStamp: 2L)
 3. An independent Spark job runs compaction 2. The timeline is now: 
1.deltaCommit.Completed, 2.Compaction.Inflight, 3.deltaCommit.Inflight 
(TimeStamp: 3L)
 4. Executors in the ingestion job send requests to the timeline server while 
still holding the timeline with TimeStamp 2L. The timeline server, however, 
holds the timeline with TimeStamp 3L, which is later than the client's.

According to the logic in 
https://github.com/apache/hudi/blob/master/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java#L137,
 
we assume that the local view of the table's timeline is behind the client's 
view whenever the timeline hashes differ. However, this may not be true in the 
case described above.
There, the hashes differ because the client's view is behind the local view.
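
To make the ambiguity concrete, here is a minimal sketch of a hash-only 
check. It is not the actual RequestHandler code: the class name, the method 
names and the use of SHA-256 over the instant strings are assumptions made 
for illustration only.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical illustration; none of these names come from the Hudi code base.
class HashOnlySyncSketch {

    // Assumed stand-in for a timeline hash: a digest over the instant strings.
    static String timelineHash(List<String> instants) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String instant : instants) {
                md.update(instant.getBytes(StandardCharsets.UTF_8));
            }
            return new BigInteger(1, md.digest()).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hash-only decision: any mismatch is read as "the server is behind".
    static boolean serverLooksBehind(String serverHash, String clientHash) {
        return !serverHash.equals(clientHash);
    }

    public static void main(String[] args) {
        // Step 4 of the case above: the client holds the older timeline (2L),
        // the server holds the newer one (3L).
        List<String> client = List.of(
            "1.deltaCommit.Completed", "2.Compaction.Requested", "3.deltaCommit.Inflight");
        List<String> server = List.of(
            "1.deltaCommit.Completed", "2.Compaction.Inflight", "3.deltaCommit.Inflight");

        // The check reports the server as behind although it is actually ahead.
        System.out.println("server looks behind: "
            + serverLooksBehind(timelineHash(server), timelineHash(client)));
    }
}
```

A hash mismatch only says that the two views differ; it carries no 
information about which side is the stale one.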

A simple solution is to add an attribute to the timeline that holds the 
timestamp described above. 
The timeline server can then decide whether to sync its fileSystemView by 
comparing the client and local timestamps rather than by checking whether the 
timeline hashes differ.
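
The proposed check could look roughly like the sketch below. The class and 
method names are hypothetical, and it assumes that the client sends its 
timeline timestamp with each request and that client and server timestamps 
are comparable.

```java
// Hypothetical illustration of the proposal; not an existing Hudi API.
class TimestampSyncSketch {

    // The server syncs only when its own timeline was loaded earlier than
    // the timeline the client is holding.
    static boolean serverShouldSync(long serverTimelineTimestamp, long clientTimelineTimestamp) {
        return serverTimelineTimestamp < clientTimelineTimestamp;
    }

    public static void main(String[] args) {
        // Case from this issue: server at 3L, client at 2L.
        // The server is ahead, so no sync is triggered.
        System.out.println("server 3L, client 2L -> sync: " + serverShouldSync(3L, 2L));

        // Opposite case: server at 1L, client at 2L.
        // The server is behind and should sync.
        System.out.println("server 1L, client 2L -> sync: " + serverShouldSync(1L, 2L));
    }
}
```

With this rule the request in step 4 no longer forces a sync, because the 
server's timeline (3L) is newer than the client's (2L).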



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
