yanxiang created HUDI-4044:
------------------------------

             Summary: When reading data from flink-hudi to external storage, 
the result is incorrect
                 Key: HUDI-4044
                 URL: https://issues.apache.org/jira/browse/HUDI-4044
             Project: Apache Hudi
          Issue Type: Bug
          Components: flink
    Affects Versions: 0.11.0
            Reporter: yanxiang


When data is read from a Hudi table with Flink and written to external storage,
the result can be incorrect because of a concurrency issue.
 
Here is the case:
 
A split_monitor task polls the timeline for changes every N seconds, and four
split_reader tasks process the changed data and sink it to external storage:
 
(1) First, split_monitor detects the change for Instance1, whose corresponding
fileId is log1. split_monitor distributes the fileId information to
split_reader task 1 in rebalance mode for processing.
 
(2) Then, split_monitor detects the change for Instance2. The corresponding
fileId is again log1 (assuming that the changed records have the same primary
key). split_monitor distributes the fileId information to split_reader task 2
in rebalance mode for processing.
 
(3) split_reader task 1 and split_reader task 2 process data with the same
primary key, but at different speeds. As a result, the order in which the data
is written to external storage is not guaranteed: data modified earlier can
overwrite data modified later, producing incorrect results.
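
The distribution in steps (1) and (2) can be sketched with a simplified model
of round-robin (rebalance) assignment. This is plain Java, not Flink or Hudi
code; the class name RebalanceDemo and the method assignRoundRobin are
hypothetical and exist only for illustration. Reader indexes are zero-based
here, so readers 0 and 1 stand in for split_reader tasks 1 and 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RebalanceDemo {
    // Simplified round-robin model: the i-th split goes to reader
    // i % parallelism. The fileId is ignored when choosing the reader.
    static List<Integer> assignRoundRobin(List<String> fileIds, int parallelism) {
        List<Integer> readers = new ArrayList<>();
        for (int i = 0; i < fileIds.size(); i++) {
            readers.add(i % parallelism);
        }
        return readers;
    }

    public static void main(String[] args) {
        // Two consecutive changes (Instance1, Instance2) touch the same fileId.
        List<String> splits = Arrays.asList("log1", "log1");
        List<Integer> readers = assignRoundRobin(splits, 4);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("Instance" + (i + 1) + " fileId=" + splits.get(i)
                    + " -> reader " + readers.get(i));
        }
    }
}
```

Because the two splits for log1 land on different readers, nothing orders
their writes to the external sink relative to each other.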
 
 
Solution:
After the split_monitor task detects data changes, it should distribute them to
the split_reader tasks by hashing the fileId, so that splits with the same
fileId are always processed by the same split_reader task. This solves the
problem.
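
A minimal sketch of the proposed routing, in plain Java rather than actual
Hudi code. The class name FileIdHashRouting and the method readerFor are
hypothetical. In a Flink job this kind of routing would typically come from
keyed partitioning (for example keyBy on the fileId), but the real fix may be
implemented differently.

```java
public class FileIdHashRouting {
    // Map a fileId to a reader index in [0, parallelism).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int readerFor(String fileId, int parallelism) {
        return Math.floorMod(fileId.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Both changes touch fileId "log1", so both go to the same reader
        // and are processed there in arrival order.
        int first = readerFor("log1", parallelism);
        int second = readerFor("log1", parallelism);
        System.out.println("Instance1 fileId=log1 -> reader " + first);
        System.out.println("Instance2 fileId=log1 -> reader " + second);
        System.out.println("same reader: " + (first == second));
    }
}
```

The trade-off is that load is spread by fileId rather than evenly per split,
so a few very active file groups can make one reader busier than the others.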



--
This message was sent by Atlassian Jira
(v8.20.7#820007)