Shashikant Banerjee created HDDS-5619:
-----------------------------------------

             Summary: Ozone data corruption issue on follower node
                 Key: HDDS-5619
                 URL: https://issues.apache.org/jira/browse/HDDS-5619
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode
            Reporter: Aravindan Vijayan
            Assignee: Shashikant Banerjee
         Attachments: repro.patch

A data corruption issue was recently observed in one of the clusters, where 
follower datanode replicas of containers were found to be corrupted. The issue was 
primarily caused by a race condition between the readStateMachine 
and writeStateMachine threads, which were reading and writing the chunks 
concurrently. The following logs confirm this:
{code:java}
INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read 
ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56510
2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56513
2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56507
2021-08-11 2028,542 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,543 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read 
ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,544 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read 
ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,545 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56513
2021-08-11 2028,549 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,550 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read 
ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,551 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read 
ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,553 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56513
2021-08-11 2028,648 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write 
ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56507
{code}
Until now, the assumption was that readStateMachine and writeStateMachine 
threads are executed serially on a single-thread executor, selected using a hash 
function on the BlockId, which doesn't seem to work well.
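
The assumption described above can be illustrated with a minimal sketch. This is 
not the actual ContainerStateMachine code: the class BlockKeyedExecutors, its 
methods, and the pool size of 4 are hypothetical names chosen for illustration, 
and the block ID is simply the numeric prefix of the chunk file names in the logs. 
It only shows the intended guarantee: tasks keyed by the same block ID land on the 
same single-thread executor and therefore run one after another.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not the real Ozone datanode implementation.
class BlockKeyedExecutors {
    private final ExecutorService[] executors;

    BlockKeyedExecutors(int size) {
        executors = new ExecutorService[size];
        for (int i = 0; i < size; i++) {
            // Each slot is backed by exactly one thread, so tasks in a slot run in FIFO order.
            executors[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same block ID always maps to the same slot.
    int indexFor(long blockId) {
        return Math.floorMod(Long.hashCode(blockId), executors.length);
    }

    // Reads and writes must both go through this method with the same key
    // for the serial-execution assumption to hold.
    CompletableFuture<Void> submit(long blockId, Runnable task) {
        return CompletableFuture.runAsync(task, executors[indexFor(blockId)]);
    }

    void shutdown() {
        for (ExecutorService e : executors) {
            e.shutdown();
        }
    }

    // Submits a write and then a read for the same block and records the execution order.
    static List<String> demo() {
        BlockKeyedExecutors pool = new BlockKeyedExecutors(4);
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        long blockId = 107544261427200162L; // numeric prefix of the chunk file names in the logs
        CompletableFuture<Void> write = pool.submit(blockId, () -> order.add("write"));
        CompletableFuture<Void> read = pool.submit(blockId, () -> order.add("read"));
        CompletableFuture.allOf(write, read).join();
        pool.shutdown();
        return new ArrayList<>(order);
    }
}
```

In this sketch the ordering guarantee depends entirely on the read and write paths 
using the same key and the same set of executors. If either path derived its key 
differently or used a separate pool, a read and a write for the same chunk could 
overlap. That is a general property of the pattern, not a diagnosis of this bug.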



