[ https://issues.apache.org/jira/browse/HDDS-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shashikant Banerjee updated HDDS-5619:
--------------------------------------
Attachment: repro.patch
> Ozone data corruption issue on follower node
> --------------------------------------------
>
> Key: HDDS-5619
> URL: https://issues.apache.org/jira/browse/HDDS-5619
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Aravindan Vijayan
> Assignee: Shashikant Banerjee
> Priority: Major
> Attachments: repro.patch
>
>
> A data corruption issue was recently observed in one of the clusters, where
> container replicas on follower datanodes were found to be corrupted. The issue
> was primarily caused by a race condition between the readStateMachine and
> writeStateMachine threads, which were reading and writing the chunks
> concurrently. The following logs confirm this:
> {code:java}
> INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read
> ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56510
> 2021-08-11 2028,524 [ChunkWriter-1-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56513
> 2021-08-11 2028,524 [ChunkWriter-1-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56507
> 2021-08-11 2028,542 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,543 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read
> ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,544 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read
> ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,545 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56513
> 2021-08-11 2028,549 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,550 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read
> ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,551 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read
> ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,553 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56513
> 2021-08-11 2028,648 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for
> Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56507
> {code}
> Until now, the assumption was that readStateMachine and writeStateMachine
> threads are executed serially on a single-thread executor, chosen by a hash
> function on the BlockId, which does not seem to work as intended.
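
For readers unfamiliar with the scheme described above, here is a minimal Java sketch of the assumed design: a block ID is hashed to pick one of several single-thread executors, so that every task for a given block runs in submission order on one thread. The class BlockSerialExecutor and its methods indexFor and submit are hypothetical names invented for illustration and are not taken from ContainerStateMachine. The sample ID is the numeric prefix of the chunk file names in the log, assumed here to identify the block. The sketch does not attempt to reproduce or explain the corruption; it only shows what the serialization assumption would require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not the Ozone datanode code.
class BlockSerialExecutor implements AutoCloseable {
  private final ExecutorService[] executors;

  BlockSerialExecutor(int numExecutors) {
    executors = new ExecutorService[numExecutors];
    for (int i = 0; i < numExecutors; i++) {
      // Each executor owns exactly one thread and a FIFO queue.
      executors[i] = Executors.newSingleThreadExecutor();
    }
  }

  // The same block ID always maps to the same executor index.
  int indexFor(long blockId) {
    return Math.floorMod(Long.hashCode(blockId), executors.length);
  }

  // Ordering holds only if reads AND writes for a block both go through
  // this method with the same blockId (an assumption of this sketch).
  <T> CompletableFuture<T> submit(long blockId, Supplier<T> task) {
    return CompletableFuture.supplyAsync(task, executors[indexFor(blockId)]);
  }

  @Override
  public void close() {
    for (ExecutorService e : executors) {
      e.shutdown();
    }
  }

  public static void main(String[] args) {
    long blockId = 107544261427200162L; // assumed block ID, from the log file names
    List<String> order = new ArrayList<>(); // mutated only by one executor thread
    try (BlockSerialExecutor ex = new BlockSerialExecutor(4)) {
      ex.submit(blockId, () -> order.add("write chunk_2"));
      // join() waits for the read, which was queued behind the write.
      ex.submit(blockId, () -> order.add("read chunk_2")).join();
      System.out.println(order);
    }
  }
}
```

In this sketch the guarantee depends on every read and write for a block being submitted with the same key to the same executor array. A task that runs on a different executor, or that finishes its work asynchronously after returning, would fall outside that ordering.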