[ 
https://issues.apache.org/jira/browse/HDDS-13492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose resolved HDDS-13492.
-------------------------------
    Fix Version/s: 2.1.0
       Resolution: Fixed

> Corrupt Replica Not Removed During Over-Replication; Checksum Divergence 
> After Reconcile Command
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-13492
>                 URL: https://issues.apache.org/jira/browse/HDDS-13492
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Bablu Raul
>            Priority: Major
>             Fix For: 2.1.0
>
>
> {code:java}
> data9  
> data8  
> data3 {code}
> I stopped datanodes data3 and data8 and waited for re-replication to a new 
> datanode. The affected container transitioned to the QUASI_CLOSED state.
> {code:java}
> data9  
> data6  
> data5 {code}
> On {{data6}}, I deleted all container directories using 
> {{rm -rf containerDir*}}.
> On {*}{{data5}}{*}, I deleted *only the specific container directory* that 
> held the relevant block file, to intentionally create an inconsistency 
> between the replicas.
> I then restarted the previously stopped datanodes {{data8}} and {{data3}}, 
> and verified the checksums without running any reconcile command:
> {code:java}
> data2  0
> data4  95806261
> data3  95806261
> data8  95806261
> data6  0
> data5  0 {code}
> The output above is as expected. I then verified the checksums again, still 
> without running any reconcile command:
> {code:java}
> data2  95806261  
> data4  95806261  
> data3  95806261  
> data6  0  
> data5  0 {code}
> I then ran the reconcile command:
> {code:java}
> data2 95806261  
> data4 be7eccf4  
> data3 95806261  
> data6 0  
> data5 0 {code}
> After waiting 20 minutes, the checksums were unchanged:
> {code:java}
> data2 95806261  
> data4 be7eccf4  
> data3 95806261  
> data6 0  
> data5 0 {code}
> Excerpt from dn-container.log:
> {code:java}
> 2025-06-19 22:08:46,417 | INFO  | ID=3247 | Index=0 | BCSID=8 | 
> State=QUASI_CLOSED | DataChecksum=0 |  
> 2025-06-19 22:10:46,517 | INFO  | ID=3247 | Index=0 | BCSID=8 | 
> State=QUASI_CLOSED | DataChecksum=0 |  
> 2025-06-19 22:11:15,353 | WARN  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED 
> | DataChecksum=95806261 | Container data checksum updated from 0 to 95806261 
> |  
> 2025-06-19 22:11:15,353 | INFO  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED 
> | DataChecksum=95806261 |  
> 2025-06-19 22:13:14,250 | WARN  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED 
> | DataChecksum=be7eccf4 | Container data checksum updated from 95806261 to 
> be7eccf4 | {code}
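
The checksum history in the excerpt above can be pulled out mechanically. The following is a minimal sketch, assuming each log entry is a single pipe-delimited line of `key=value` fields (the entries are wrapped in this email). The helper names `parse_line` and `checksum_transitions` are hypothetical and are not part of Ozone.

```python
# Log entries from the report, joined back into single lines.
LOG_LINES = [
    "2025-06-19 22:08:46,417 | INFO  | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |",
    "2025-06-19 22:10:46,517 | INFO  | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |",
    "2025-06-19 22:11:15,353 | WARN  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 | Container data checksum updated from 0 to 95806261 |",
    "2025-06-19 22:11:15,353 | INFO  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 |",
    "2025-06-19 22:13:14,250 | WARN  | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=be7eccf4 | Container data checksum updated from 95806261 to be7eccf4 |",
]


def parse_line(line):
    """Split one pipe-delimited entry into a dict (hypothetical helper)."""
    parts = [p.strip() for p in line.split("|")]
    parts = [p for p in parts if p]  # drop the empty field after the trailing pipe
    record = {"timestamp": parts[0], "level": parts[1], "message": None}
    for part in parts[2:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        else:
            # Free-text field, e.g. "Container data checksum updated from ..."
            record["message"] = part
    return record


def checksum_transitions(lines):
    """Return (timestamp, old, new) for each change in DataChecksum."""
    transitions = []
    previous = None
    for line in lines:
        record = parse_line(line)
        current = record.get("DataChecksum")
        if previous is not None and current != previous:
            transitions.append((record["timestamp"], previous, current))
        previous = current
    return transitions


for timestamp, old, new in checksum_transitions(LOG_LINES):
    print(f"{timestamp}: {old} -> {new}")
```

On the entries in the report this yields two transitions: from 0 to 95806261 at 22:11:15, and from 95806261 to be7eccf4 at 22:13:14.
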
> The expected behavior is that running the reconcile command resolves the 
> inconsistencies by removing the replicas on the emptied datanodes and 
> retaining the correct replicas. In this case, however, the reconcile command 
> did not remove the empty replicas and instead corrupted a valid replica.
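
The divergence described above can be checked from the reported listings alone. Below is a minimal sketch, assuming a checksum of `0` means that no data checksum is recorded for the replica, and using a simple majority vote among non-empty replicas to pick the reference value. Both the majority heuristic and the function names are assumptions for illustration, not a description of how Ozone reconciliation decides.

```python
from collections import defaultdict

# Checksums copied from the post-reconcile listing in the report.
REPLICAS = {
    "data2": "95806261",
    "data4": "be7eccf4",
    "data3": "95806261",
    "data6": "0",
    "data5": "0",
}


def group_by_checksum(replicas):
    """Map each checksum to the datanodes reporting it."""
    groups = defaultdict(list)
    for node, checksum in replicas.items():
        groups[checksum].append(node)
    return dict(groups)


def divergent_replicas(replicas, empty="0"):
    """Return (majority_checksum, outliers) among non-empty replicas.

    Assumption: `empty` marks a replica with no recorded data checksum,
    and the most common non-empty checksum is taken as the reference.
    """
    non_empty = {n: c for n, c in replicas.items() if c != empty}
    if not non_empty:
        return None, []
    groups = group_by_checksum(non_empty)
    majority = max(groups, key=lambda c: len(groups[c]))
    outliers = sorted(n for n, c in non_empty.items() if c != majority)
    return majority, outliers


majority, outliers = divergent_replicas(REPLICAS)
empty_nodes = group_by_checksum(REPLICAS).get("0", [])
print("reference checksum:", majority)
print("divergent replicas:", outliers)
print("empty replicas:", empty_nodes)
```

Applied to the listing taken after the reconcile command, this flags data4 as divergent and data6 and data5 as empty, which matches the behavior reported in the issue.
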



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
