[
https://issues.apache.org/jira/browse/HDDS-13492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose resolved HDDS-13492.
-------------------------------
Fix Version/s: 2.1.0
Resolution: Fixed
> Corrupt Replica Not Removed During Over-Replication; Checksum Divergence
> After Reconcile Command
> ------------------------------------------------------------------------------------------------
>
> Key: HDDS-13492
> URL: https://issues.apache.org/jira/browse/HDDS-13492
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Bablu Raul
> Priority: Major
> Fix For: 2.1.0
>
>
> {code:java}
> data9
> data8
> data3 {code}
> I stopped datanodes data3 and data8 and waited for re-replication to a new
> datanode. The affected container transitioned to the QUASI_CLOSED state.
> {code:java}
> data9
> data6
> data5 {code}
> On data6, I deleted all container directories using rm -rf containerDir*.
> On {{data5}}, I deleted *only the specific container directory* that
> held the relevant block file, to intentionally create an inconsistency and
> confuse the system.
> Subsequently, I restarted the previously stopped datanodes {{data8}} and
> {{data3}}.
> I then verified the checksum without running any reconcile command:
> {code:java}
> data2 0
> data4 95806261
> data3 95806261
> data8 95806261
> data6 0
> data5 0 {code}
> The output above is as expected. I then verified the checksum again, still
> without running any reconcile command:
> {code:java}
> data2 95806261
> data4 95806261
> data3 95806261
> data6 0
> data5 0 {code}
> I then ran the reconcile command:
> {code:java}
> data2 95806261
> data4 be7eccf4
> data3 95806261
> data6 0
> data5 0 {code}
> After waiting 20 minutes:
> {code:java}
> data2 95806261
> data4 be7eccf4
> data3 95806261
> data6 0
> data5 0 {code}
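
To make the divergence in the snapshot above easier to see, here is a minimal sketch that groups replicas by reported data checksum. The datanode names and checksum values are copied from the report; the function name `group_by_checksum` is hypothetical and is not part of Ozone.

```python
from collections import defaultdict

# Post-reconcile snapshot copied from the report: datanode -> data checksum.
# Checksums are kept as strings exactly as printed in the report.
snapshot = {
    "data2": "95806261",
    "data4": "be7eccf4",
    "data3": "95806261",
    "data6": "0",
    "data5": "0",
}


def group_by_checksum(replicas):
    """Map each distinct checksum to the sorted list of datanodes reporting it."""
    groups = defaultdict(list)
    for datanode, checksum in replicas.items():
        groups[checksum].append(datanode)
    return {checksum: sorted(nodes) for checksum, nodes in groups.items()}


groups = group_by_checksum(snapshot)
for checksum, nodes in sorted(groups.items()):
    print(checksum, nodes)
```

Three distinct checksum groups for one container indicate that the replicas have not converged: data4 now stands alone with a value that no other replica reports.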
> Excerpt from dn-container.log:
> {code:java}
> 2025-06-19 22:08:46,417 | INFO | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |
> 2025-06-19 22:10:46,517 | INFO | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |
> 2025-06-19 22:11:15,353 | WARN | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 | Container data checksum updated from 0 to 95806261 |
> 2025-06-19 22:11:15,353 | INFO | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 |
> 2025-06-19 22:13:14,250 | WARN | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=be7eccf4 | Container data checksum updated from 95806261 to be7eccf4 | {code}
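
The sequence of checksum changes in the log can be pulled out mechanically. The sketch below assumes each log entry sits on a single line and that the field layout is exactly the pipe-separated one shown in the excerpt; the helper name `checksum_updates` is hypothetical.

```python
import re

# Entries copied from the report, one entry per line.
LOG_LINES = [
    "2025-06-19 22:08:46,417 | INFO | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |",
    "2025-06-19 22:10:46,517 | INFO | ID=3247 | Index=0 | BCSID=8 | State=QUASI_CLOSED | DataChecksum=0 |",
    "2025-06-19 22:11:15,353 | WARN | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 | Container data checksum updated from 0 to 95806261 |",
    "2025-06-19 22:11:15,353 | INFO | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=95806261 |",
    "2025-06-19 22:13:14,250 | WARN | ID=3247 | Index=0 | BCSID=8 | State=CLOSED | DataChecksum=be7eccf4 | Container data checksum updated from 95806261 to be7eccf4 |",
]

UPDATE = re.compile(r"Container data checksum updated from (\S+) to (\S+)")


def checksum_updates(lines):
    """Return (timestamp, state, old, new) for every checksum-update entry."""
    updates = []
    for line in lines:
        match = UPDATE.search(line)
        if match is None:
            continue  # not a checksum-update entry
        fields = [field.strip() for field in line.split("|")]
        timestamp = fields[0]
        state = next(f.split("=", 1)[1] for f in fields if f.startswith("State="))
        updates.append((timestamp, state, match.group(1), match.group(2)))
    return updates


updates = checksum_updates(LOG_LINES)
for timestamp, state, old, new in updates:
    print(f"{timestamp} [{state}] {old} -> {new}")
```

Read this way, the log shows two updates about two minutes apart on a container already in the CLOSED state: first from 0 to 95806261, then from 95806261 to be7eccf4.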
> The expected behavior is that running the reconcile command should resolve
> inconsistencies by removing the replicas on the empty datanodes and
> retaining the correct replicas. However, in this case, the reconcile command
> did not remove the empty replicas and instead corrupted a valid replica.
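
As an illustration of the expectation stated above, here is a hypothetical selection policy for an over-replicated container: remove empty replicas first, then replicas whose checksum is least common. This is only a sketch of what the reporter expected, not the logic Ozone actually uses. The function name `replicas_to_remove`, the target of three replicas, and the reading of checksum `0` as "empty" are all assumptions.

```python
from collections import Counter

# Snapshot taken before the reconcile command, copied from the report.
before_reconcile = {
    "data2": "95806261",
    "data4": "95806261",
    "data3": "95806261",
    "data6": "0",
    "data5": "0",
}


def replicas_to_remove(replicas, target_count=3):
    """Pick excess replicas to delete (hypothetical policy, not Ozone's)."""
    excess = len(replicas) - target_count
    if excess <= 0:
        return []  # not over-replicated, nothing to remove
    counts = Counter(replicas.values())

    def removal_priority(item):
        datanode, checksum = item
        # Assumption: checksum "0" marks an empty replica, so it sorts first.
        # Ties are broken by how rare the checksum is, then by datanode name.
        return (checksum != "0", counts[checksum], datanode)

    ranked = sorted(replicas.items(), key=removal_priority)
    return [datanode for datanode, _ in ranked[:excess]]


to_remove = replicas_to_remove(before_reconcile)
print(to_remove)
```

Under these assumptions the policy selects data5 and data6 for removal and leaves the three replicas with matching checksums untouched, which is the outcome the report describes as expected.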