[
https://issues.apache.org/jira/browse/HDDS-14697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kodali Bhavya Sree updated HDDS-14697:
--------------------------------------
Description:
Snapshot data integrity is lost after decommissioning the leader OM node.
The steps followed are:
{code:java}
1. Create a volume and bucket with different params. Generate keys in the
bucket.
2. Calculate the checksum of all the files.
3. Create two snapshots, then delete one of them.
4. Decommission the leader OM node.
5. Validate the checksums of the files.
6. Create a snapshot after decommissioning and calculate the snapdiff against
the snapshot created before decommissioning.{code}
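The checksum steps (2 and 5) can be sketched roughly as below. This is only an illustration: the class {{SnapshotChecksumCheck}}, its method names, and the choice of SHA-256 are assumptions, since the report does not say which checksum the test workload uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper; not part of Ozone or of the test workload in this report. */
final class SnapshotChecksumCheck {

  /** Maps every regular file under root (by relative path) to its SHA-256 hex digest. */
  static Map<String, String> checksums(Path root) throws IOException {
    List<Path> files;
    try (Stream<Path> paths = Files.walk(root)) {
      files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    Map<String, String> result = new TreeMap<>();
    for (Path file : files) {
      result.put(root.relativize(file).toString(), sha256(file));
    }
    return result;
  }

  /** Streams the file through a SHA-256 digest (assumed algorithm) and returns the hex string. */
  static String sha256(Path file) throws IOException {
    MessageDigest md;
    try {
      md = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // Reading through the DigestInputStream updates the digest.
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Returns the paths whose checksum differs, or that exist on only one side. */
  static Set<String> mismatches(Map<String, String> expected, Map<String, String> actual) {
    Set<String> bad = new TreeSet<>();
    for (Map.Entry<String, String> e : expected.entrySet()) {
      if (!e.getValue().equals(actual.get(e.getKey()))) {
        bad.add(e.getKey());
      }
    }
    for (String key : actual.keySet()) {
      if (!expected.containsKey(key)) {
        bad.add(key);
      }
    }
    return bad;
  }
}
```

With this sketch, step 2 would call {{checksums}} on the original files, and step 5 would call it again on the directory the snapshot was copied into. An empty result from {{mismatches}} means the data matches.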
Checksum validation of snapshots fails after decommissioning. The failure
occurs at the following step, which copies all objects under snapshot
{{snap-j0vp8}} of that bucket into a local directory:
{code:java}
2026-02-22 09:46:51,203|INFO|MainThread|machine.py:190 -
run()||GUID=c7d2cbe1-f9e1-4074-a0fe-00b4d295bbbe|RUNNING: ssh -l root -i
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o
UserKnownHostsFile=/dev/null quasar-nntvzt-3.vpc.cloudera.com "export
KRB5CCNAME=/hwqe/hadoopqe/artifacts/kerberosTickets/hrt_qa.kerberos.ticket;
export OZONE_LOGLEVEL=INFO;/opt/cloudera/parcels/CDH/bin/ozone fs -get
ofs://ozone1771736427/vol-test-workload-om-decommission-recommission-1771752191/buck-test-workload-om-decommission-recommission-1771752191/.snapshot/snap-j0vp8/*
/test_master_node_decommissioning_om_workload/workload_local1771752225"{code}
The following NullPointerException is seen:
{code:java}
2026-02-22 09:46:53,571|INFO|MainThread|machine.py:205 -
run()||GUID=c7d2cbe1-f9e1-4074-a0fe-00b4d295bbbe|26/02/22 01:46:53 INFO
retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
org.apache.hadoop.ipc_.RemoteException(java.lang.IllegalStateException):
java.lang.NullPointerException: Cannot invoke
"org.apache.hadoop.ozone.om.snapshot.OmSnapshotLocalDataManager$SnapshotVersionsMeta.getVersion()"
because the return value of
"org.apache.hadoop.ozone.om.snapshot.OmSnapshotLocalDataManager$ReadableOmSnapshotLocalDataMetaProvider.getMeta()"
is null{code}
The definition of getMeta() is:
{code:java}
public synchronized SnapshotVersionsMeta getMeta() throws IOException {
  if (closed) {
    throw new IOException("Resource has already been closed.");
  }
  return meta;
}
{code}
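The quoted method checks only the {{closed}} flag, so a caller that dereferences the result fails whenever {{meta}} was never populated. The sketch below reproduces that pattern with simplified stand-in classes ({{VersionsMeta}}, {{MetaProvider}}) and a hypothetical defensive accessor ({{getMetaOrThrow}}). None of these names are the actual Ozone classes, and the report does not establish why {{meta}} is null after decommissioning, so this is not a proposed fix.

```java
import java.io.IOException;

/** Simplified stand-in for SnapshotVersionsMeta; not the Ozone source. */
final class VersionsMeta {
  private final int version;

  VersionsMeta(int version) {
    this.version = version;
  }

  int getVersion() {
    return version;
  }
}

/** Simplified stand-in for the meta provider named in the stack trace. */
final class MetaProvider implements AutoCloseable {
  private final VersionsMeta meta; // null when no local data was loaded (assumed scenario)
  private boolean closed;

  MetaProvider(VersionsMeta meta) {
    this.meta = meta;
  }

  /** Same shape as the quoted getMeta(): guards against closed, but may return null. */
  synchronized VersionsMeta getMeta() throws IOException {
    if (closed) {
      throw new IOException("Resource has already been closed.");
    }
    return meta;
  }

  /** Hypothetical variant: reports a missing meta as an IOException instead of an NPE. */
  synchronized VersionsMeta getMetaOrThrow() throws IOException {
    VersionsMeta m = getMeta();
    if (m == null) {
      throw new IOException("Snapshot local data meta is not available.");
    }
    return m;
  }

  @Override
  public synchronized void close() {
    closed = true;
  }
}
```

Calling {{new MetaProvider(null).getMeta().getVersion()}} throws a NullPointerException, matching the shape of the logged error. {{getMetaOrThrow()}} instead surfaces the same condition as a descriptive IOException at the point where the null is first observed.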
Upstream PR where these changes went in:
https://github.infra.cloudera.com/CDH/ozone/commit/1dda9abfa979f3282d3355e154ce025f76d48665
> Data integrity is missing for the snapshot after de-commissioning Leader OM
> node. Below are the steps followed:
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-14697
> URL: https://issues.apache.org/jira/browse/HDDS-14697
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Manager
> Affects Versions: 2.0.0
> Reporter: Kodali Bhavya Sree
> Priority: Critical
>