[ 
https://issues.apache.org/jira/browse/HDDS-14697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kodali Bhavya Sree updated HDDS-14697:
--------------------------------------
    Description: 
Snapshot data integrity validation fails after decommissioning the leader OM 
node. Below are the steps followed:
{code:java}
1. Create a volume and a bucket with different parameters. Generate keys in the 
bucket.
2. Calculate the checksums of all the files.
3. Create two snapshots and delete one of them.
4. Decommission the leader OM node.
5. Validate the checksums of the files.
6. Create a snapshot after decommissioning and compute the snapdiff against the 
snapshot created before decommissioning.{code}
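
Steps 2 and 5 can be sketched as below. This is only an illustration of the comparison, not the test harness itself: the class name {{ChecksumCompareSketch}}, the choice of SHA-256, and the two local directories passed on the command line are assumptions, since the report does not state which checksum the workload uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical helper: compares per-file checksums of two local directory trees. */
public class ChecksumCompareSketch {

  /** Returns the SHA-256 digest of one file as a lowercase hex string. */
  static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        digest.update(buf, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Maps each regular file under root (by relative path) to its checksum. */
  static Map<String, String> checksums(Path root)
      throws IOException, NoSuchAlgorithmException {
    Map<String, String> result = new TreeMap<>();
    try (Stream<Path> paths = Files.walk(root)) {
      Iterable<Path> files = paths.filter(Files::isRegularFile)::iterator;
      for (Path p : files) {
        result.put(root.relativize(p).toString(), sha256(p));
      }
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: ChecksumCompareSketch <dirBefore> <dirAfter>");
      return;
    }
    // args[0]: files checksummed before decommissioning (step 2)
    // args[1]: files copied out of the snapshot afterwards (step 5)
    Map<String, String> before = checksums(Paths.get(args[0]));
    Map<String, String> after = checksums(Paths.get(args[1]));
    System.out.println(before.equals(after) ? "checksums match" : "checksum mismatch");
  }
}
```

The maps are keyed by relative path, so a missing or extra file shows up as a mismatch just like changed content does.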

Checksum validation of the snapshots fails after decommissioning.
The failure occurs at the following step, which copies all objects under 
snapshot {{snap-j0vp8}} of that bucket to a local directory:
{code:java}
2026-02-22 09:46:51,203|INFO|MainThread|machine.py:190 - 
run()||GUID=c7d2cbe1-f9e1-4074-a0fe-00b4d295bbbe|RUNNING: ssh -l root -i 
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o 
UserKnownHostsFile=/dev/null quasar-nntvzt-3.vpc.cloudera.com "export 
KRB5CCNAME=/hwqe/hadoopqe/artifacts/kerberosTickets/hrt_qa.kerberos.ticket; 
export OZONE_LOGLEVEL=INFO;/opt/cloudera/parcels/CDH/bin/ozone fs -get 
ofs://ozone1771736427/vol-test-workload-om-decommission-recommission-1771752191/buck-test-workload-om-decommission-recommission-1771752191/.snapshot/snap-j0vp8/*
 /test_master_node_decommissioning_om_workload/workload_local1771752225"{code}

The following NullPointerException is seen:
{code:java}
2026-02-22 09:46:53,571|INFO|MainThread|machine.py:205 - 
run()||GUID=c7d2cbe1-f9e1-4074-a0fe-00b4d295bbbe|26/02/22 01:46:53 INFO 
retry.RetryInvocationHandler: com.google.protobuf.ServiceException: 
org.apache.hadoop.ipc_.RemoteException(java.lang.IllegalStateException): 
java.lang.NullPointerException: Cannot invoke 
"org.apache.hadoop.ozone.om.snapshot.OmSnapshotLocalDataManager$SnapshotVersionsMeta.getVersion()"
 because the return value of 
"org.apache.hadoop.ozone.om.snapshot.OmSnapshotLocalDataManager$ReadableOmSnapshotLocalDataMetaProvider.getMeta()"
 is null{code}
 

The definition of getMeta() is:
{code:java}
public synchronized SnapshotVersionsMeta getMeta() throws IOException {
      if (closed) {
        throw new IOException("Resource has already been closed.");
      }
      return meta;
    }
{code}
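
A minimal sketch of how this surfaces: getMeta() only guards against a closed provider, so a null {{meta}} is handed straight to the caller, and a chained {{getMeta().getVersion()}} then throws the NullPointerException seen in the log. The classes below ({{GetMetaNullSketch}}, {{VersionsMeta}}, {{MetaProvider}}) are hypothetical stand-ins, not the real Ozone classes, and the guarded variant is one possible way a caller could report the condition, not a proposed fix. Why {{meta}} is null after decommissioning is not established here.

```java
import java.io.IOException;

/** Hypothetical stand-ins for illustration only; not the real Ozone classes. */
public class GetMetaNullSketch {

  /** Stand-in for SnapshotVersionsMeta; only the version is modeled. */
  static final class VersionsMeta {
    private final int version;

    VersionsMeta(int version) {
      this.version = version;
    }

    int getVersion() {
      return version;
    }
  }

  /** Stand-in for the provider: same closed check, same unchecked return. */
  static final class MetaProvider {
    private final VersionsMeta meta;
    private boolean closed;

    MetaProvider(VersionsMeta meta) {
      this.meta = meta;
    }

    synchronized VersionsMeta getMeta() throws IOException {
      if (closed) {
        throw new IOException("Resource has already been closed.");
      }
      return meta; // may be null; nothing here rules that out
    }

    synchronized void close() {
      closed = true;
    }
  }

  /** Unguarded caller: throws NullPointerException when getMeta() returns null. */
  static int unguardedVersion(MetaProvider provider) throws IOException {
    return provider.getMeta().getVersion();
  }

  /** Guarded caller: reports missing metadata as a descriptive IOException. */
  static int guardedVersion(MetaProvider provider) throws IOException {
    VersionsMeta meta = provider.getMeta();
    if (meta == null) {
      throw new IOException("Snapshot local data metadata is missing.");
    }
    return meta.getVersion();
  }

  public static void main(String[] args) {
    MetaProvider missing = new MetaProvider(null);
    try {
      unguardedVersion(missing);
    } catch (NullPointerException | IOException e) {
      System.out.println("unguarded caller: " + e.getClass().getSimpleName());
    }
    try {
      guardedVersion(missing);
    } catch (IOException e) {
      System.out.println("guarded caller: " + e.getMessage());
    }
  }
}
```

The guard only makes the failure easier to read; the copy from the snapshot would still fail until the cause of the missing metadata is addressed.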
 

Upstream PR where these changes went in:
https://github.infra.cloudera.com/CDH/ozone/commit/1dda9abfa979f3282d3355e154ce025f76d48665



> Data integrity is missing for the snapshot after de-commissioning Leader OM 
> node. Below are the steps followed:
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-14697
>                 URL: https://issues.apache.org/jira/browse/HDDS-14697
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Manager
>    Affects Versions: 2.0.0
>            Reporter: Kodali Bhavya Sree
>            Priority: Critical
>


