[
https://issues.apache.org/jira/browse/HDDS-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836616#comment-17836616
]
Siddhant Sangwan commented on HDDS-10652:
-----------------------------------------
Based on the stack trace, I suspected a problem in the following code in the
executePutBlock method of ECBlockOutputStream.java:
{code}
//Reverse Traversal as all parity will have checksumBytes
for (int i = blockData.length - 1; i >= 0; i--) {
  BlockData bd = blockData[i];
  if (bd == null) {
    continue;
  }
  List<ChunkInfo> chunks = bd.getChunks();
  if (chunks != null && chunks.size() > 0 && chunks.get(0)
      .hasStripeChecksum()) {
    checksumBlockData = bd;
    break;
  }
}
{code}
chunks could be null, or chunks.size() could be 0, or hasStripeChecksum() could
be returning false. To pinpoint which condition is failing, I started
tracking another container, #1006, which had the same exception and also had
more helpful logs available.
{code:java}
2024-04-02 09:16:46,676 INFO
[ContainerReplicationThread-1]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator:
Block Data for: conID: 1006 locID: 113750153625601083 bcsId: 0 replica Index:
5 block length: 268435456 block group length: 805306368 chunk list:
chunkNum: 1 length: 1048576 offset: 0
chunkNum: 2 length: 1048576 offset: 1048576
chunkNum: 3 length: 1048576 offset: 2097152
...
{code}
This was present in the DN logs for container #1006. This container had one
missing index, index 4. The above log message was present for all indices
except 4, and all of them had chunks like those shown in the sample above. This
proves that chunks is not null and chunks.size() is greater than 0, which means
chunks.get(0).hasStripeChecksum() returned false.
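The elimination above can be sketched as a small diagnostic that names the
first sub-condition that fails for one replica's chunk list. This is only an
illustration: Chunk is a hypothetical stand-in for the ChunkInfo proto, and
diagnose() is not part of Ozone.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StripeChecksumDiagnosis {

  /** Hypothetical stand-in for the ChunkInfo proto; not the Ozone class. */
  static final class Chunk {
    private final byte[] stripeChecksum; // null models "field not set"

    Chunk(byte[] stripeChecksum) {
      this.stripeChecksum = stripeChecksum;
    }

    boolean hasStripeChecksum() {
      return stripeChecksum != null;
    }
  }

  /** Returns which of the three sub-conditions fails first, or "ok". */
  static String diagnose(List<Chunk> chunks) {
    if (chunks == null) {
      return "chunks is null";
    }
    if (chunks.isEmpty()) {
      return "chunks is empty";
    }
    if (!chunks.get(0).hasStripeChecksum()) {
      return "first chunk has no stripeChecksum";
    }
    return "ok";
  }

  public static void main(String[] args) {
    // Case matching the logs: chunks exist, but no stripe checksum is set.
    List<Chunk> fromLogs = Arrays.asList(new Chunk(null), new Chunk(null));
    System.out.println(diagnose(fromLogs));
    System.out.println(diagnose(null));
    System.out.println(diagnose(Collections.<Chunk>emptyList()));
    System.out.println(diagnose(Arrays.asList(new Chunk(new byte[] {1, 2}))));
  }
}
```

Logging the result of such a check per replica index before the exception is
thrown would make the failing condition visible directly in the DN logs.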
I think hasStripeChecksum is returning false because compatibility is broken
between the old and new Ozone versions being tested here.
In the new version, the ChunkInfo proto has a stripeChecksum field:
{code:java}
message ChunkInfo {
  required string chunkName = 1;
  required uint64 offset = 2;
  required uint64 len = 3;
  repeated KeyValue metadata = 4;
  required ChecksumData checksumData = 5;
  optional bytes stripeChecksum = 6;
}
{code}
But the old version does not have this field:
{code}
message ChunkInfo {
  required string chunkName = 1;
  required uint64 offset = 2;
  required uint64 len = 3;
  repeated KeyValue metadata = 4;
  required ChecksumData checksumData = 5;
}
{code}
This would lead to ChunkInfo#hasStripeChecksum() returning false for chunks
written by the old version, unless we're setting the field somewhere else to
maintain compatibility between the two versions.
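To illustrate the suspected mechanism, here is a minimal model of proto2
optional-field presence. It does not use protobuf: the "wire" is just a map
keyed by field number (1 and 6 are taken from the ChunkInfo definitions above),
and the writer and reader methods are hypothetical names for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OptionalFieldPresence {

  static final int CHUNK_NAME = 1;      // field number from the proto above
  static final int STRIPE_CHECKSUM = 6; // only exists in the new schema

  /** Old writer: has no field 6 in its schema, so it never emits it. */
  static Map<Integer, byte[]> writeWithOldSchema(String chunkName) {
    Map<Integer, byte[]> wire = new HashMap<>();
    wire.put(CHUNK_NAME, chunkName.getBytes(StandardCharsets.UTF_8));
    return wire;
  }

  /** New writer: emits field 6 in addition to the old fields. */
  static Map<Integer, byte[]> writeWithNewSchema(String chunkName,
      byte[] stripeChecksum) {
    Map<Integer, byte[]> wire = writeWithOldSchema(chunkName);
    wire.put(STRIPE_CHECKSUM, stripeChecksum);
    return wire;
  }

  /** New reader: presence is true only if the field was actually written. */
  static boolean hasStripeChecksum(Map<Integer, byte[]> wire) {
    return wire.containsKey(STRIPE_CHECKSUM);
  }

  public static void main(String[] args) {
    Map<Integer, byte[]> preUpgrade = writeWithOldSchema("chunk_1");
    Map<Integer, byte[]> postUpgrade =
        writeWithNewSchema("chunk_1", new byte[] {1, 2, 3, 4});

    System.out.println(hasStripeChecksum(preUpgrade));  // false
    System.out.println(hasStripeChecksum(postUpgrade)); // true
  }
}
```

If this is what is happening, every block written before the upgrade would
fail the hasStripeChecksum() check regardless of how many replicas are
available.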
> [Upgrade][EC] Reconstruction failing with "java.io.IOException: None of the
> block data have checksum"
> -----------------------------------------------------------------------------------------------------
>
> Key: HDDS-10652
> URL: https://issues.apache.org/jira/browse/HDDS-10652
> Project: Apache Ozone
> Issue Type: Bug
> Components: EC, ECOfflineRecovery
> Reporter: Pratyush Bhatt
> Assignee: Siddhant Sangwan
> Priority: Major
>
> {color:#172b4d}*Upgrade versions:*
> Pre upgrade hash:
> [https://github.com/apache/ozone/commit/6ee6c357678676661ebb3181a56622c79b487bc1]
> Post upgrade Hash:
> [https://github.com/apache/ozone/commit/46b6f3def1d84ca769affb4d3f0d84dece6e8567]
> {color}{color:#172b4d}*Scenario:*
> Write an EC file (5 GB) with the RS-3-2-1024K policy (in this case) before
> the upgrade. After the upgrade, shut down either 2 parity nodes (this case)
> or 2 data nodes, as the policy tolerates 2 DN failures. Check whether
> reconstruction happens after some time.
> *Observed Behavior:*
> 1. Data was successfully written pre-upgrade using Freon.
> File name:
> _o3://ozone1711558189/ec-construct-vol/ec-construct-buck/ec-construction/0_
> 2. Post upgrade, stop two of the DNs, in this case the parity nodes that we
> obtained from one of the containers storing the above file's
> data.{color}
> {code:java}
> ozone admin container info 1004 --json
> 2024-03-27 21:35:15,065|INFO|MainThread|machine.py:232 -
> run()||GUID=183f2d10-e3a7-407f-adb5-b87f3e3af53b|Exit Code: 0
> 2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:723 -
> find_ec_data_parity_hosts()|parity hosts: ['DN-4', 'DN-3']
> 2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:724 -
> find_ec_data_parity_hosts()|data hosts: ['DN-8', 'DN-5', 'DN-1'] {code}
> {code:java}
> 2024-03-27 21:35:15,311|INFO|MainThread|cm_apilib.py:1214 -
> stopComponent()|Initiating stop of OZONE_DATANODE at host DN-4
> 2024-03-27 21:35:15,349|INFO|MainThread|cm_apilib.py:1218 -
> stopComponent()|Command name = Stop , ID = 2860
> 2024-03-27 21:35:15,580|INFO|MainThread|cm_apilib.py:1214 -
> stopComponent()|Initiating stop of OZONE_DATANODE at host DN-3
> 2024-03-27 21:35:15,609|INFO|MainThread|cm_apilib.py:1218 -
> stopComponent()|Command name = Stop , ID = 2862 {code}
> {color:#172b4d}Nodes DN-3 and DN-4 are stopped.
> 3. Read the file's data (online reconstruction) and compute the checksum.
> That matched.
> 4. Wait for reconstruction to happen. The test waited for 20 minutes, but
> only 3 DNs were present even after that:{color}
> {code:java}
> ['DN-5', 'DN-1', 'DN-8']{code}
> In fact, even after 10 hours (at the time of writing), there are still only
> 3 DNs:
> {code:java}
> date
> Thu Mar 28 08:39:16 UTC 2024
> ozone admin container info 1004 --json
> {
> "containerInfo" : {
> "state" : "CLOSED",
> "stateEnterTime" : "2024-03-27T18:43:51.934Z",
> "replicationConfig" : {
> "data" : 3,
> "parity" : 2,
> "ecChunkSize" : 1048576,
> "codec" : "RS",
> "requiredNodes" : 5,
> "replicationType" : "EC"
> },
> "usedBytes" : 1342177280,
> "numberOfKeys" : 5,
> "lastUsed" : "2024-03-28T08:39:24.535189Z",
> "owner" : "om1",
> "containerID" : 1004,
> "deleteTransactionId" : 0,
> "sequenceId" : 0,
> "deleted" : false,
> "open" : false
> },
> "pipeline" : {
> "id" : {
> "id" : "73532c14-40ac-4924-9353-2f18ab0d63f2"
> },
> "replicationConfig" : {
> "data" : 3,
> "parity" : 2,
> "ecChunkSize" : 1048576,
> "codec" : "RS",
> "requiredNodes" : 5,
> "replicationType" : "EC"
> },
> "nodesInOrder" : [ {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> }, {
> "level" : 0,
> "cost" : 0,
> "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "ipAddress" : "10.140.40.9",
> "hostName" : "DN-1",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -1387859873,
> "networkLocation" : "/default",
> "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "numOfLeaves" : 1
> }, {
> "level" : 0,
> "cost" : 0,
> "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "ipAddress" : "10.140.137.128",
> "hostName" : "DN-8",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : 1098159392,
> "networkLocation" : "/default",
> "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "numOfLeaves" : 1
> } ],
> "creationTimestamp" : "2024-03-28T08:39:24.480Z",
> "stateEnterTime" : "2024-03-28T08:39:24.545517Z",
> "leaderNode" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> },
> "firstNode" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> },
> "closestNode" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> },
> "allocationTimeout" : false,
> "healthy" : true,
> "pipelineState" : "ALLOCATED",
> "nodes" : [ {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> }, {
> "level" : 0,
> "cost" : 0,
> "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "ipAddress" : "10.140.40.9",
> "hostName" : "DN-1",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -1387859873,
> "networkLocation" : "/default",
> "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "numOfLeaves" : 1
> }, {
> "level" : 0,
> "cost" : 0,
> "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "ipAddress" : "10.140.137.128",
> "hostName" : "DN-8",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : 1098159392,
> "networkLocation" : "/default",
> "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "numOfLeaves" : 1
> } ],
> "empty" : false,
> "type" : "EC"
> },
> "replicas" : [ {
> "containerID" : 1004,
> "state" : "CLOSED",
> "datanodeDetails" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "ipAddress" : "10.140.37.12",
> "hostName" : "DN-5",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -662262523,
> "networkLocation" : "/default",
> "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
> "numOfLeaves" : 1
> },
> "placeOfBirth" : "6179347f-5824-41d4-b722-f1dbc5f14880",
> "sequenceId" : 0,
> "keyCount" : 5,
> "bytesUsed" : 1342177280,
> "replicaIndex" : 2
> }, {
> "containerID" : 1004,
> "state" : "CLOSED",
> "datanodeDetails" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "ipAddress" : "10.140.40.9",
> "hostName" : "DN-1",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : -1387859873,
> "networkLocation" : "/default",
> "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "numOfLeaves" : 1
> },
> "placeOfBirth" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
> "sequenceId" : 0,
> "keyCount" : 5,
> "bytesUsed" : 1342177280,
> "replicaIndex" : 3
> }, {
> "containerID" : 1004,
> "state" : "CLOSED",
> "datanodeDetails" : {
> "level" : 0,
> "cost" : 0,
> "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "ipAddress" : "10.140.137.128",
> "hostName" : "DN-8",
> "ports" : [ {
> "name" : "HTTPS",
> "value" : 9883
> }, {
> "name" : "CLIENT_RPC",
> "value" : 9864
> }, {
> "name" : "REPLICATION",
> "value" : 9886
> }, {
> "name" : "RATIS",
> "value" : 9858
> }, {
> "name" : "RATIS_ADMIN",
> "value" : 9857
> }, {
> "name" : "RATIS_SERVER",
> "value" : 9856
> }, {
> "name" : "STANDALONE",
> "value" : 9859
> } ],
> "setupTime" : 0,
> "persistedOpState" : "IN_SERVICE",
> "persistedOpStateExpiryEpochSec" : 0,
> "initialVersion" : 0,
> "currentVersion" : 1,
> "decommissioned" : false,
> "maintenance" : false,
> "signature" : 1098159392,
> "networkLocation" : "/default",
> "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
> "numOfLeaves" : 1
> },
> "placeOfBirth" : "711656cf-a99e-4b2c-8c35-f015ee94889c",
> "sequenceId" : 0,
> "keyCount" : 5,
> "bytesUsed" : 1342177280,
> "replicaIndex" : 1
> } ]
> } {code}
> Checked the SCM logs; SCM is still sending reconstructECContainersCommand:
> {code:java}
> 2024-03-28 08:36:56,748 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager:
> Sending command [reconstructECContainersCommand: containerID: 1004,
> replicationConfig: EC{rs-3-2-1024k}, sources:
> [ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128) replicaIndex: 1,
> 6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12) replicaIndex: 2,
> d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9) replicaIndex: 3],
> targets: [572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130),
> 711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)], missingIndexes:
> [4, 5]] for container ContainerInfo{id=#1004, state=CLOSED,
> stateEnterTime=2024-03-27T18:43:51.934Z,
> pipelineID=PipelineID=53f5587f-9e6c-465d-a0cb-b82d10c227d3, owner=om1} to
> 572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130) with datanode
> deadline 1711615886747 and scm deadline 1711615916747 {code}
> Checked one of the target DNs, DN-7; it is logging the warnings below.
> {code:java}
> 2024-03-28 08:37:14,982 WARN
> [ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
> FAILED reconstructECContainersCommand: containerID=1004,
> replication=rs-3-2-1024k, missingIndexes=[4, 5],
> sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128),
> 2=6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12),
> 3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9)},
> targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130),
> 5=711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)} after 10639 ms
> java.io.IOException: None of the block data have checksum which means
> 2(parity)+1 blocks are not present
> at
> org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:156)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:325)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:171)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
> at
> org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> 2024-03-28 08:37:14,982 WARN
> [ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.replication.ReplicationSupervisor:
> Failed FAILED reconstructECContainersCommand: containerID=1004,
> replication=rs-3-2-1024k, missingIndexes=[4, 5],
> sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128),
> 2=6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12),
> 3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9)},
> targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130),
> 5=711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)} {code}
> *Expected Behavior:* Reconstruction should have happened.
> Note: This is reproducible fairly consistently.
> cc: [~siddhant]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)