Soumitra Sulav created HDDS-14761:
-------------------------------------
Summary: EC replica read fails at most once with GetBlock error
after offline reconstruction
Key: HDDS-14761
URL: https://issues.apache.org/jira/browse/HDDS-14761
Project: Apache Ozone
Issue Type: Bug
Components: OM
Affects Versions: 2.1.0
Reporter: Soumitra Sulav
An EC replica read fails with a GetBlock error after offline reconstruction
when the client reads container replica info from the cache in SCMClient.
Steps done:
# Created an EC key with {{rs-3-2-1024k}}.
# Ran key info/get to cache the container replica info in the SCM client.
# Confirmed from the SCM audit read logs that the calls were not going to SCM
and were served from the cache.
# Brought down one of the replicas and waited for EC offline reconstruction.
# Ran the key get again; it failed with an exception at GetBlock.
# Observed an SCM call {{op=GET_CONTAINER_WITH_PIPELINE_BATCH}}, which indicates
that the cache was refreshed.
# The next key get command succeeds.
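The sequence in the steps above can be modeled with a small toy simulation. This is only a sketch under assumptions: {{StaleCacheSketch}}, {{locate}} and {{readOnce}} are hypothetical names that do not exist in Ozone, the maps stand in for SCM and the client-side cache, and the real read path may behave differently. It only illustrates why a stale cache entry would produce exactly one failed read before the refreshed locations are used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Ozone APIs.
class StaleCacheSketch {
    // Hypothetical stand-in for SCM's authoritative replica locations.
    final Map<Long, List<String>> scm = new HashMap<>();
    // Hypothetical stand-in for the client-side cache of those locations.
    final Map<Long, List<String>> cache = new HashMap<>();
    // Counts lookups that reach "SCM" (cache misses).
    int scmCalls = 0;

    // Return cached locations, asking "SCM" only on a cache miss.
    List<String> locate(long containerId) {
        List<String> nodes = cache.get(containerId);
        if (nodes == null) {
            scmCalls++;
            nodes = scm.get(containerId);
            cache.put(containerId, nodes);
        }
        return nodes;
    }

    // One read attempt: fails if any located node is down, and drops the
    // cache entry so that the next attempt re-fetches locations.
    boolean readOnce(long containerId, Set<String> liveNodes) {
        List<String> nodes = locate(containerId);
        if (liveNodes.containsAll(nodes)) {
            return true;
        }
        cache.remove(containerId);
        return false;
    }

    // Replays the reported sequence; returns the outcome of each key get.
    static List<Boolean> simulate() {
        StaleCacheSketch s = new StaleCacheSketch();
        List<Boolean> results = new ArrayList<>();

        s.scm.put(1003L, List.of("node-5", "node-4", "node-7", "node-3", "node-6"));
        Set<String> live = Set.of("node-3", "node-4", "node-5", "node-6", "node-7");
        results.add(s.readOnce(1003L, live)); // first get populates the cache

        // node-3 goes down; reconstruction places the replica on node-1.
        s.scm.put(1003L, List.of("node-5", "node-4", "node-1", "node-7", "node-6"));
        live = Set.of("node-1", "node-4", "node-5", "node-6", "node-7");
        results.add(s.readOnce(1003L, live)); // cache still lists node-3
        results.add(s.readOnce(1003L, live)); // cache was refreshed
        return results;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [true, false, true]
    }
}
```

In this model the second read fails only because the cached entry still lists node-3, and the failure itself triggers the refresh. That matches the single {{GET_CONTAINER_WITH_PIPELINE_BATCH}} call seen in the audit log.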
*CLI commands:*
{code:java}
# ozone admin container info 1003
Container id: 1003
Pipeline id: 716aecb6-f1fc-4e50-b603-c547177adb7c
Container State: CLOSED
Datanodes: [7752d3a1-06b3-42fb-bdf7-2886d90b4bdb/node-5.vpc.domain.com,
5d343a9a-5649-44a7-80f1-9c76f16d611e/node-4.vpc.domain.com,
eba28c30-aba2-4f9c-99b2-86fbb4828cae/node-7.vpc.domain.com,
f6b6e9e8-7ca7-4612-9a1c-ad1a9477c7e4/node-3.vpc.domain.com,
43ee72fb-a6c2-4e8c-a36e-8613d8e0fd38/node-6.vpc.domain.com]
# shutdown replica node-3.vpc.domain.com
# ozone admin container info 1003
Container id: 1003
Pipeline id: 61dd8be1-5959-4c06-8fb8-79db365b7682
Container State: CLOSED
Datanodes: [7752d3a1-06b3-42fb-bdf7-2886d90b4bdb/node-5.vpc.domain.com,
5d343a9a-5649-44a7-80f1-9c76f16d611e/node-4.vpc.domain.com,
eba28c30-aba2-4f9c-99b2-86fbb4828cae/node-7.vpc.domain.com,
43ee72fb-a6c2-4e8c-a36e-8613d8e0fd38/node-6.vpc.domain.com]
# EC offline reconstruction happens on node-1.vpc.domain.com
# ozone admin container info 1003
Container id: 1003
Pipeline id: 5a21c624-ada4-4560-bd00-7eb897488e54
Container State: CLOSED
Datanodes: [7752d3a1-06b3-42fb-bdf7-2886d90b4bdb/node-5.vpc.domain.com,
5d343a9a-5649-44a7-80f1-9c76f16d611e/node-4.vpc.domain.com,
b3984801-0289-473a-b1be-3b3f10a27d83/node-1.vpc.domain.com,
eba28c30-aba2-4f9c-99b2-86fbb4828cae/node-7.vpc.domain.com,
43ee72fb-a6c2-4e8c-a36e-8613d8e0fd38/node-6.vpc.domain.com]
# ozone sh key get vol-scmfail-k7k14/buck-scmfail-k7k14/key_5mb_ec_1
download_key_5mb_ec_17
26/03/03 08:08:43 ERROR scm.XceiverClientGrpc: Failed to execute command
GetBlock on the pipeline Pipeline[ Id: f6b6e9e8-7ca7-4612-9a1c-ad1a9477c7e4,
Nodes: f6b6e9e8-7ca7-4612-9a1c-ad1a9477c7e4(node-3.vpc.domain.com/10.65.8.219)
ReplicaIndex: 3, ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
CreationTimestamp2026-03-03T08:08:43.740-08:00[America/Los_Angeles]].
26/03/03 08:08:43 INFO storage.BlockInputStream: Unable to read information for
block conID: 1003 locID: 117883640217601008 bcsId: 0 replicaIndex: null from
pipeline PipelineID=f6b6e9e8-7ca7-4612-9a1c-ad1a9477c7e4:
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2026-03-03 08:08:43,777 | INFO | SCMAudit |
user=om/[email protected] | ip=10.65.14.58 |
op=GET_CONTAINER_WITH_PIPELINE_BATCH {containerIDs=#1003,} | ret=SUCCESS |
# The next key get works because the cache has been invalidated.
{code}