[
https://issues.apache.org/jira/browse/HDDS-11550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krishna Kumar Asawa reassigned HDDS-11550:
------------------------------------------
Assignee: Ashish Kumar
> [HBase Ozone] Leaserecovery fails with "StorageContainerException: Unable to
> find the block"
> --------------------------------------------------------------------------------------------
>
> Key: HDDS-11550
> URL: https://issues.apache.org/jira/browse/HDDS-11550
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Pratyush Bhatt
> Assignee: Ashish Kumar
> Priority: Major
>
> *Test Scenario:*
> 1. Open multiple files in Ozone ~100
> 2. Write some data and perform hsync.
> 3. Perform lease recovery on all files.
> 4. Parallely restart OM leader.
> *Observed behavior:*
> Lease recovery command failed for one of the file
> {code:java}
> 2024-10-07 10:32:33,947|INFO|Thread-102|machine.py:190 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|RUNNING:
> /opt/cloudera/parcels/CDH/bin/ozone debug recover
> --path=ofs://ozone1728283844/hsyncvol/hsyncbuck/hsync/File_17.txt
> 2024-10-07 10:32:34,544|INFO|Thread-100|machine.py:205 -
> run()||GUID=e2ec2f41-91d2-4cb5-b67b-5dee26443a57|2024-10-07 10:32:34:540 1120
> sec: 365401 operations; 808.5 current ops/sec; est completion in 1 hour 23
> minutes [INSERT: Count=8087, Max=131583, Min=3264, Avg=12369.62, 90=29215,
> 99=65663, 99.9=103359, 99.99=129919]
> 2024-10-07 10:32:38,670|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:38 WARN
> impl.MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-xceiverclientmetrics.properties,hadoop-metrics2.properties
> 2024-10-07 10:32:38,758|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:38 INFO
> impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
> 2024-10-07 10:32:38,758|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:38 INFO
> impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
> 2024-10-07 10:32:38,843|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:38 INFO
> scm.XceiverClientRatis: WatchType ALL_COMMITTED. Majority 2,
> 2024-10-07 10:32:38,926|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:38 INFO
> netty.NettyConfigKeys$DataStream: setTlsConf GrpcTlsConfig0-
> 2024-10-07 10:32:39,413|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:39 INFO
> metrics.MetricRegistries: Loaded MetricRegistries class
> org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl
> 2024-10-07 10:32:40,686|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:40 ERROR
> impl.OrderedAsync: Failed to send request, message=cmdType: FinalizeBlock
> 2024-10-07 10:32:40,686|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|traceID: ""
> 2024-10-07 10:32:40,686|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|containerID: 1051
> 2024-10-07 10:32:40,686|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|datanodeUuid:
> "9d980530-9455-4855-b264-0b011f810aeb"
> 2024-10-07 10:32:40,687|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|encodedToken:
> "UwoCb20SJWNvbklEOiAxMDUxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MDQzNjAYw8bA3aYyKAEoAjCAgICAAToVCOOPkafmyr6oZRC9mtq4-IbouI8BIG9ivUYkaRtyNJ64ih8ZyvB3OOVC6EpiW_VZeI7CmrioEEhERFNfQkxPQ0tfVE9LRU4oY29uSUQ6IDEwNTEgbG9jSUQ6IDExMzc1MDE1MzYyNTYwNDM2MAAAAA"
> 2024-10-07 10:32:40,687|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|version: 3
> 2024-10-07 10:32:40,687|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|finalizeBlock {
> 2024-10-07 10:32:40,687|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|blockID {
> 2024-10-07 10:32:40,687|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|containerID: 1051
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|localID: 113750153625604360
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|blockCommitSequenceId: 0
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|}
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|}
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|, data.size=0
> 2024-10-07 10:32:40,688|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Unable to find the block. BlockID : conID: 1051 locID: 113750153625604360
> bcsId: 0 replicaIndex: null
> 2024-10-07 10:32:40,689|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:373)
> 2024-10-07 10:32:40,689|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:175)
> 2024-10-07 10:32:40,689|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2024-10-07 10:32:40,689|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2024-10-07 10:32:40,689|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:105)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:66)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:147)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:351)
> 2024-10-07 10:32:40,690|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequestWithRetry$5(OrderedAsync.java:210)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:322)
> 2024-10-07 10:32:40,691|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.Optional.ifPresent(Optional.java:159)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$100(GrpcClientProtocolClient.java:300)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:322)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:305)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:468)
> 2024-10-07 10:32:40,692|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> 2024-10-07 10:32:40,693|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:473)
> 2024-10-07 10:32:40,693|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:660)
> 2024-10-07 10:32:40,693|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:647)
> 2024-10-07 10:32:40,693|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> 2024-10-07 10:32:40,693|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> 2024-10-07 10:32:40,694|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2024-10-07 10:32:40,694|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2024-10-07 10:32:40,694|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.lang.Thread.run(Thread.java:750)
> 2024-10-07 10:32:40,694|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|Caused by:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Unable to find the block. BlockID : conID: 1051 locID: 113750153625604360
> bcsId: 0 replicaIndex: null
> 2024-10-07 10:32:40,694|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$12(ContainerStateMachine.java:1070)
> 2024-10-07 10:32:40,695|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2024-10-07 10:32:40,695|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2024-10-07 10:32:40,695|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2024-10-07 10:32:40,695|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2024-10-07 10:32:40,695|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.TaskQueue.lambda$submit$0(TaskQueue.java:133)
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|... 3 more
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:40 WARN
> ozone.BasicRootedOzoneClientAdapterImpl: Failed to execute finalizeBlock
> command
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|java.io.IOException: Failed
> to execute command cmdType: FinalizeBlock
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|containerID: 1051
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|datanodeUuid:
> "9d980530-9455-4855-b264-0b011f810aeb"
> 2024-10-07 10:32:40,696|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|encodedToken:
> "UwoCb20SJWNvbklEOiAxMDUxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MDQzNjAYw8bA3aYyKAEoAjCAgICAAToVCOOPkafmyr6oZRC9mtq4-IbouI8BIG9ivUYkaRtyNJ64ih8ZyvB3OOVC6EpiW_VZeI7CmrioEEhERFNfQkxPQ0tfVE9LRU4oY29uSUQ6IDEwNTEgbG9jSUQ6IDExMzc1MDE1MzYyNTYwNDM2MAAAAA"
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|finalizeBlock {
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|blockID {
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|containerID: 1051
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|localID: 113750153625604360
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|blockCommitSequenceId: 0
> 2024-10-07 10:32:40,697|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|}
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|}
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.scm.XceiverClientSpi.getIOExceptionForSendCommand(XceiverClientSpi.java:156)
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.scm.XceiverClientSpi.sendCommand(XceiverClientSpi.java:149)
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.finalizeBlock(ContainerProtocolCalls.java:326)
> 2024-10-07 10:32:40,698|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.finalizeBlock(BasicRootedOzoneClientAdapterImpl.java:1433)
> 2024-10-07 10:32:40,699|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.fs.ozone.LeaseRecoveryClientDNHandler.getOmKeyLocationInfos(LeaseRecoveryClientDNHandler.java:83)
> 2024-10-07 10:32:40,699|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.fs.ozone.RootedOzoneFileSystem.recoverLeaseTraced(RootedOzoneFileSystem.java:162)
> 2024-10-07 10:32:40,699|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.fs.ozone.RootedOzoneFileSystem.lambda$recoverLease$0(RootedOzoneFileSystem.java:141)
> 2024-10-07 10:32:40,699|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
> 2024-10-07 10:32:40,700|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:159)
> 2024-10-07 10:32:40,700|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.fs.ozone.RootedOzoneFileSystem.recoverLease(RootedOzoneFileSystem.java:140)
> 2024-10-07 10:32:40,700|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.ozone.debug.LeaseRecoverer.call(LeaseRecoverer.java:77)
> 2024-10-07 10:32:40,700|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.ozone.debug.LeaseRecoverer.call(LeaseRecoverer.java:38)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine.executeUserObject(CommandLine.java:2045)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine.access$1500(CommandLine.java:148)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
> 2024-10-07 10:32:40,701|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> picocli.CommandLine.execute(CommandLine.java:2174)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.ozone.debug.OzoneDebug.main(OzoneDebug.java:64)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|Caused by:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Unable to find the block. BlockID : conID: 1051 locID: 113750153625604360
> bcsId: 0 replicaIndex: null
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2024-10-07 10:32:40,702|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.hdds.scm.XceiverClientSpi.sendCommand(XceiverClientSpi.java:139)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|... 21 more
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|Caused by:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Unable to find the block. BlockID : conID: 1051 locID: 113750153625604360
> bcsId: 0 replicaIndex: null
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$12(ContainerStateMachine.java:1070)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.TaskQueue.lambda$submit$0(TaskQueue.java:133)
> 2024-10-07 10:32:40,703|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2024-10-07 10:32:40,704|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|at
> java.lang.Thread.run(Thread.java:750)
> 2024-10-07 10:32:40,872|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|24/10/07 10:32:40 WARN
> scm.XceiverClientGrpc: Failed to execute command GetCommittedBlockLength on
> the pipeline Pipeline[ Id: 53abcc7c-979c-4097-9687-1f8dbb627ccd, Nodes:
> 9d980530-9455-4855-b264-0b011f810aeb(DN-4/11.22.33.44) ReplicaIndex:
> 06974e9b8-eeb9-4047-b0a2-626b2c175eb1(DN-7/22.11.33.44) ReplicaIndex:
> 018b2d32b-faa9-4c6d-9730-8f808679f70f(DN-6/33.11.22.44) ReplicaIndex: 0,
> ReplicationConfig: STANDALONE/THREE, State:OPEN, leaderId:,
> CreationTimestamp2024-10-07T10:32:40.692Z[Etc/UTC]].
> 2024-10-07 10:32:40,872|INFO|Thread-102|machine.py:205 -
> run()||GUID=0fac0928-4730-408c-9f03-4eeedad7b751|Unable to find the block.
> BlockID : conID: 1051 locID: 113750153625604360 bcsId: 0 replicaIndex: null
> {code}
> Checked the SCM leader log(Node 2) for this container ID:
> {code:java}
> 2024-10-07 10:32:40,494 INFO
> [node2-EventQueue-CloseContainerForCloseContainerEventHandler]-org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler:
> Close container Event triggered for container : #1051, current state: OPEN
> 2024-10-07 10:32:40,494 INFO
> [node2-FixedThreadPoolWithAffinityExecutor-9-0]-org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler:
> Moving OPEN container #1051 to CLOSING state, datanode
> 9d980530-9455-4855-b264-0b011f810aeb(DN-4/11.22.33.44) reported UNHEALTHY
> replica with index 0.
> 2024-10-07 10:32:40,496 INFO
> [node2-EventQueue-CloseContainerForCloseContainerEventHandler]-org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler:
> Close container Event triggered for container : #1051, current state: CLOSING
> 2024-10-07 10:32:40,497 INFO
> [node2-EventQueue-CloseContainerForCloseContainerEventHandler]-org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler:
> Close container Event triggered for container : #1051, current state: CLOSING
> 2024-10-07 10:32:40,515 INFO
> [node2-EventQueue-PipelineActionsForPipelineActionHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler:
> Received pipeline action CLOSE for
> PipelineID=e34300bb-044e-4641-b338-072839835604 from datanode
> 18b2d32b-faa9-4c6d-9730-8f808679f70f. Reason : Ratis Transaction failure in
> datanode 18b2d32b-faa9-4c6d-9730-8f808679f70f with role LEADER .Triggering
> pipeline close action. {code}
> Checked the DN-4 (node 4), it shows:
> {code:java}
> 2024-10-07 10:32:40,483 WARN
> [9d980530-9455-4855-b264-0b011f810aeb-ContainerOp-e34300bb-044e-4641-b338-072839835604-7]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler:
> Operation: FinalizeBlock , Trace ID: , Message: Unable to find the block.
> BlockID : conID: 1051 locID: 113750153625604360 bcsId: 0 replicaIndex: null ,
> Result: NO_SUCH_BLOCK , StorageContainerException Occurred.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Unable to find the block. BlockID : conID: 1051 locID: 113750153625604360
> bcsId: 0 replicaIndex: null
> at
> org.apache.hadoop.ozone.container.metadata.DatanodeStoreWithIncrementalChunkList.getCompleteBlockData(DatanodeStoreWithIncrementalChunkList.java:69)
> at
> org.apache.hadoop.ozone.container.metadata.DatanodeStore.getBlockByID(DatanodeStore.java:138)
> at
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlockByID(BlockManagerImpl.java:375)
> at
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlock(BlockManagerImpl.java:277)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleFinalizeBlock(KeyValueHandler.java:611)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:287)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:231)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:355)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:195)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:194)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:505)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$10(ContainerStateMachine.java:973)
> at org.apache.ratis.util.TaskQueue.lambda$submit$0(TaskQueue.java:121)
> at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
> at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750) {code}
> Note: Lease recovery on other files succeeded.
> cc: [~weichiu] [~ashishkr]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]