[ https://issues.apache.org/jira/browse/HDDS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pratyush Bhatt updated HDDS-10626:
----------------------------------
Description:
When I run lease recovery on multiple files during a rolling restart, the OM
terminates with exit status 1 after the Ozone Managers (OMs) are restarted.
{code:java}
2024-03-31 09:47:01,866 ERROR [om72-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Terminating with exit status 1: Request cmdType: RecoverLease
traceID: ""
clientId: "client-433C04E5C8CC"
userInfo {
  userName: "hdfs@XYZ"
  remoteAddress: "xx.yy.ww.zz"
  hostName: "vb1307.xyz.com"
}
version: 3
layoutVersion {
  version: 6
}
RecoverLeaseRequest {
  volumeName: "hsyncvol"
  bucketName: "hsyncbuck"
  keyName: "hsync/File_24.txt"
  force: false
}
failed with exception
java.lang.NullPointerException: SecretKey client must have been initialized already.
	at java.util.Objects.requireNonNull(Objects.java:228)
	at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient.getCurrentSecretKey(DefaultSecretKeySignerClient.java:70)
	at org.apache.hadoop.hdds.security.token.ShortLivedTokenSecretManager.createPassword(ShortLivedTokenSecretManager.java:47)
	at org.apache.hadoop.hdds.security.token.OzoneBlockTokenSecretManager.generateToken(OzoneBlockTokenSecretManager.java:70)
	at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.updateBlockInfo(OMRecoverLeaseRequest.java:281)
	at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.doWork(OMRecoverLeaseRequest.java:264)
	at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.validateAndUpdateCache(OMRecoverLeaseRequest.java:156)
	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
	at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
	at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
I have seen this 2-3 times. This time I was able to reproduce it while lease
recovery was running during the rolling restart (RR) phase.
cc: [~ashishk] [~weichiu]
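
For anyone triaging: the top frame of the trace is java.util.Objects.requireNonNull, and the exception text is the message passed to it. The sketch below only illustrates that failure shape, under the assumption that a key field is still null when it is read. LazyKeyClient, start(), getCurrentKey() and tryReadKey() are hypothetical names for illustration; this is not the Ozone DefaultSecretKeySignerClient code and it does not claim to show the root cause.

```java
import java.util.Objects;

// Hypothetical stand-in for a client whose key only exists after start().
// NOT the Ozone implementation; all names here are illustrative.
final class LazyKeyClient {
    private volatile String currentKey; // stays null until start() has run

    void start() {
        currentKey = "key-1";
    }

    String getCurrentKey() {
        // Same JDK call that appears as the top frame of the reported trace.
        return Objects.requireNonNull(currentKey,
            "SecretKey client must have been initialized already.");
    }
}

public class LeaseRecoveryNpeSketch {
    // Returns the NPE message if the key cannot be read, or null on success.
    static String tryReadKey(LazyKeyClient client) {
        try {
            client.getCurrentKey();
            return null;
        } catch (NullPointerException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        LazyKeyClient client = new LazyKeyClient();
        // Read before start(): fails with the message seen in the OM log.
        System.out.println("before start: " + tryReadKey(client));
        client.start();
        // Read after start(): succeeds, so no message is returned.
        System.out.println("after start: " + tryReadKey(client));
    }
}
```

In the sketch the caller catches the exception. In the reported log the exception is raised on the OMStateMachineApplyTransactionThread and the OM terminates with exit status 1.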
> [LeaseRecovery] OM shuts down with "SecretKey client must have been
> initialized already"
> ----------------------------------------------------------------------------------------
>
> Key: HDDS-10626
> URL: https://issues.apache.org/jira/browse/HDDS-10626
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM
> Reporter: Pratyush Bhatt
> Priority: Major
>