ChenSammi opened a new pull request, #6735:
URL: https://github.com/apache/ozone/pull/6735

   ## What changes were proposed in this pull request?
   
   Currently, OM does not handle LeaderSteppingDownException at all, which 
leads to the following NullPointerException:
   
   ```
   2024-05-28 07:00:47,206 [IPC Server handler 67 on default port 9862] WARN ipc.Server: IPC Server handler 67 on default port 9862, call Call#25821 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from scm3.org:53692 / 172.25.0.118:53692
   java.lang.NullPointerException
           at org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:69)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:527)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:285)
           at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:283)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:263)
           at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:252)
           at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:226)
           at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
           at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
           at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
           at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
           at java.base/java.security.AccessController.doPrivileged(Native Method)
           at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
   ```
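
The failure mode in the trace above can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in and not the actual Ratis or Ozone code: `Reply` plays the role of a Raft client reply, `NotLeaderError` plays the role of the retriable "not leader" error, and `toResponse` plays the role of the conversion step. The assumption is that a failed reply carries no payload, so every known failure reason has to be checked before the payload is parsed.

```java
// Hypothetical stand-ins only; none of these are the real Ratis or Ozone classes.
final class SteppingDownSketch {

  /** Stand-in for a Raft client reply: a payload on success, a reason on failure. */
  static final class Reply {
    final boolean success;
    final String payload;            // null when the request failed
    final String steppingDownReason; // non-null only when the leader is stepping down

    Reply(boolean success, String payload, String steppingDownReason) {
      this.success = success;
      this.payload = payload;
      this.steppingDownReason = steppingDownReason;
    }
  }

  /** Stand-in for the retriable error that tells a client to fail over. */
  static final class NotLeaderError extends Exception {
    NotLeaderError(String message) {
      super(message);
    }
  }

  /** Converts a reply into a response, checking failure reasons first. */
  static String toResponse(Reply reply) throws NotLeaderError {
    if (!reply.success) {
      if (reply.steppingDownReason != null) {
        // Surface "stepping down" as a retriable error instead of
        // falling through to the payload parsing below.
        throw new NotLeaderError(reply.steppingDownReason);
      }
      // Fail loudly for any failure reason this sketch does not model.
      throw new IllegalStateException("request failed for an unrecognized reason");
    }
    // Dereferencing the payload is only safe once failures are ruled out;
    // without the checks above, a failed reply would throw an NPE here.
    return reply.payload.trim();
  }

  public static void main(String[] args) throws NotLeaderError {
    System.out.println(toResponse(new Reply(true, " created ", null)));
    try {
      toResponse(new Reply(false, null, "om3 is stepping down"));
    } catch (NotLeaderError e) {
      System.out.println("retriable: " + e.getMessage());
    }
  }
}
```

Mapping the stepping-down case onto an error the client already treats as retriable means no client-side change is needed in this sketch: the caller simply fails over, as it would for any other "not leader" reply.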
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-10918
   
   ## How was this patch tested?
   
   Manual test:
   1. Build Ozone with the patch and start an Ozone OM HA cluster.
   2. Start a command that transfers OM leadership repeatedly.
   3. Start an "ozone freon rk" command.
   4. Watch the OM logs: the NPE stack trace above no longer appears.
   5. Watch the freon output: there are no more NPE stack traces. Instead, 
the following stack trace appears:
   
   ```
   2024-05-28 08:12:44,885 [pool-2-thread-8] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException): om3@group-D66704EFC61C is stepping down
       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:501)
       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:286)
       at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:284)
       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:264)
       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:252)
       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:226)
       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
       at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
       at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
       at java.base/java.security.AccessController.doPrivileged(Native Method)
       at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
   , while invoking $Proxy24.submitRequest over nodeId=om1,nodeAddress=om1:9862 after 3 failover attempts. Trying to failover immediately. Current retry count: 3.
   2024-05-28 08:12:44,886 [pool-2-thread-8] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #3126 $Proxy24.submitRequest over nodeId=om1,nodeAddress=om1:9862
   ```
   
   

