[
https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035970#comment-17035970
]
Li Cheng commented on HDDS-3004:
--------------------------------
After I bounced the leader OM one more time, the new leader was still not
ready more than 5 hours later.
2020-02-13 11:32:00,707 [IPC Server handler 165 on 9862]
INFO org.apache.hadoop.ipc.Server: IPC Server handler 165 on 9862, call Call#49
Retry#16 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
from 9.134.50.210:58760
org.apache.hadoop.ozone.om.exceptions.OMLeaderNotReadyException:
om3@group-02A030565101 is in LEADER state but not ready yet.
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.processReply(OzoneManagerRatisServer.java:177)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:136)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:162)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:118)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
at
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2020-02-13 14:53:10,393 [IPC Server handler 174 on 9862]
INFO org.apache.hadoop.ipc.Server: IPC Server handler 174 on 9862, call Call#99
Retry#16 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
from 9.134.50.210:59780
org.apache.hadoop.ozone.om.exceptions.OMLeaderNotReadyException:
om3@group-02A030565101 is in LEADER state but not ready yet.
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.processReply(OzoneManagerRatisServer.java:177)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:136)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:162)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:118)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
at
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
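For reference, the interval between the two log entries above can be computed directly from their timestamps. This is a minimal Python sketch; the helper name `elapsed_between` is illustrative and not part of Ozone. The two excerpts alone span roughly 3 hours 21 minutes, so they cover only part of the outage window described above.

```python
from datetime import datetime, timedelta

# log4j-style timestamp as it appears in the OM log: comma before the milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_between(first: str, last: str) -> timedelta:
    """Return the time elapsed between two OM log timestamps."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(last, LOG_TIME_FORMAT)
    return end - start


# Timestamps copied from the two OMLeaderNotReadyException entries above.
gap = elapsed_between("2020-02-13 11:32:00,707", "2020-02-13 14:53:10,393")
print(gap)  # 3:21:09.686000
```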
No S3 writes succeed.
[root@VM_50_210_centos ~]# ~/.local/bin/aws s3api --endpoint
http://localhost:9878 put-object --bucket ozone-test-reproduce-123 --key
test.in --body test.txt
An error occurred (500) when calling the PutObject operation (reached max
retries: 4): Internal Server Error
[root@VM_50_210_centos ~]# ~/.local/bin/aws s3api --endpoint
http://localhost:9878 put-object --bucket ozone-test-reproduce-123 --key
test.in --body test.txt
An error occurred (500) when calling the PutObject operation (reached max
retries: 4): Internal Server Error
> S3 gateway write failure when leader OM is down and OM HA is enabled
> --------------------------------------------------------------------
>
> Key: HDDS-3004
> URL: https://issues.apache.org/jira/browse/HDDS-3004
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: om
> Affects Versions: 0.4.0
> Reporter: Li Cheng
> Priority: Major
>
> I use the S3 gateway to keep writing data to a specific S3 gateway endpoint.
> After the writer starts to work, I kill the OM process on the OM leader host.
> After that, the S3 gateway never accepts writes again and keeps reporting
> InternalError for all newly written keys.
> Process Process-488:
> S3UploadFailedError: Failed to upload ./20191204/file1056.dat to
> ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-489:
> S3UploadFailedError: Failed to upload ./20191204/file9631.dat to
> ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-490:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-491:
> S3UploadFailedError: Failed to upload ./20191204/file4220.dat to
> ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-492:
> S3UploadFailedError: Failed to upload ./20191204/file5523.dat to
> ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-493:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> That's a partial list; note that all keys are different. I also tried
> re-enabling the OM process on the previous leader OM, but it doesn't help
> since the leader has changed. Partial OM logs are attached below:
> 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO
> org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859
> Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the
> leader. Suggested leader is OM:om2.
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO
> org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864
> Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the
> leader. Suggested leader is OM:om2.
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO
> org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869
> Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the
> leader. Suggested leader is OM:om2.
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>
>
> The ozone-site.xml config used to enable OM HA is attached as well:
> <property>
> <name>ozone.om.service.ids</name>
> <value>OMHA</value>
> </property>
> <property>
> <name>ozone.om.nodes.OMHA</name>
> <value>om1,om2,om3</value>
> </property>
> <property>
> <name>ozone.om.node.id</name>
> <value>om1</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om1</name>
> <value>9.134.50.210:9862</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om2</name>
> <value>9.134.51.215:9862</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om3</name>
> <value>9.134.51.25:9862</value>
> </property>
> <property>
> <name>ozone.om.ratis.enable</name>
> <value>true</value>
> </property>
> <property>
> <name>ozone.enabled</name>
> <value>true</value>
> <tag>OZONE, REQUIRED</tag>
> <description>
> Status of the Ozone Object Storage service is enabled.
> Set to true to enable Ozone.
> Set to false to disable Ozone.
> Unless this value is set to true, Ozone services will not be started in
> the cluster.
> Please note: By default ozone is disabled on a hadoop cluster.
> </description>
> </property>
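As a quick sanity check on the HA settings quoted above, the node list and per-node addresses can be resolved from the property names. This is a rough Python sketch, not Ozone code: the function name `om_ha_addresses` is hypothetical, the properties are wrapped in a `<configuration>` root that the excerpt does not show, and a single service id is assumed.

```python
import xml.etree.ElementTree as ET

# HA properties copied from the quoted config; the <configuration> root is an
# assumption, since the excerpt only shows the individual <property> elements.
SITE_XML = """
<configuration>
  <property><name>ozone.om.service.ids</name><value>OMHA</value></property>
  <property><name>ozone.om.nodes.OMHA</name><value>om1,om2,om3</value></property>
  <property><name>ozone.om.address.OMHA.om1</name><value>9.134.50.210:9862</value></property>
  <property><name>ozone.om.address.OMHA.om2</name><value>9.134.51.215:9862</value></property>
  <property><name>ozone.om.address.OMHA.om3</name><value>9.134.51.25:9862</value></property>
</configuration>
"""


def om_ha_addresses(xml_text: str) -> dict:
    """Map each OM node id to its RPC address (assumes exactly one service id)."""
    root = ET.fromstring(xml_text.strip())
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    service_id = props["ozone.om.service.ids"]
    nodes = [n.strip() for n in props[f"ozone.om.nodes.{service_id}"].split(",")]
    # A KeyError here means a node is listed without a matching address property.
    return {node: props[f"ozone.om.address.{service_id}.{node}"] for node in nodes}


addresses = om_ha_addresses(SITE_XML)
for node, address in addresses.items():
    print(node, address)
```

A missing `ozone.om.address.<service>.<node>` entry surfaces as a `KeyError`, which makes an incomplete node list easy to spot before starting the cluster.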