Li Cheng created HDDS-3004:
------------------------------

             Summary: S3 gateway write failure when leader OM is down and OM HA is enabled
                 Key: HDDS-3004
                 URL: https://issues.apache.org/jira/browse/HDDS-3004
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: om
    Affects Versions: 0.4.0
            Reporter: Li Cheng


I use the S3 gateway to write data continuously to a specific S3 gateway 
endpoint. After the writer starts, I kill the OM process on the OM leader host. 
From then on, the S3 gateway never accepts writes again and keeps reporting 
InternalError for every new key.

Process Process-488:
S3UploadFailedError: Failed to upload ./20191204/file1056.dat to 
ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-489:
S3UploadFailedError: Failed to upload ./20191204/file9631.dat to 
ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-490:
S3UploadFailedError: Failed to upload ./20191204/file7520.dat to 
ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-491:
S3UploadFailedError: Failed to upload ./20191204/file4220.dat to 
ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-492:
S3UploadFailedError: Failed to upload ./20191204/file5523.dat to 
ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-493:
S3UploadFailedError: Failed to upload ./20191204/file7520.dat to 
ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when 
calling the PutObject operation (reached max retries: 4): Internal Server Error

This is a partial list; note that all keys are different. I also tried 
re-enabling the OM process on the previous leader OM, but it doesn't help since 
the leader has changed. Partial OM logs are attached below:
2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO 
org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 
Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest 
from 9.134.50.210:36561
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the 
leader. Suggested leader is OM:om2.
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO 
org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 
Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest 
from 9.134.50.210:36561
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the 
leader. Suggested leader is OM:om2.
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO 
org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 
Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest 
from 9.134.50.210:36561
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the 
leader. Suggested leader is OM:om2.
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
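
To make the repeated entries above easier to triage, here is a minimal Python sketch that pulls the rejecting OM and the suggested leader out of such log text. It only assumes the message wording shown in the log ("OM:<node> is not the leader. Suggested leader is OM:<node>."). The function name and the regular expression are hypothetical helpers for diagnosis, not part of Ozone or its client.

```python
import re

# Matches the OMNotLeaderException message as it appears in the log above.
# The node-name pattern (\w+) is an assumption based on ids like om1/om2.
NOT_LEADER_RE = re.compile(
    r"OM:(?P<current>\w+) is not the leader\. "
    r"Suggested leader is OM:(?P<suggested>\w+)\."
)


def suggested_leaders(log_text):
    """Return (rejecting_node, suggested_node) pairs found in log_text."""
    # The archived log is hard-wrapped, so collapse all whitespace runs first.
    flattened = " ".join(log_text.split())
    return [
        (m.group("current"), m.group("suggested"))
        for m in NOT_LEADER_RE.finditer(flattened)
    ]
```

If every pair names the same rejecting node, the requests are still being sent to that node rather than to the suggested leader, which is what the three entries above show for om1.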

 

 

The ozone-site.xml configuration used to enable OM HA is also attached:

<property>
  <name>ozone.om.service.ids</name>
  <value>OMHA</value>
</property>
<property>
  <name>ozone.om.nodes.OMHA</name>
  <value>om1,om2,om3</value>
</property>
<property>
  <name>ozone.om.node.id</name>
  <value>om1</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om1</name>
  <value>9.134.50.210:9862</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om2</name>
  <value>9.134.51.215:9862</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om3</name>
  <value>9.134.51.25:9862</value>
</property>
<property>
  <name>ozone.om.ratis.enable</name>
  <value>true</value>
</property>
<property>
  <name>ozone.enabled</name>
  <value>true</value>
  <tag>OZONE, REQUIRED</tag>
  <description>
    Status of the Ozone Object Storage service is enabled.
    Set to true to enable Ozone.
    Set to false to disable Ozone.
    Unless this value is set to true, Ozone services will not be started in
    the cluster.

    Please note: By default ozone is disabled on a hadoop cluster.
  </description>
</property>
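
As a cross-check of the HA settings above, the following Python sketch maps each OM node id to its configured address. It assumes only the property naming visible in this config (`ozone.om.nodes.<serviceId>` and `ozone.om.address.<serviceId>.<nodeId>`). The function name is hypothetical, and because the fragment has no root element the code wraps it in one before parsing. This is a standalone illustration, not how Ozone itself reads its configuration.

```python
import xml.etree.ElementTree as ET


def om_addresses(properties_xml, service_id):
    """Map each OM node id of service_id to its configured host:port."""
    # The fragment in the report has no root element, so add one for parsing.
    root = ET.fromstring("<configuration>" + properties_xml + "</configuration>")

    # Collect name/value pairs from every <property> element.
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()

    # Node ids are a comma-separated list under ozone.om.nodes.<serviceId>.
    node_list = props.get("ozone.om.nodes." + service_id, "")
    nodes = [n.strip() for n in node_list.split(",") if n.strip()]

    # A node with no matching address property maps to None.
    return {
        node: props.get("ozone.om.address.%s.%s" % (service_id, node))
        for node in nodes
    }
```

For the config in this report, calling `om_addresses(fragment, "OMHA")` would be expected to list om1, om2 and om3 with the three addresses shown, which makes it easy to confirm that the suggested leader om2 has a configured address.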


