[jira] [Updated] (HDDS-3004) OM HA stability issues

Li Cheng (Jira) Mon, 24 Feb 2020 09:57:01 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Li Cheng updated HDDS-3004:
---------------------------
    Description: 
To summarize, the _+{color:#ff0000}major issues{color}+_ I have found:
1. When I run a long-running s3g write workload against a cluster with OM HA 
and stop the OM leader to force a re-election, the writes stop and never 
recover.

--updates 2020-02-20:

https://issues.apache.org/jira/browse/HDDS-3031 {color:#ff0000}fixes{color} 
this issue.

 

2. If I force an OM re-election and then restart SCM, the cluster cannot see 
any leader datanode and no datanodes are able to send pipeline reports, which 
makes the cluster unavailable as well. I consider this a multi-failover case: 
the leader OM and SCM are on the same node and that node suffers a short 
outage.

 

--updates 2020-02-20:

 When you do a jar swap to a new version of Ozone and enable OM HA while 
keeping the same ozone-site.xml as before, if you had written some data into 
the previous Ozone cluster (so there are existing versions and metadata for 
OM and SCM), SCM cannot start after the jar swap.

{color:#ff0000}Error logs{color}: 
"PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found" appears in the SCM 
out logs when the SCM process fails to start.

 

--updates 2020-02-24:

After adding some logging to the SCM starter, and assuming SCM is only bounced 
after the leader OM is stopped:
1. If SCM is bounced {color:#de350b}after{color} the former leader OM is 
restarted, meaning all OMs are up, SCM bootstraps correctly, but the pipeline 
report is missing from the node that has no OM process on it (it is always 
that node). As a result, all pipelines stay in the ALLOCATED state and the 
cluster stays in safemode. At this point, if I {color:#de350b}restart the 
black-sheep datanode{color}, it comes back and sends its pipeline report to 
SCM, and all pipelines move to the OPEN state.
2. If SCM is bounced {color:#de350b}before{color} the former leader OM is 
restarted, meaning not all OMs in the Ratis ring are up, SCM 
{color:#de350b}cannot{color} bootstrap correctly and reports "Pipeline not 
found".

 

Original posting:

I use the S3 gateway to write data continuously to a specific S3 gateway 
endpoint. After the writer starts working, I kill the OM process on the OM 
leader host. From then on, the S3 gateway never accepts writes again and keeps 
reporting InternalError for all incoming keys.

Process Process-488:
 S3UploadFailedError: Failed to upload ./20191204/file1056.dat to ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-489:
 S3UploadFailedError: Failed to upload ./20191204/file9631.dat to ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-490:
 S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-491:
 S3UploadFailedError: Failed to upload ./20191204/file4220.dat to ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-492:
 S3UploadFailedError: Failed to upload ./20191204/file5523.dat to ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
Process Process-493:
 S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error

That's a partial list; note that all keys are different. I also tried 
restarting the OM process on the previous leader OM, but it doesn't help since 
the leader has changed. Partial OM logs:
 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
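
For illustration only, here is a minimal Python sketch of how a caller could use the "Suggested leader" hint in the OMNotLeaderException message above to choose which OM to try next. It is not how the Ozone client or s3g handles failover, and it is not the HDDS-3031 fix; the function names suggested_leader and next_om are hypothetical, and the only input taken from this report is the message text and the node IDs om1, om2, om3.

```python
import re
from typing import List, Optional

# Matches the message text seen in the OM log above, e.g.
# "OM:om1 is not the leader. Suggested leader is OM:om2."
NOT_LEADER = re.compile(
    r"OM:(\w+)\s+is\s+not\s+the\s+leader\.\s+Suggested\s+leader\s+is\s+OM:(\w+)"
)


def suggested_leader(message: str) -> Optional[str]:
    """Return the suggested leader node ID, or None if the message has no hint."""
    match = NOT_LEADER.search(message)
    return match.group(2) if match else None


def next_om(current: str, nodes: List[str], message: str) -> str:
    """Pick the OM to try next (hypothetical helper).

    Prefer the suggested leader when it is a known node other than the
    current one; otherwise fall back to the next node in round-robin order.
    """
    hint = suggested_leader(message)
    if hint in nodes and hint != current:
        return hint
    return nodes[(nodes.index(current) + 1) % len(nodes)]


if __name__ == "__main__":
    log_line = (
        "org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: "
        "OM:om1 is not the leader. Suggested leader is OM:om2."
    )
    print(next_om("om1", ["om1", "om2", "om3"], log_line))
```

The round-robin fallback covers errors that carry no hint at all, such as a refused connection to a stopped OM.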

 

 

The ozone-site.xml config used to enable OM HA:

<property>
 <name>ozone.om.service.ids</name>
 <value>OMHA</value>
 </property>
 <property>
 <name>ozone.om.nodes.OMHA</name>
 <value>om1,om2,om3</value>
 </property>
 <property>
 <name>ozone.om.node.id</name>
 <value>om1</value>
 </property>
 <property>
 <name>ozone.om.address.OMHA.om1</name>
 <value>9.134.50.210:9862</value>
 </property>
 <property>
 <name>ozone.om.address.OMHA.om2</name>
 <value>9.134.51.215:9862</value>
 </property>
 <property>
 <name>ozone.om.address.OMHA.om3</name>
 <value>9.134.51.25:9862</value>
 </property>
 <property>
 <name>ozone.om.ratis.enable</name>
 <value>true</value>
 </property>
 <property>
 <name>ozone.enabled</name>
 <value>true</value>
 <tag>OZONE, REQUIRED</tag>
 <description>
 Status of the Ozone Object Storage service is enabled.
 Set to true to enable Ozone.
 Set to false to disable Ozone.
 Unless this value is set to true, Ozone services will not be started in
 the cluster.

Please note: By default ozone is disabled on a hadoop cluster.
 </description>
 </property>
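
As a quick sanity check of a config like the one above, the following Python sketch resolves every OM node listed under a service ID to its address. It assumes the properties are wrapped in a <configuration> root element (the fragment above omits it); load_properties and om_addresses are hypothetical helper names, and only the property keys and values come from this report.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# The OM HA properties from this report, wrapped in an assumed
# <configuration> root so the fragment parses as one XML document.
SAMPLE = """<configuration>
  <property><name>ozone.om.service.ids</name><value>OMHA</value></property>
  <property><name>ozone.om.nodes.OMHA</name><value>om1,om2,om3</value></property>
  <property><name>ozone.om.node.id</name><value>om1</value></property>
  <property><name>ozone.om.address.OMHA.om1</name><value>9.134.50.210:9862</value></property>
  <property><name>ozone.om.address.OMHA.om2</name><value>9.134.51.215:9862</value></property>
  <property><name>ozone.om.address.OMHA.om3</name><value>9.134.51.25:9862</value></property>
  <property><name>ozone.om.ratis.enable</name><value>true</value></property>
</configuration>"""


def load_properties(xml_text: str) -> Dict[str, str]:
    """Flatten <property><name>/<value> pairs into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def om_addresses(props: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map each service ID to {node ID: address}.

    Raises KeyError if a listed node has no address property.
    """
    result: Dict[str, Dict[str, str]] = {}
    for service_id in props["ozone.om.service.ids"].split(","):
        service_id = service_id.strip()
        nodes = props[f"ozone.om.nodes.{service_id}"].split(",")
        result[service_id] = {
            node.strip(): props[f"ozone.om.address.{service_id}.{node.strip()}"]
            for node in nodes
        }
    return result


if __name__ == "__main__":
    print(om_addresses(load_properties(SAMPLE)))
```

A KeyError here points at a node that is named in ozone.om.nodes.<service id> but has no matching ozone.om.address.<service id>.<node id> entry.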



> OM HA stability issues
> ----------------------
>
>                 Key: HDDS-3004
>                 URL: https://issues.apache.org/jira/browse/HDDS-3004
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: om
>    Affects Versions: 0.4.0
>            Reporter: Li Cheng
>            Assignee: Bharat Viswanadham
>            Priority: Blocker


