[ https://issues.apache.org/jira/browse/HDDS-3313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073476#comment-17073476 ]
Attila Doroszlai commented on HDDS-3313:
----------------------------------------
_failure 1_ scenario:
{code:title=Stop Leader OM and Verify Failover}
06:53:20.688 INFO Running command 'ozone sh volume create o3://omservice/volume1 2>&1'.
06:53:29.143 INFO ${rc} = 255
06:53:29.144 INFO ${output} = Couldn't create RpcClient protocol
{code}
I think this can happen if no OM leader has been elected yet, even by the end of the client's retry interval:
{code:title=log excerpt from each OM}
2020-04-02 06:53:29 INFO Server:2726 - IPC Server handler 2 on 9862, call Call#0 Retry#5 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:50872
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om3 is not the leader. Could not determine the leader node.
2020-04-02 06:53:29 INFO Server:2726 - IPC Server handler 68 on 9862, call Call#0 Retry#6 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:34354
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Could not determine the leader node.
2020-04-02 06:53:29 INFO Server:2726 - IPC Server handler 2 on 9862, call Call#0 Retry#7 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:54958
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om2 is not the leader. Could not determine the leader node.
{code}
Later, _Test Multiple Failovers_ fails with {{copyFromLocal: Volume volume1 is not found}}, presumably because the earlier volume creation failed.
https://github.com/adoroszlai/hadoop-ozone/runs/554140106
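One possible way to make the test tolerate a slow leader election is to keep re-running the failing command until it succeeds or a deadline passes. The following is only a minimal sketch of that idea, not the actual test code: the helper names {{run_with_leader_wait}} and {{create_volume}}, the timeout, and the interval are hypothetical. Only the {{ozone sh volume create o3://omservice/volume1}} command is taken from the log above.

```python
import subprocess
import time
from typing import Callable, Tuple


def run_with_leader_wait(
    run: Callable[[], Tuple[int, str]],
    timeout_s: float = 60.0,   # hypothetical value, tune to the election timeout
    interval_s: float = 2.0,   # hypothetical pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, str]:
    """Re-run `run` until it returns exit code 0 or `timeout_s` has elapsed.

    Returns the (exit code, output) pair of the last attempt.
    """
    deadline = clock() + timeout_s
    while True:
        rc, output = run()
        if rc == 0 or clock() >= deadline:
            return rc, output
        sleep(interval_s)


def create_volume() -> Tuple[int, str]:
    """Run the command from the failing test step, merging stderr into stdout."""
    proc = subprocess.run(
        ["ozone", "sh", "volume", "create", "o3://omservice/volume1"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


# Example (requires a running cluster with the ozone CLI on PATH):
#   rc, output = run_with_leader_wait(create_volume)
```

Note that blindly re-running a create command assumes the earlier attempts had no effect. If an attempt was applied but reported as failed, a later attempt may fail because the volume already exists, so the caller may need to treat that case as success.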
> OM HA acceptance test is flaky
> ------------------------------
>
> Key: HDDS-3313
> URL: https://issues.apache.org/jira/browse/HDDS-3313
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: test
> Reporter: Attila Doroszlai
> Assignee: Hanisha Koneru
> Priority: Critical
> Attachments: acceptance.zip
>
>
> {{ozone-om-ha}} test is failing intermittently. Example on master:
> https://github.com/apache/hadoop-ozone/runs/549544110
> {code:title=failure 1}
> 2020-03-31T19:34:02.3757399Z ==============================================================================
> 2020-03-31T19:34:02.3762775Z ozone-om-ha-testOMHA :: Smoketest ozone cluster startup
> 2020-03-31T19:34:02.3763313Z ==============================================================================
> 2020-03-31T19:34:07.9174050Z Stop Leader OM and Verify Failover | FAIL |
> 2020-03-31T19:34:07.9174675Z 255 != 0
> 2020-03-31T19:34:07.9176048Z ------------------------------------------------------------------------------
> 2020-03-31T19:34:37.4682717Z Test Multiple Failovers | FAIL |
> 2020-03-31T19:34:37.4682899Z 1 != 0
> 2020-03-31T19:34:37.4683766Z ------------------------------------------------------------------------------
> 2020-03-31T19:35:24.9569154Z Restart OM and Verify Ratis Logs | FAIL |
> 2020-03-31T19:35:24.9569529Z 255 != 0
> 2020-03-31T19:35:24.9574925Z ------------------------------------------------------------------------------
> 2020-03-31T19:35:24.9575613Z ozone-om-ha-testOMHA :: Smoketest ozone cluster startup | FAIL |
> 2020-03-31T19:35:24.9575952Z 3 critical tests, 0 passed, 3 failed
> 2020-03-31T19:35:24.9576076Z 3 tests total, 0 passed, 3 failed
> {code}
> {code:title=failure 2}
> 2020-03-31T20:36:29.5715868Z ==============================================================================
> 2020-03-31T20:36:29.5743517Z ozone-om-ha-testOMHA :: Smoketest ozone cluster startup
> 2020-03-31T20:36:29.5744025Z ==============================================================================
> 2020-03-31T20:37:08.4625840Z Stop Leader OM and Verify Failover | PASS |
> 2020-03-31T20:37:08.4626644Z ------------------------------------------------------------------------------
> 2020-03-31T20:39:47.9721513Z Test Multiple Failovers | PASS |
> 2020-03-31T20:39:47.9723424Z ------------------------------------------------------------------------------
> 2020-03-31T21:25:29.1203036Z Restart OM and Verify Ratis Logs | FAIL |
> 2020-03-31T21:25:29.1204001Z Test timeout 8 minutes exceeded.
> 2020-03-31T21:25:29.1204954Z ------------------------------------------------------------------------------
> 2020-03-31T21:25:29.1220689Z ozone-om-ha-testOMHA :: Smoketest ozone cluster startup | FAIL |
> 2020-03-31T21:25:29.1224446Z 3 critical tests, 2 passed, 1 failed
> 2020-03-31T21:25:29.1224833Z 3 tests total, 2 passed, 1 failed
> {code}