[ 
https://issues.apache.org/jira/browse/HDDS-3313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073476#comment-17073476
 ] 

Attila Doroszlai commented on HDDS-3313:
----------------------------------------

_failure 1_ scenario:

{code:title=Stop Leader OM and Verify Failover}
06:53:20.688    INFO    Running command 'ozone sh volume create o3://omservice/volume1 2>&1'.
06:53:29.143    INFO    ${rc} = 255     
06:53:29.144    INFO    ${output} = Couldn't create RpcClient protocol
{code}

I think this can happen if no OM leader has been elected yet, even by the end 
of the client's retry interval:

{code:title=log excerpt from each OM}
2020-04-02 06:53:29 INFO  Server:2726 - IPC Server handler 2 on 9862, call Call#0 Retry#5 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:50872
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om3 is not the leader. Could not determine the leader node.
2020-04-02 06:53:29 INFO  Server:2726 - IPC Server handler 68 on 9862, call Call#0 Retry#6 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:34354
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Could not determine the leader node.
2020-04-02 06:53:29 INFO  Server:2726 - IPC Server handler 2 on 9862, call Call#0 Retry#7 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 172.18.0.3:54958
org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om2 is not the leader. Could not determine the leader node.
{code}
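
To illustrate the timing issue, here is a minimal sketch of a test-side wait loop that keeps re-running a command until it succeeds or a deadline passes. It is not the actual acceptance test code or a proposed fix. The helper names ({{retry_until_leader}}, {{make_fake_command}}) and the wait parameters are hypothetical, and the fake command only imitates the exit code and message seen in the output above.

```python
import time
from typing import Callable, Tuple

CommandResult = Tuple[int, str]  # (exit code, combined output)


def retry_until_leader(
    run: Callable[[], CommandResult],
    max_wait_seconds: float,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> CommandResult:
    """Re-run `run` until it exits 0 or `max_wait_seconds` has elapsed.

    Returns the result of the last attempt, successful or not.
    """
    deadline = clock() + max_wait_seconds
    while True:
        rc, output = run()
        if rc == 0 or clock() >= deadline:
            return rc, output
        sleep(interval_seconds)


def make_fake_command(failures_before_leader: int) -> Callable[[], CommandResult]:
    """Hypothetical stand-in for the volume create command.

    Fails with rc 255 for the first `failures_before_leader` calls
    (no leader elected yet), then succeeds.
    """
    calls = {"n": 0}

    def run() -> CommandResult:
        calls["n"] += 1
        if calls["n"] <= failures_before_leader:
            return 255, "Couldn't create RpcClient protocol"
        return 0, ""

    return run


if __name__ == "__main__":
    # Leader appears after three failed attempts; the wait is long enough.
    print(retry_until_leader(make_fake_command(3), max_wait_seconds=60,
                             interval_seconds=0))
    # No time budget at all: the first failure is returned as-is.
    print(retry_until_leader(make_fake_command(3), max_wait_seconds=0))
```

In a real test, {{run}} would wrap the {{ozone sh volume create o3://omservice/volume1}} invocation shown above. Whether the wait belongs in the test or in the client's retry settings is a separate question that this sketch does not answer.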

Later _Test Multiple Failovers_ fails due to {{copyFromLocal: Volume volume1 is 
not found}}, presumably because the volume creation above never succeeded.

https://github.com/adoroszlai/hadoop-ozone/runs/554140106

> OM HA acceptance test is flaky
> ------------------------------
>
>                 Key: HDDS-3313
>                 URL: https://issues.apache.org/jira/browse/HDDS-3313
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: test
>            Reporter: Attila Doroszlai
>            Assignee: Hanisha Koneru
>            Priority: Critical
>         Attachments: acceptance.zip
>
>
> {{ozone-om-ha}} test is failing intermittently.  Example on master: 
> https://github.com/apache/hadoop-ozone/runs/549544110
> {code:title=failure 1}
> 2020-03-31T19:34:02.3757399Z 
> ==============================================================================
> 2020-03-31T19:34:02.3762775Z ozone-om-ha-testOMHA :: Smoketest ozone cluster 
> startup                       
> 2020-03-31T19:34:02.3763313Z 
> ==============================================================================
> 2020-03-31T19:34:07.9174050Z Stop Leader OM and Verify Failover               
>                      | FAIL |
> 2020-03-31T19:34:07.9174675Z 255 != 0
> 2020-03-31T19:34:07.9176048Z 
> ------------------------------------------------------------------------------
> 2020-03-31T19:34:37.4682717Z Test Multiple Failovers                          
>                      | FAIL |
> 2020-03-31T19:34:37.4682899Z 1 != 0
> 2020-03-31T19:34:37.4683766Z 
> ------------------------------------------------------------------------------
> 2020-03-31T19:35:24.9569154Z Restart OM and Verify Ratis Logs                 
>                      | FAIL |
> 2020-03-31T19:35:24.9569529Z 255 != 0
> 2020-03-31T19:35:24.9574925Z 
> ------------------------------------------------------------------------------
> 2020-03-31T19:35:24.9575613Z ozone-om-ha-testOMHA :: Smoketest ozone cluster 
> startup               | FAIL |
> 2020-03-31T19:35:24.9575952Z 3 critical tests, 0 passed, 3 failed
> 2020-03-31T19:35:24.9576076Z 3 tests total, 0 passed, 3 failed
> {code}
> {code:title=failure 2}
> 2020-03-31T20:36:29.5715868Z 
> ==============================================================================
> 2020-03-31T20:36:29.5743517Z ozone-om-ha-testOMHA :: Smoketest ozone cluster 
> startup                       
> 2020-03-31T20:36:29.5744025Z 
> ==============================================================================
> 2020-03-31T20:37:08.4625840Z Stop Leader OM and Verify Failover               
>                      | PASS |
> 2020-03-31T20:37:08.4626644Z 
> ------------------------------------------------------------------------------
> 2020-03-31T20:39:47.9721513Z Test Multiple Failovers                          
>                      | PASS |
> 2020-03-31T20:39:47.9723424Z 
> ------------------------------------------------------------------------------
> 2020-03-31T21:25:29.1203036Z Restart OM and Verify Ratis Logs                 
>                      | FAIL |
> 2020-03-31T21:25:29.1204001Z Test timeout 8 minutes exceeded.
> 2020-03-31T21:25:29.1204954Z 
> ------------------------------------------------------------------------------
> 2020-03-31T21:25:29.1220689Z ozone-om-ha-testOMHA :: Smoketest ozone cluster 
> startup               | FAIL |
> 2020-03-31T21:25:29.1224446Z 3 critical tests, 2 passed, 1 failed
> 2020-03-31T21:25:29.1224833Z 3 tests total, 2 passed, 1 failed
> {code}


