Attila Doroszlai created HDDS-7518:
--------------------------------------

             Summary: Intermittent failure in ozonesecure replication test
                 Key: HDDS-7518
                 URL: https://issues.apache.org/jira/browse/HDDS-7518
             Project: Apache Ozone
          Issue Type: Bug
          Components: test
    Affects Versions: 1.4.0
            Reporter: Attila Doroszlai
            Assignee: Dave Teng


HDDS-7260 increased the datanode count from 3 to 5 in the {{ozonesecure}}
acceptance test.  This causes intermittent but frequent failures in the
replication test.

{code:title=https://github.com/apache/ozone/blob/85e7cd1867ec9000df798c74ab0f9cf153936a5d/hadoop-ozone/dist/src/main/compose/ozonesecure/test.sh#L57-L61}
# test replication
docker-compose up -d --scale datanode=2
execute_robot_test scm -v container:1 -v count:2 replication/wait.robot
docker-compose up -d --scale datanode=3
execute_robot_test scm -v container:1 -v count:3 replication/wait.robot
{code}

The test scales datanodes to 2, waits for container replica count = 2, then 
scales to 3, and waits for replica count = 3.

With 3 initial datanodes, the test can assume that every node holds a replica
of the container.  After scaling to 2 and then to 3 datanodes, the replica
count therefore always matches expectations, provided replication works
correctly.

Now with 5 initial datanodes, when the test scales datanodes to 2, the 
container may have 0, 1 or 2 healthy replicas left, depending on where the 
original 3 replicas were stored.  And when datanodes are scaled to 3, the 
container may have 1, 2 or 3 replicas.
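
To make the replica arithmetic concrete, here is a minimal sketch that enumerates every way 3 replicas can be placed on the initial datanodes and tallies how many replicas remain after scaling down. The function name {{replica_distribution}} is hypothetical and not part of the Ozone test suite. It also assumes that datanodes 1..N are the ones that stay up; which containers docker-compose actually keeps is not stated above.

```shell
#!/usr/bin/env bash
# Sketch only: counts replica placements, it does not talk to a cluster.
# Usage: replica_distribution <initial datanodes> <datanodes after scale-down>
# Prints four numbers: how many placements leave 0, 1, 2 and 3 replicas.
# Assumption: datanodes numbered 1..survivors are the ones that keep running.
replica_distribution() {
  local total=$1 survivors=$2
  local -a tally=(0 0 0 0)
  local a b c node left
  # Enumerate every 3-node subset {a, b, c} that could hold the replicas.
  for ((a = 1; a <= total; a++)); do
    for ((b = a + 1; b <= total; b++)); do
      for ((c = b + 1; c <= total; c++)); do
        left=0
        for node in "$a" "$b" "$c"; do
          if ((node <= survivors)); then
            left=$((left + 1))
          fi
        done
        tally[left]=$((tally[left] + 1))
      done
    done
  done
  echo "${tally[*]}"
}

replica_distribution 5 2   # 5 initial datanodes, scaled down to 2
replica_distribution 3 2   # 3 initial datanodes, scaled down to 2
```

With 5 initial datanodes this prints {{1 6 3 0}}: of the 10 possible placements, only 3 leave the 2 replicas that {{replication/wait.robot}} waits for. With 3 initial datanodes it prints {{0 0 1 0}}, since the single possible placement always leaves 2 replicas. Whether all placements are equally likely in practice depends on the placement policy, so these counts illustrate the cases rather than give exact failure rates.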

Thus the test fails frequently, but not in 100% of runs.

* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18547/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18566/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18559/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18554/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/19/18570/acceptance-secure/

Some possible fixes:
# move the EC-specific tests to {{hadoop-ozone/dist/src/main/smoketest/ec}},
which is executed only in the {{ozone}} environment, with 5 datanodes
# scale datanodes to 3 before starting the replication test, and create a new
container for the test
# scale datanodes to 5 before the {{ozonefs}} test, then scale back to 3

I prefer solution (1).

CC [~sodonnell] [~weichiu]



