Attila Doroszlai created HDDS-7518:
--------------------------------------
Summary: Intermittent failure in ozonesecure replication test
Key: HDDS-7518
URL: https://issues.apache.org/jira/browse/HDDS-7518
Project: Apache Ozone
Issue Type: Bug
Components: test
Affects Versions: 1.4.0
Reporter: Attila Doroszlai
Assignee: Dave Teng
HDDS-7260 increased the datanode count from 3 to 5 in the {{ozonesecure}}
acceptance test. This causes intermittent but frequent failures in the
replication test.
{code:title=https://github.com/apache/ozone/blob/85e7cd1867ec9000df798c74ab0f9cf153936a5d/hadoop-ozone/dist/src/main/compose/ozonesecure/test.sh#L57-L61}
# test replication
docker-compose up -d --scale datanode=2
execute_robot_test scm -v container:1 -v count:2 replication/wait.robot
docker-compose up -d --scale datanode=3
execute_robot_test scm -v container:1 -v count:3 replication/wait.robot
{code}
The test scales datanodes to 2, waits for container replica count = 2, then
scales to 3, and waits for replica count = 3.
With 3 initial datanodes, the test can assume that every node holds a replica
of the container. After scaling to 2 and then to 3 datanodes, the replica
count therefore always matches expectations, provided replication works
correctly.
Now with 5 initial datanodes, when the test scales datanodes to 2, the
container may have 0, 1 or 2 healthy replicas left, depending on where the
original 3 replicas were stored. And when datanodes are scaled to 3, the
container may have 1, 2 or 3 replicas.
Thus the test fails frequently, but not in 100% of runs.
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18547/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18566/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18559/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18554/acceptance-secure/
* https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/19/18570/acceptance-secure/
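The replica counts described above (0, 1 or 2 after scaling down to 2 datanodes) can be checked with a small enumeration. This is only an illustrative sketch, not part of the Ozone test suite: the function name {{count_survivors}} is hypothetical, and it assumes the two surviving datanodes are numbers 1 and 2 (by symmetry, any pair gives the same tally). It lists every way to place 3 replicas on 5 datanodes and counts how many replicas land on the survivors.

```shell
#!/usr/bin/env bash
# Enumerate all placements of 3 replicas on 5 datanodes and tally how many
# replicas remain on the 2 datanodes that survive the scale-down.
# Assumption: datanodes 1 and 2 are the survivors (any pair gives the same tally).
count_survivors() {
  local -a tally=(0 0 0)
  local a b c n survivors
  for ((a = 1; a <= 5; a++)); do
    for ((b = a + 1; b <= 5; b++)); do
      for ((c = b + 1; c <= 5; c++)); do
        survivors=0
        for n in "$a" "$b" "$c"; do
          if ((n <= 2)); then
            survivors=$((survivors + 1))
          fi
        done
        tally[survivors]=$((tally[survivors] + 1))
      done
    done
  done
  # Format: <surviving replicas>:<number of placements>
  echo "0:${tally[0]} 1:${tally[1]} 2:${tally[2]}"
}

count_survivors
# prints "0:1 1:6 2:3"
```

Of the 10 possible placements, only 3 leave both replicas on the surviving datanodes. If every placement were equally likely (an assumption; the actual placement policy may differ), the container would start the first wait with fewer than 2 healthy replicas in most runs, which is consistent with the frequent but not universal failures.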
Some possible fixes:
# move the EC-specific tests to {{hadoop-ozone/dist/src/main/smoketest/ec}},
which is executed only in the {{ozone}} environment, with 5 datanodes
# scale datanodes to 3 before starting the replication test, and create a new
container for the test
# scale datanodes to 5 before {{ozonefs}} test, then scale back to 3
I prefer solution (1).
CC [~sodonnell] [~weichiu]