[ https://issues.apache.org/jira/browse/HDDS-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Attila Doroszlai resolved HDDS-7518.
------------------------------------
Fix Version/s: 1.4.0
Resolution: Fixed
> Intermittent failure in ozonesecure replication test
> ----------------------------------------------------
>
> Key: HDDS-7518
> URL: https://issues.apache.org/jira/browse/HDDS-7518
> Project: Apache Ozone
> Issue Type: Bug
> Components: test
> Affects Versions: 1.4.0
> Reporter: Attila Doroszlai
> Assignee: Dave Teng
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.4.0
>
>
> HDDS-7260 increased the datanode count from 3 to 5 in the {{ozonesecure}}
> acceptance test. This causes intermittent but frequent failures in the
> replication test.
> {code:title=https://github.com/apache/ozone/blob/85e7cd1867ec9000df798c74ab0f9cf153936a5d/hadoop-ozone/dist/src/main/compose/ozonesecure/test.sh#L57-L61}
> # test replication
> docker-compose up -d --scale datanode=2
> execute_robot_test scm -v container:1 -v count:2 replication/wait.robot
> docker-compose up -d --scale datanode=3
> execute_robot_test scm -v container:1 -v count:3 replication/wait.robot
> {code}
> The test scales datanodes down to 2, waits for the container replica count to
> reach 2, then scales up to 3 and waits for the replica count to reach 3.
> With 3 initial datanodes, the test can assume that every node holds a replica
> of the container, so after scaling to 2 and then to 3 datanodes the replica
> count always matches expectations, provided replication works correctly.
> With 5 initial datanodes, scaling down to 2 may leave the container with 0, 1
> or 2 healthy replicas, depending on which nodes stored the original 3
> replicas. Likewise, after scaling to 3 datanodes the container may have 1, 2
> or 3 replicas.
> Thus the test fails frequently, but not in 100% of runs.
> * https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18547/acceptance-secure/
> * https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18566/acceptance-secure/
> * https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18559/acceptance-secure/
> * https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/18/18554/acceptance-secure/
> * https://github.com/adoroszlai/ozone-build-results/tree/master/2022/11/19/18570/acceptance-secure/
> Some possible fixes:
> # move the EC-specific tests to {{hadoop-ozone/dist/src/main/smoketest/ec}},
> which is executed only in {{ozone}} environment with 5 datanodes, and revert
> {{ozonesecure}} back to 3 datanodes
> # scale datanodes to 3 before starting the replication test, and create a new
> container for the test
> # scale datanodes to 5 before {{ozonefs}} test, then scale back to 3
> I prefer solution (1).
> CC [~sodonnell] [~weichiu]
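
The replica counts quoted in the description can be checked by brute-force enumeration. The following is only a rough sketch of a toy model, not part of the Ozone test suite: it assumes the 3 original replicas are placed uniformly at random on the 5 datanodes, that scaling keeps the lowest-numbered datanodes, and it counts only replicas already present on the surviving nodes (no re-replication by SCM). The function name {{replica_distribution}} is hypothetical.

```shell
#!/usr/bin/env bash
# Toy model (assumption): 3 replicas on 5 datanodes numbered 1..5, every
# placement equally likely. Count how many of the original replicas sit on
# datanodes 1..keep, i.e. the nodes assumed to remain after scaling.
replica_distribution() {
  local keep=$1 a b c n node
  local -a tally=(0 0 0 0)   # tally[k] = number of placements with k replicas left
  for ((a = 1; a <= 5; a++)); do
    for ((b = a + 1; b <= 5; b++)); do
      for ((c = b + 1; c <= 5; c++)); do
        n=0
        for node in "$a" "$b" "$c"; do
          if ((node <= keep)); then n=$((n + 1)); fi
        done
        tally[n]=$((tally[n] + 1))
      done
    done
  done
  echo "keep=$keep: 0:${tally[0]} 1:${tally[1]} 2:${tally[2]} 3:${tally[3]} (of 10 placements)"
}

replica_distribution 2   # prints: keep=2: 0:1 1:6 2:3 3:0 (of 10 placements)
replica_distribution 3   # prints: keep=3: 0:0 1:3 2:6 3:1 (of 10 placements)
```

Under these assumptions, only 3 of the 10 placements leave exactly 2 replicas after scaling to 2 datanodes, and only 1 of 10 leaves all 3 replicas on the 3 retained datanodes. This agrees with the ranges given in the description (0, 1 or 2, then 1, 2 or 3); actual failure rates depend on real placement and replication behaviour.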