dombizita commented on code in PR #9877:
URL: https://github.com/apache/ozone/pull/9877#discussion_r2988045589
##########
hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml:
##########
@@ -22,6 +22,13 @@ x-common-config:
- ../../../common/security.conf
image: ${OZONE_TEST_IMAGE}
dns_search: .
+ extra_hosts:
+ - "om1:10.9.0.11"
+ - "om2:10.9.0.12"
+ - "om3:10.9.0.13"
+ - "scm1.org:10.9.0.14"
+ - "scm2.org:10.9.0.15"
+ - "scm3.org:10.9.0.16"
Review Comment:
Without this I saw test failures with key creation timeouts:
```
/Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade
fails like this: "--- RESTARTING scm1 WITH IMAGE 2.2.0 ---
--- STOPPING scm1 ---
--- STOPPED scm1 ---
--- SCM BEFORE: scm3 ---
--- SCM AFTER: scm3 ---
--- CALLING before_service_restart with scm1 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data
==============================================================================
Create a volume and bucket | PASS
|
------------------------------------------------------------------------------
Create key | FAIL
|
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Create a bucket in s3v volume | PASS
|
------------------------------------------------------------------------------
Create key in the bucket in s3v volume | FAIL
|
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Try to create a bucket using S3 API | PASS
|
------------------------------------------------------------------------------
Create key using S3 API | FAIL
|
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data | FAIL
|
6 tests, 3 passed, 3 failed
=============================================================================="
```
In this test, when `scm1` was stopped, DNS resolution for `scm1.org` fell
back to public DNS, and key creation timed out while clients kept retrying
the bad address:
```
Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27
```
`extra_hosts` forces deterministic in-cluster resolution even when a node is
down, so HA client retries stay on the intended private IPs. The same
`extra_hosts` approach is already used in other docker yaml files where we have
HA and stop containers one by one (e.g. debug tools, decommissioning).
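For anyone who wants to check this locally, here is a minimal sketch of a resolution check. It is not part of the PR. It assumes the compose network is `10.9.0.0/24` (inferred from the `extra_hosts` entries above) and that Python is available wherever it is run. The helper names `is_in_cluster` and `check_host` are hypothetical.

```python
import ipaddress
import socket

# Assumption: the compose network is 10.9.0.0/24, inferred from the
# extra_hosts entries in this diff; adjust if the network differs.
CLUSTER_SUBNET = ipaddress.ip_network("10.9.0.0/24")


def is_in_cluster(address: str) -> bool:
    """Return True if the address lies inside the assumed compose subnet."""
    return ipaddress.ip_address(address) in CLUSTER_SUBNET


def check_host(hostname: str) -> str:
    """Resolve a hostname and report whether it stayed on the private subnet."""
    try:
        address = socket.gethostbyname(hostname)
    except socket.gaierror as err:
        return f"{hostname}: unresolved ({err})"
    status = "in-cluster" if is_in_cluster(address) else "OUTSIDE cluster subnet"
    return f"{hostname} -> {address} ({status})"


if __name__ == "__main__":
    # Hostnames taken from the extra_hosts entries above.
    for host in ("scm1.org", "scm2.org", "scm3.org"):
        print(check_host(host))
```

Run inside a container on the compose network while `scm1` is stopped, a check like this should flag `scm1.org` as outside the subnet without `extra_hosts`, matching the `Address change detected` line above.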
Cursor response while debugging: "Most likely root cause in your run:
scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM
clients get stuck retrying that bad address. `scm1` is the hostname that
collides with real DNS (`scm1.org`), so when container DNS entry disappears
during stop, resolver falls back to public DNS. Then Java caches/keeps retrying
that bad endpoint long enough to hit your 5-minute test timeout."
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]