dombizita commented on code in PR #9877:
URL: https://github.com/apache/ozone/pull/9877#discussion_r2988045589


##########
hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml:
##########
@@ -22,6 +22,13 @@ x-common-config:
     - ../../../common/security.conf
   image: ${OZONE_TEST_IMAGE}
   dns_search: .
+  extra_hosts:
+    - "om1:10.9.0.11"
+    - "om2:10.9.0.12"
+    - "om3:10.9.0.13"
+    - "scm1.org:10.9.0.14"
+    - "scm2.org:10.9.0.15"
+    - "scm3.org:10.9.0.16"

Review Comment:
   Without this I saw test failures with key creation timeouts. Running the test in `/Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade` fails like this:
   ```
   --- RESTARTING scm1 WITH IMAGE 2.2.0 ---
   --- STOPPING scm1 ---
   --- STOPPED scm1 ---
   --- SCM BEFORE: scm3 ---
   --- SCM AFTER: scm3 ---
   --- CALLING before_service_restart with scm1 ---
   Using Docker Compose v2
   ==============================================================================
   2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data
   ==============================================================================
   Create a volume and bucket                                            | PASS |
   ------------------------------------------------------------------------------
   Create key                                                            | FAIL |
   Test timeout 5 minutes exceeded.
   ------------------------------------------------------------------------------
   Create a bucket in s3v volume                                         | PASS |
   ------------------------------------------------------------------------------
   Create key in the bucket in s3v volume                                | FAIL |
   Test timeout 5 minutes exceeded.
   ------------------------------------------------------------------------------
   Try to create a bucket using S3 API                                   | PASS |
   ------------------------------------------------------------------------------
   Create key using S3 API                                               | FAIL |
   Test timeout 5 minutes exceeded.
   ------------------------------------------------------------------------------
   2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data            | FAIL |
   6 tests, 3 passed, 3 failed
   ==============================================================================
   ```
   
   In this test, when `scm1` is stopped, DNS resolution for `scm1.org` falls 
back to public DNS, which causes key creation to time out while clients keep 
retrying the bad address:
   ```
   Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27
   ```
   `extra_hosts` forces deterministic in-cluster resolution even when a node is 
down, so HA client retries stay on the intended private IPs. The same 
`extra_hosts` setup is also used in other Docker Compose YAML files where we 
run HA and stop containers one by one (e.g. debug tools, decommissioning).
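   
   To spot this failure mode in collected logs, a check along these lines might help. It is only a sketch: `AddressChangeCheck` and its methods are hypothetical helpers, not part of Ozone, and the regex assumes the log line has exactly the shape quoted above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Ozone): flags "Address change detected"
 * log lines whose new address is outside the private IPv4 ranges.
 */
public final class AddressChangeCheck {

  // Assumed log shape: "Address change detected. Old: host/ip New: host/ip"
  private static final Pattern CHANGE = Pattern.compile(
      "Address change detected\\. Old: (\\S+)/(\\S+) New: (\\S+)/(\\S+)");

  private AddressChangeCheck() { }

  /** Returns true if the literal IP is in a private (site-local) range. */
  public static boolean isPrivate(String ipLiteral) {
    try {
      // For a literal IP, getByName only validates the format; no DNS lookup.
      return InetAddress.getByName(ipLiteral).isSiteLocalAddress();
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("Not an IP literal: " + ipLiteral, e);
    }
  }

  /** Returns true if the line reports a change to a non-private address. */
  public static boolean leakedToPublicDns(String logLine) {
    Matcher m = CHANGE.matcher(logLine);
    return m.find() && !isPrivate(m.group(4));
  }
}
```

   With `extra_hosts` in place, no log line should trip `leakedToPublicDns`, because `scm1.org` keeps resolving to `10.9.0.14` even while the container is stopped.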
   
   Cursor response while debugging: "Most likely root cause in your run: 
scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM 
clients get stuck retrying that bad address. `scm1` is the hostname that 
collides with real DNS (`scm1.org`), so when container DNS entry disappears 
during stop, resolver falls back to public DNS. Then Java caches/keeps retrying 
that bad endpoint long enough to hit your 5-minute test timeout."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

