tverkade opened a new issue, #13422:
URL: https://github.com/apache/cloudstack/issues/13422

   ### problem
   
   On a 4.22.1.0 deployment with two management servers behind a load balancer, 
VM and system-VM console access fails intermittently. The browser receives a 
generic Apache-style "Internal Server Error" page. Reopening the console 
sometimes works, sometimes fails, with no change to the environment.
   
   The failure has been isolated to the console session token validation path. 
Every surrounding component has been verified healthy. Pinning the browser 
directly to a single management server (bypassing the load balancer) does not 
resolve it, so this is not simply LB source-IP rewriting.
   
   - Browser shows stock "Internal Server Error" (Apache-style) page 
intermittently when opening a console.
   - CPVM (/var/log/cloud/cloud.out) on failure:
       Session <uuid> has already been used, cannot connect
       External authenticator failed authentication request for vm <vm-uuid> with sid <sid>
       ERROR ConsoleProxyNoVNCHandler ... Failed to create viewer ...
         com.cloud.consoleproxy.AuthenticationException: External authenticator failed request
   - Also observed earlier: org.eclipse.jetty.websocket.api.CloseException: 
TimeoutException: Idle timeout expired: 300000/300000 ms.
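
To tie CPVM failures back to rows in the database, the session IDs can be pulled out of the "already been used" message quoted above. This is a minimal Python sketch, not part of CloudStack; the function name is illustrative, and the regular expression assumes the message text is exactly as shown in this report.

```python
import re

# Matches the CPVM message quoted in this report and captures the session id.
USED_SESSION = re.compile(r"Session (\S+) has already been used, cannot connect")


def reused_session_ids(lines):
    """Return the session ids from every 'already been used' log line."""
    ids = []
    for line in lines:
        match = USED_SESSION.search(line)
        if match:
            ids.append(match.group(1))
    return ids
```

Feeding it the lines of /var/log/cloud/cloud.out yields the list of rejected session IDs, which can then be looked up in cloud.console_session.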
   
   KEY DIAGNOSTIC FINDING
    
   The cloud.console_session table records the load balancer IP (10.125.128.38)
   as console_endpoint_creator_address for sessions, and a subset of rows never
   reach acquired/removed (these correlate with the failures):
    
       id    uuid        instance  host  acquired             removed              creator_address  client_address
       1414  94403a87..  6         14    NULL                 NULL                 172.31.1.204     NULL
       1413  91842ef1..  85        14    2026-06-15 13:40:54  2026-06-15 13:41:00  172.31.1.204     172.31.1.204
       1412  ade60441..  1902      27    2026-06-15 13:39:10  2026-06-15 13:39:20  10.125.128.38    172.31.1.204
       1411  135aaa62..  1902      27    2026-06-15 13:37:16  2026-06-15 13:39:10  10.125.128.38    172.31.1.204
       1410  6cd4ac53..  1881      27    NULL                 NULL                 10.125.128.38    NULL
    
   console_endpoint_creator_address is being recorded as the load balancer 
(10.125.128.38) and in
   some rows as the client IP (172.31.1.204) — never as a real management 
server IP (.37/.39).
   Neither .38 nor the client is a valid validation target, which appears to be 
why those sessions
   are never acquired.
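
The rows in the excerpt can be labelled mechanically by what kind of address was recorded as creator and whether the session was ever acquired. This is a minimal Python sketch under stated assumptions: the function names are illustrative, the load balancer and client IPs are taken from this report, and the management server IPs are only abbreviated here (.37/.39), so `management_ips` is left empty for the reader to fill in.

```python
LOAD_BALANCER_IP = "10.125.128.38"  # from this report
CLIENT_IP = "172.31.1.204"          # from this report

# (id, acquired, creator_address), copied from the excerpt in this report.
ROWS = [
    (1414, None, "172.31.1.204"),
    (1413, "2026-06-15 13:40:54", "172.31.1.204"),
    (1412, "2026-06-15 13:39:10", "10.125.128.38"),
    (1411, "2026-06-15 13:37:16", "10.125.128.38"),
    (1410, None, "10.125.128.38"),
]


def creator_kind(address, management_ips):
    """Label a creator address by the role of the machine that owns it."""
    if address in management_ips:
        return "management server"
    if address == LOAD_BALANCER_IP:
        return "load balancer"
    if address == CLIENT_IP:
        return "client"
    return "unknown"


def classify(rows, management_ips=frozenset()):
    """Map row id -> (kind of creator address, whether the row was acquired)."""
    return {
        row_id: (creator_kind(creator, management_ips), acquired is not None)
        for row_id, acquired, creator in rows
    }
```

Calling `classify(ROWS)` on the excerpt labels every creator address as either "load balancer" or "client"; passing the real management server IPs as `management_ips` would show whether any row was created with one of them.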
    
   On failure, NOTHING is logged on either management server 
(management-server.log filtered for
   console|authentication|failed|console_session shows only cluster heartbeat). 
On success, the
   minting MS logs the full createConsoleEndpoint -> Compose console url -> 
Adding allowed session
   -> ConsoleAccessAuthenticationCommand flow. So failing validations are being 
rejected at the
   CPVM and never reach a real MS.
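
For anyone repeating the log check, the filter described above can be sketched in Python as follows. The function names are illustrative, and matching case-insensitively is an assumption, since the report does not say how the filter was applied.

```python
import re

# Same alternation as the filter quoted above. Case-insensitive matching is an
# assumption made for this sketch.
CONSOLE_PATTERN = re.compile(
    r"console|authentication|failed|console_session", re.IGNORECASE
)


def console_related(lines):
    """Return only the lines that mention the console/authentication flow."""
    return [line for line in lines if CONSOLE_PATTERN.search(line)]


def console_related_in_file(path):
    """Apply the same filter to a log file given by path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return console_related(handle)
```

An empty result for the time window of a failing attempt on both management servers would reproduce the observation that the failing validation never reaches an MS.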
    
    
   RULED OUT (verified, not assumed)
    
   - CPVM health: up 3+ days, ports 80/8001/8080 bound by the cloud Java 
process (no stray Apache),
     actively serving noVNC. curl of /resource/noVNC/vnc.html returns 200.
   - Network/MTU: DF ping at 1472 bytes succeeds with 0% loss between client and 
CPVM; path MTU is full 1500.
   - Management servers: both Up in mshost; both reachable on 8250 from CPVM; 
clocks within ~1s
     (verified date -u).
   - KVM host clock: within ~1s of MS.
   - VNC port: virsh vncdisplay v-1902-VM = :4 (5904), exactly matching the 
port the MS hands out —
     no stale-port mismatch.
   - Agent version: host cloudstack-agent = 4.22.1.0-shapeblue0, matches MS.
   - cluster.node.IP: correctly set per node; no rogue/extra management 
instance.
   - CPVM rebuild: destroyed/recreated multiple times — no effect.
   - Pinning browser directly to a single MS 
(https://<mgmt-server-ip>:8080/client), bypassing the LB:
     still fails. (So this is NOT solely LB source-IP rewriting / cross-MS 
in-memory token, despite
     PR #7094 being present and the console_session table being populated.)
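
Two of the checks in this list rest on small pieces of arithmetic, spelled out here as a Python sketch. It assumes IPv4 without options (20-byte header), an 8-byte ICMP header, and the conventional VNC base port of 5900; the function names are illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header
VNC_BASE_PORT = 5900    # conventional base port for VNC displays


def packet_size_for_ping_payload(payload_bytes):
    """Total IP packet size produced by a ping with the given payload size."""
    return payload_bytes + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES


def vnc_port(display):
    """TCP port for a display string such as ':4' (as printed by virsh vncdisplay)."""
    return VNC_BASE_PORT + int(display.rsplit(":", 1)[-1])
```

A 1472-byte payload therefore fills a 1500-byte packet, and display `:4` corresponds to port 5904, which is consistent with both checks above.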
    
    
   WHY THIS LOOKS LIKE A BUG, NOT MISCONFIGURATION
    
   - PR #7094 (DB-backed console sessions for multi-MS) is present — the 
console_session table exists
     and is written.
   - Yet console_endpoint_creator_address is being populated with the load 
balancer IP and sometimes
     the client IP, rather than the processing management server's own IP. 
Those values are not valid
     validation targets.
   - Bypassing the load balancer (direct-to-MS) does not fix it, so source-IP 
rewriting by the LB is
     not a complete explanation.
   - The result is intermittent, single-use-token "already used / external 
authenticator failed"
     rejections at the CPVM, with no corresponding log on any management server.
   
   ### versions
   
   - CloudStack version: 4.22.1.0-shapeblue0
   - Hypervisor: KVM (host agent cloudstack-agent 4.22.1.0-shapeblue0, matched 
to MS)
   - Management servers: 2 nodes
       mshost table: both nodes Up
   - HAProxy load balancer (fronts the management/UI tier; NOT a management 
server)
   - SSL: disabled (consoleproxy.sslEnabled=false, no consoleproxy.url.domain)
   - CPVM: v-1902-VM, agent up, serving noVNC
   - Client: Windows workstation
   - Console settings present: console.session.cleanup.interval=180,
     console.session.cleanup.retention.hours=240,
     consoleproxy.session.timeout=300000, consoleproxy.session.max=50,
     novnc.console.default=true, novnc.console.sourceip.check.enabled=false
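
For readability, the settings above can be parsed and two of them converted to friendlier units. This Python sketch assumes consoleproxy.session.timeout is expressed in milliseconds, which the identical 300000 ms figure in the Jetty idle-timeout message suggests; the helper name is illustrative, and the unit of console.session.cleanup.interval is not stated here, so it is left unconverted.

```python
# Settings copied from the list above.
SETTINGS_TEXT = """
console.session.cleanup.interval=180
console.session.cleanup.retention.hours=240
consoleproxy.session.timeout=300000
consoleproxy.session.max=50
novnc.console.default=true
novnc.console.sourceip.check.enabled=false
"""


def parse_settings(text):
    """Parse 'key=value' lines into a dict of strings, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


settings = parse_settings(SETTINGS_TEXT)
# Assumption: the timeout value is in milliseconds.
timeout_seconds = int(settings["consoleproxy.session.timeout"]) / 1000
retention_days = int(settings["console.session.cleanup.retention.hours"]) / 24
```

Under that assumption the session timeout is 300 seconds (5 minutes) and the retention window is 10 days.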
   
   ### The steps to reproduce the bug
   
   1. Deploy CloudStack 4.22.1.0 with two management servers behind a load 
balancer fronting the management/UI tier. KVM hypervisor, SSL disabled.
   2. From a client on a different subnet, open the CloudStack UI through the 
load balancer and click "View Console" on a running VM or system VM.
   3. Repeat opening the console several times.
   4. Observe that console access succeeds intermittently — some attempts load 
the noVNC console, others return a generic "Internal Server Error" page in the 
browser.
   5. On a failing attempt, check the CPVM log (/var/log/cloud/cloud.out):
        Session <uuid> has already been used, cannot connect
        External authenticator failed authentication request for vm <vm-uuid> with sid <sid>
        com.cloud.consoleproxy.AuthenticationException: External authenticator failed request
   6. On a failing attempt, check both management-server logs — NOTHING is 
logged on either MS (no createConsoleEndpoint, no 
ConsoleAccessAuthenticationCommand). On a succeeding attempt, the full flow IS 
logged on the minting MS.
   7. Inspect the cloud.console_session table:
        SELECT id, uuid, acquired, removed, console_endpoint_creator_address, client_address
        FROM cloud.console_session ORDER BY created DESC LIMIT 10;
      Note that console_endpoint_creator_address is recorded as the load 
balancer IP, and in some rows the client IP — never a real management server 
IP. Rows with these creator addresses are the ones that never reach 
'acquired'/'removed', and these correlate with the failures.
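
As a follow-up to step 7, an aggregate query can summarise how many sessions per creator address were never acquired. The sketch below runs such a query in Python against an in-memory SQLite stand-in loaded with the five rows from this report. The table here has only a subset of the real columns, and the function name is illustrative; against the real MySQL database the same SELECT would target cloud.console_session.

```python
import sqlite3

# (id, acquired, removed, creator_address, client_address) from this report.
ROWS = [
    (1414, None, None, "172.31.1.204", None),
    (1413, "2026-06-15 13:40:54", "2026-06-15 13:41:00", "172.31.1.204", "172.31.1.204"),
    (1412, "2026-06-15 13:39:10", "2026-06-15 13:39:20", "10.125.128.38", "172.31.1.204"),
    (1411, "2026-06-15 13:37:16", "2026-06-15 13:39:10", "10.125.128.38", "172.31.1.204"),
    (1410, None, None, "10.125.128.38", None),
]

SUMMARY_SQL = """
SELECT console_endpoint_creator_address AS creator,
       COUNT(*) AS total,
       SUM(acquired IS NULL) AS never_acquired
FROM console_session
GROUP BY console_endpoint_creator_address
ORDER BY creator
"""


def summarize_by_creator(rows):
    """Load rows into a throwaway SQLite table and run the summary query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE console_session ("
        "id INTEGER PRIMARY KEY, acquired TEXT, removed TEXT, "
        "console_endpoint_creator_address TEXT, client_address TEXT)"
    )
    conn.executemany("INSERT INTO console_session VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(SUMMARY_SQL).fetchall()
    conn.close()
    return result
```

Running this over a larger sample than the five rows shown would give a firmer picture of how creator address and acquisition relate.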
   
   ### What to do about it?
   
   console_endpoint_creator_address is being populated with the load balancer 
IP (and sometimes the client IP) instead of the processing management server's 
own management IP. Since the CPVM validates the one-time console token against 
the recorded creator address, and neither the LB nor the client can service 
that validation callback, those sessions are never acquired and the CPVM 
rejects the connection — producing the intermittent HTTP 500.
   
   Expected: console_endpoint_creator_address should be the management server 
that processed the createConsoleEndpoint call (a real MS IP from the `host` 
setting), regardless of whether the request arrived via a load balancer.
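
The expected behaviour, together with the single-use semantics behind the "already been used" message, can be illustrated with a toy model. This is a hypothetical Python sketch and not CloudStack's implementation: the class names, method names, and the 192.0.2.10 placeholder address are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConsoleSession:
    creator_address: str
    acquired: bool = False


class SessionStore:
    """Toy model of one management server's console session handling."""

    def __init__(self, own_management_ip: str):
        self.own_management_ip = own_management_ip
        self.sessions: Dict[str, ConsoleSession] = {}

    def create(self, session_id: str, request_source_ip: str) -> ConsoleSession:
        # Expected behaviour per this report: record the management server's
        # own IP as creator. request_source_ip (a load balancer or client
        # address) is deliberately ignored for that field.
        session = ConsoleSession(creator_address=self.own_management_ip)
        self.sessions[session_id] = session
        return session

    def acquire(self, session_id: str) -> bool:
        # Single-use: the first acquisition succeeds, any later one is refused.
        session = self.sessions.get(session_id)
        if session is None or session.acquired:
            return False
        session.acquired = True
        return True
```

In this model a session created through the load balancer still records the management server's own address, and a second `acquire` of the same session ID is refused, which is the kind of rejection the CPVM logs as "has already been used".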

