Paweł Grabczak created GUACAMOLE-2221:
-----------------------------------------
Summary: Infinity loop on RDP transport-layer fail
Key: GUACAMOLE-2221
URL: https://issues.apache.org/jira/browse/GUACAMOLE-2221
Project: Guacamole
Issue Type: Bug
Components: guacamole-server
Affects Versions: 1.6.0
Environment: FreeBSD 14.3
Guacamole server 1.6.0
Reporter: Paweł Grabczak
Hello,
I believe I've found a bug introduced (or at least exposed) after upgrading to
Guacamole Server 1.6. I've been using guacd for a long time, and after
upgrading from 1.4 to 1.6 I noticed an issue with the RDP protocol.
When an RDP connection hits a transport-level failure (for example, a
certificate-related error), the guacd process never exits and starts consuming
100% CPU. With a native FreeRDP client, the connection fails correctly and
exits, and the client logs show:
{code:java}
ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]{code}
With guacd, it looks like it enters an infinite loop. In guacd debug logs I see
the following repeating continuously:
{code:java}
guacd[40267]: DEBUG: SVC "rdpdr" connected.
guacd[40267]: DEBUG: SVC "rdpsnd" connected.
guacd[40267]: DEBUG: Loading Dynamic Virtual Channel disp
guacd[40267]: DEBUG: Loading Dynamic Virtual Channel rdpgfx
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: BIO_read returned a system error 0: No error: 0
guacd[40267]: DEBUG: transport_read_layer:freerdp_set_last_error_ex
ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""
guacd[40267]: DEBUG: Handler not found for ""{code}
To understand what guacd was doing while spinning, I captured system calls. The
trace shows repeated poll(..., 0) calls (zero timeout) and read() returning
EAGAIN, which explains the tight loop and 100% CPU usage:
{code:java}
19143: gettimeofday({ 1770992548.370728 },0x0) = 0 (0x0)
19143: gettimeofday({ 1770992548.370771 },0x0) = 0 (0x0)
19143: poll({ 13/POLLIN },1,0) = 0 (0x0)
19143: gettimeofday({ 1770992548.370840 },0x0) = 0 (0x0)
19143: gettimeofday({ 1770992548.370873 },0x0) = 0 (0x0)
19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN
},6,0) = 1 (0x1)
19143: read(5,0x85b0fde18,8) ERR#35 'Resource temporarily
unavailable'
19143: gettimeofday({ 1770992548.371020 },0x0) = 0 (0x0)
19143: gettimeofday({ 1770992548.371042 },0x0) = 0 (0x0)
19143: poll({ 13/POLLIN },1,0) = 0 (0x0)
19143: gettimeofday({ 1770992548.371094 },0x0) = 0 (0x0)
19143: gettimeofday({ 1770992548.371137 },0x0) = 0 (0x0)
19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN
},6,0) = 1 (0x1)
19143: read(5,0x85b0fde18,8) ERR#35 'Resource temporarily
unavailable'
19143: gettimeofday({ 1770992548.371255 },0x0) = 0 (0x0)
...{code}
Root cause (as I understand it)
>From investigating the code, it looks like the error state in {{wait_result}}
>can be overwritten:
* When the transport failure occurs, {{guac_rdp_handle_events()}} sets
{{wait_result}} to {{{}-1{}}}.
* However, the loop does not exit immediately.
* In the next iteration, the {{while}} condition assigns {{wait_result}} from
{{{}rdp_guac_client_wait_for_events(){}}}, which returns {{{}0{}}}, effectively
clearing the error flag.
* As a result, the loop never observes the transport failure and continues
polling, spinning the CPU.
h3. Proposed fix
I prepared a PR that breaks out of the loop immediately after
{{guac_rdp_handle_events()}} returns an error, so we stop processing once
FreeRDP reports a connection failure.
The change fixes the issue for me, but I'm not sure whether it could introduce
any regressions in other scenarios. I would appreciate a review and any
suggestions for improvement.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)