[ 
https://issues.apache.org/jira/browse/GUACAMOLE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paweł Grabczak updated GUACAMOLE-2221:
--------------------------------------
    Summary: RDP busy-loop on transport failure  (was: Infinity loop on RDP 
transport-layer fail)

> RDP busy-loop on transport failure
> ----------------------------------
>
>                 Key: GUACAMOLE-2221
>                 URL: https://issues.apache.org/jira/browse/GUACAMOLE-2221
>             Project: Guacamole
>          Issue Type: Bug
>          Components: guacamole-server
>    Affects Versions: 1.6.0
>         Environment: FreeBSD 14.3
> Guacamole server 1.6.0
>            Reporter: Paweł Grabczak
>            Priority: Major
>
> Hello,
> I believe I've found a bug introduced (or at least exposed) after upgrading 
> to Guacamole Server 1.6. I've been using guacd for a long time, and after 
> upgrading from 1.4 to 1.6 I noticed an issue with the RDP protocol.
> When an RDP connection hits a transport-level failure (for example, a 
> certificate-related error), the guacd process never exits and starts 
> consuming 100% CPU. With a native FreeRDP client, the connection fails 
> correctly and exits, and the client logs show:
> {code:java}
> ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]{code}
> With guacd, it looks like it enters an infinite loop. In guacd debug logs I 
> see the following repeating continuously:
> {code:java}
>  guacd[40267]: DEBUG:    SVC "rdpdr" connected.                           
>  guacd[40267]: DEBUG:    SVC "rdpsnd" connected.              
>  guacd[40267]: DEBUG:    Loading Dynamic Virtual Channel disp            
>  guacd[40267]: DEBUG:    Loading Dynamic Virtual Channel rdpgfx
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    BIO_read returned a system error 0: No error: 0
>  guacd[40267]: DEBUG:    transport_read_layer:freerdp_set_last_error_ex 
> ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""{code}
> To understand what guacd was doing while spinning, I captured system calls. 
> The trace shows repeated poll(..., 0) calls (zero timeout) and read() 
> returning EAGAIN, which explains the tight loop and 100% CPU usage:
> {code:java}
>   19143: gettimeofday({ 1770992548.370728 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370771 },0x0)   = 0 (0x0)                
>   19143: poll({ 13/POLLIN },1,0)                   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370840 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370873 },0x0)   = 0 (0x0)
>   19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN 
> },6,0) = 1 (0x1)                                                              
>   
>   19143: read(5,0x85b0fde18,8)                     ERR#35 'Resource 
> temporarily unavailable'                   
>   19143: gettimeofday({ 1770992548.371020 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371042 },0x0)   = 0 (0x0)                
>   19143: poll({ 13/POLLIN },1,0)                   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371094 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371137 },0x0)   = 0 (0x0)
>   19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN 
> },6,0) = 1 (0x1)                                                              
>   
>   19143: read(5,0x85b0fde18,8)                     ERR#35 'Resource 
> temporarily unavailable'                   
>   19143: gettimeofday({ 1770992548.371255 },0x0)   = 0 (0x0)                
> ...{code}
> Root cause (as I understand it)
> From investigating the code, it looks like the error state in {{wait_result}} 
> can be overwritten:
>  * When the transport failure occurs, {{guac_rdp_handle_events()}} sets 
> {{wait_result}} to {{{}-1{}}}.
>  * However, the loop does not exit immediately.
>  * In the next iteration, the {{while}} condition assigns {{wait_result}} 
> from {{{}rdp_guac_client_wait_for_events(){}}}, which returns {{{}0{}}}, 
> effectively clearing the error flag.
>  * As a result, the loop never observes the transport failure and continues 
> polling, spinning the CPU.
> h3. Proposed fix
> I prepared a PR that breaks out of the loop immediately after 
> {{guac_rdp_handle_events()}} returns an error, so we stop processing once 
> FreeRDP reports a connection failure.
> The change fixes the issue for me, but I'm not sure whether it could 
> introduce any regressions in other scenarios. I would appreciate a review and 
> any suggestions for improvement.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to