[ 
https://issues.apache.org/jira/browse/GUACAMOLE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059594#comment-18059594
 ] 

Paweł Grabczak commented on GUACAMOLE-2221:
-------------------------------------------

During deeper analysis I realized my PR is essentially a revert of commit of 
[https://github.com/apache/guacamole-server/commit/4220670d] , which was added 
to handle a similar failure mode. With my change applied I can’t reproduce the 
regression described in that commit. If reverting is risky, I can instead gate 
the loop exit on FreeRDP’s error state and only keep polling while 
{{FREERDP_ERROR_SUCCESS}} is reported.

Something like that: 

 
{code:java}
/* Break if FreeRDP has recorded a connection error */
if (freerdp_get_last_error(GUAC_RDP_CONTEXT(rdp_inst)) != 
FREERDP_ERROR_SUCCESS) {
    wait_result = -1;
    break;
}
{code}
 

> RDP busy-loop on transport failure
> ----------------------------------
>
>                 Key: GUACAMOLE-2221
>                 URL: https://issues.apache.org/jira/browse/GUACAMOLE-2221
>             Project: Guacamole
>          Issue Type: Bug
>          Components: guacamole-server
>    Affects Versions: 1.6.0
>         Environment: FreeBSD 14.3
> Guacamole server 1.6.0
>            Reporter: Paweł Grabczak
>            Priority: Major
>
> Hello,
> I believe I've found a bug introduced (or at least exposed) after upgrading 
> to Guacamole Server 1.6. I've been using guacd for a long time, and after 
> upgrading from 1.4 to 1.6 I noticed an issue with the RDP protocol.
> When an RDP connection hits a transport-level failure (for example, a 
> certificate-related error), the guacd process never exits and starts 
> consuming 100% CPU. With a native FreeRDP client, the connection fails 
> correctly and exits, and the client logs show:
> {code:java}
> ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]{code}
> With guacd, it looks like it enters an infinite loop. In guacd debug logs I 
> see the following repeating continuously:
> {code:java}
>  guacd[40267]: DEBUG:    SVC "rdpdr" connected.                           
>  guacd[40267]: DEBUG:    SVC "rdpsnd" connected.              
>  guacd[40267]: DEBUG:    Loading Dynamic Virtual Channel disp            
>  guacd[40267]: DEBUG:    Loading Dynamic Virtual Channel rdpgfx
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    Handler not found for ""                          
>  guacd[40267]: DEBUG:    BIO_read returned a system error 0: No error: 0
>  guacd[40267]: DEBUG:    transport_read_layer:freerdp_set_last_error_ex 
> ERRCONNECT_CONNECT_TRANSPORT_FAILED [0x0002000D]
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""
>  guacd[40267]: DEBUG:    Handler not found for ""{code}
> To understand what guacd was doing while spinning, I captured system calls. 
> The trace shows repeated poll(..., 0) calls (zero timeout) and read() 
> returning EAGAIN, which explains the tight loop and 100% CPU usage:
> {code:java}
>   19143: gettimeofday({ 1770992548.370728 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370771 },0x0)   = 0 (0x0)                
>   19143: poll({ 13/POLLIN },1,0)                   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370840 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.370873 },0x0)   = 0 (0x0)
>   19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN 
> },6,0) = 1 (0x1)                                                              
>   
>   19143: read(5,0x85b0fde18,8)                     ERR#35 'Resource 
> temporarily unavailable'                   
>   19143: gettimeofday({ 1770992548.371020 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371042 },0x0)   = 0 (0x0)                
>   19143: poll({ 13/POLLIN },1,0)                   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371094 },0x0)   = 0 (0x0)                
>   19143: gettimeofday({ 1770992548.371137 },0x0)   = 0 (0x0)
>   19143: poll({ 5/POLLIN 9/POLLIN 86/POLLIN 14/POLLIN 12/POLLIN 13/POLLIN 
> },6,0) = 1 (0x1)                                                              
>   
>   19143: read(5,0x85b0fde18,8)                     ERR#35 'Resource 
> temporarily unavailable'                   
>   19143: gettimeofday({ 1770992548.371255 },0x0)   = 0 (0x0)                
> ...{code}
> Root cause (as I understand it)
> From investigating the code, it looks like the error state in {{wait_result}} 
> can be overwritten:
>  * When the transport failure occurs, {{guac_rdp_handle_events()}} sets 
> {{wait_result}} to {{{}-1{}}}.
>  * However, the loop does not exit immediately.
>  * In the next iteration, the {{while}} condition assigns {{wait_result}} 
> from {{{}rdp_guac_client_wait_for_events(){}}}, which returns {{{}0{}}}, 
> effectively clearing the error flag.
>  * As a result, the loop never observes the transport failure and continues 
> polling, spinning the CPU.
> h3. Proposed fix
> I prepared a PR that breaks out of the loop immediately after 
> {{guac_rdp_handle_events()}} returns an error, so we stop processing once 
> FreeRDP reports a connection failure.
> The change fixes the issue for me, but I'm not sure whether it could 
> introduce any regressions in other scenarios. I would appreciate a review and 
> any suggestions for improvement.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to