Hello,

I have a test in which multiple users join the same connection (RDP in this case) in quick succession: 15 users join, wait 20 seconds, disconnect, rejoin, and so on. This test reliably deadlocks guacd (1.6.0, unmodified). The two interesting threads are:

#0 pthread_rwlock_rdlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_read_lock at rwlock.c:228
#2 guac_display_layer_get_bounds at display-layer.c:51
#3 guac_display_dup at display.c:259
#4 guac_rdp_join_pending_handler at client.c:135
#5 guac_client_promote_pending_users at client.c:178
#6 guac_client_pending_users_thread at client.c:246
#7 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#8 clone from /lib/x86_64-linux-gnu/libc.so.6

--> has __pending_users_lock (which breaks adding/removing users)
    waits for pending_frame.lock
    has read lock on last_frame.lock

#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_display_end_multiple_frames at display-flush.c:323
#3 guac_display_worker_thread at display-worker.c:461
#4 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#5 clone from /lib/x86_64-linux-gnu/libc.so.6

--> has pending_frame.lock
    waits for last_frame.lock

I tried to fix this by acquiring pending_frame.lock in guac_display_dup before acquiring last_frame.lock, so that both threads take the two locks in the same order. This makes things a lot better, but now I reliably hit another issue after about a minute.

Somehow a thread is stuck in a socket write operation:

#0 write from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_socket_fd_write at socket-fd.c:109
#2 guac_socket_fd_flush at socket-fd.c:189
#3 guac_socket_fd_write_buffered at socket-fd.c:263
#4 guac_socket_fd_write_handler at socket-fd.c:318
#5 __guac_socket_write at socket.c:91
#6 guac_socket_write at socket.c:107
#7 __write_chunk_callback at socket-broadcast.c:135
#8 guac_client_foreach_pending_user at client.c:560
#9 __guac_socket_broadcast_write_handler at socket-broadcast.c:173
#10 __guac_socket_write at socket.c:91
#11 guac_socket_write at socket.c:107
#12 guac_socket_flush_base64 at socket.c:341
#13 guac_socket_write_base64 at socket.c:372
#14 guac_protocol_send_blob at protocol.c:262
#15 guac_png_flush_data at encode-png.c:79
#16 guac_png_write_data at encode-png.c:114
#17 guac_png_cairo_write_handler at encode-png.c:162
#18 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#19 png_write_chunk_data from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#20 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#21 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#22 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#23 png_write_row from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#24 png_write_image from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#25 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#26 cairo_surface_write_to_png_stream from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#27 guac_png_cairo_write at encode-png.c:195
#28 guac_png_write at encode-png.c:300
#29 guac_client_stream_png at client.c:799
#30 guac_display_dup at display.c:275
#31 guac_rdp_join_pending_handler at client.c:135
#32 guac_client_promote_pending_users at client.c:178
#33 guac_client_pending_users_thread at client.c:246
#34 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#35 clone from /lib/x86_64-linux-gnu/libc.so.6

This thread never exits, even if all users are disconnected.

No new users can join anymore because the thread holds the __pending_users_lock.

Join threads:

#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_client_add_pending_user at client.c:440
#3 guac_client_add_user at client.c:479
#4 guac_user_handle_connection at user-handshake.c:339
#5 guacd_user_thread at proc.c:99
#6 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#7 clone from /lib/x86_64-linux-gnu/libc.so.6

Also, no existing users can be removed from the connection, for the same reason.

Remove threads:

#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_client_remove_user at client.c:497
#3 guac_user_handle_connection at user-handshake.c:364
#4 guacd_user_thread at proc.c:99
#5 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#6 clone from /lib/x86_64-linux-gnu/libc.so.6

I am not sure how to fix this. Any ideas? Adding a timeout to the socket write that currently blocks is the only solution I can come up with, but after reading the discussion in https://lists.apache.org/thread/94xrxq9w3kd4otcpdn3fh0jwn603m4wp it seems this might not be the preferred way to fix it.

Best Regards,
Markus
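
P.S. For reference, a minimal sketch of what I mean by adding a timeout to the blocking write. This is not libguac code and not a proposed patch: write_with_timeout and demo_stalled_peer are hypothetical names, and the sketch assumes the descriptor is a socket so that send() with MSG_DONTWAIT can be used. It waits with poll() until the socket is writable or the timeout expires, then sends without blocking. The demo uses a socketpair whose peer never reads, which is my guess at what happens when a pending user stops consuming data.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for fd to become writable,
 * then sends without blocking. Returns the number of bytes sent (possibly
 * fewer than len), or -1 with errno set (ETIMEDOUT if the wait expired). */
ssize_t write_with_timeout(int fd, const void* buf, size_t len,
        int timeout_ms) {

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int ready;

    /* Retry the wait if it is interrupted by a signal. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }

    if (ready < 0)
        return -1;

    /* MSG_DONTWAIT keeps a large write from blocking on a full buffer. */
    return send(fd, buf, len, MSG_DONTWAIT);
}

/* Writes to a socket whose peer never reads. Returns 1 if the writer
 * eventually gave up with ETIMEDOUT instead of blocking forever. */
int demo_stalled_peer(void) {

    int fds[2];
    char chunk[4096];
    int timed_out = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    memset(chunk, 'x', sizeof(chunk));

    /* fds[1] is never read, so the send buffer eventually fills up. */
    for (int i = 0; i < 100000; i++) {
        if (write_with_timeout(fds[0], chunk, sizeof(chunk), 100) < 0) {
            timed_out = (errno == ETIMEDOUT);
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return timed_out;
}
```

With something like this, a stalled user would cause an error in the pending-users thread rather than leaving it blocked while it holds __pending_users_lock, but the caller would still need to decide what to do with that user and with partial writes.
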