Hello,

I have a test in which multiple users join the same connection (RDP in this 
case) in quick succession: 15 users join, wait 20 seconds, disconnect, rejoin, 
and so on.
This test reliably deadlocks guacd (1.6.0, unmodified). The two interesting 
threads are:

#0 pthread_rwlock_rdlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_read_lock at rwlock.c:228
#2 guac_display_layer_get_bounds at display-layer.c:51
#3 guac_display_dup at display.c:259
#4 guac_rdp_join_pending_handler at client.c:135
#5 guac_client_promote_pending_users at client.c:178
#6 guac_client_pending_users_thread at client.c:246
#7 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#8 clone from /lib/x86_64-linux-gnu/libc.so.6

--> holds __pending_users_lock (which blocks adding/removing users) and a read 
lock on last_frame.lock; waits for pending_frame.lock


#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_display_end_multiple_frames at display-flush.c:323
#3 guac_display_worker_thread at display-worker.c:461
#4 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#5 clone from /lib/x86_64-linux-gnu/libc.so.6

--> holds pending_frame.lock; waits for last_frame.lock

I tried to fix this by acquiring pending_frame.lock in guac_display_dup before 
acquiring last_frame.lock.
This makes things a lot better.

But now I hit another issue reliably after about a minute. Somehow a thread is 
stuck in a socket write operation:
#0 write from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_socket_fd_write at socket-fd.c:109
#2 guac_socket_fd_flush at socket-fd.c:189
#3 guac_socket_fd_write_buffered at socket-fd.c:263
#4 guac_socket_fd_write_handler at socket-fd.c:318
#5 __guac_socket_write at socket.c:91
#6 guac_socket_write at socket.c:107
#7 __write_chunk_callback at socket-broadcast.c:135
#8 guac_client_foreach_pending_user at client.c:560
#9 __guac_socket_broadcast_write_handler at socket-broadcast.c:173
#10 __guac_socket_write at socket.c:91
#11 guac_socket_write at socket.c:107
#12 guac_socket_flush_base64 at socket.c:341
#13 guac_socket_write_base64 at socket.c:372
#14 guac_protocol_send_blob at protocol.c:262
#15 guac_png_flush_data at encode-png.c:79
#16 guac_png_write_data at encode-png.c:114
#17 guac_png_cairo_write_handler at encode-png.c:162
#18 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#19 png_write_chunk_data from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#20 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#21 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#22 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#23 png_write_row from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#24 png_write_image from /usr/lib/x86_64-linux-gnu/libpng16.so.16
#25 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#26 cairo_surface_write_to_png_stream from /usr/lib/x86_64-linux-gnu/libcairo.so.2
#27 guac_png_cairo_write at encode-png.c:195
#28 guac_png_write at encode-png.c:300
#29 guac_client_stream_png at client.c:799
#30 guac_display_dup at display.c:275
#31 guac_rdp_join_pending_handler at client.c:135
#32 guac_client_promote_pending_users at client.c:178
#33 guac_client_pending_users_thread at client.c:246
#34 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#35 clone from /lib/x86_64-linux-gnu/libc.so.6

This thread never exits, even after all users have disconnected.
No new users can join because the thread holds the __pending_users_lock.
Join threads:
#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_client_add_pending_user at client.c:440
#3 guac_client_add_user at client.c:479
#4 guac_user_handle_connection at user-handshake.c:339
#5 guacd_user_thread at proc.c:99
#6 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#7 clone from /lib/x86_64-linux-gnu/libc.so.6

Also, no existing users can be removed from the connection for the same reason.
Remove threads:
#0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
#1 guac_rwlock_acquire_write_lock at rwlock.c:186
#2 guac_client_remove_user at client.c:497
#3 guac_user_handle_connection at user-handshake.c:364
#4 guacd_user_thread at proc.c:99
#5 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
#6 clone from /lib/x86_64-linux-gnu/libc.so.6

I am not sure how to fix this. Any ideas? The only solution I can come up with 
is adding timeouts to the socket call that currently blocks.
But after finding the discussion at 
https://lists.apache.org/thread/94xrxq9w3kd4otcpdn3fh0jwn603m4wp it seems 
this might not be the preferred way to fix it.
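
For reference, this is roughly what I mean by a write with a timeout. It is 
only a sketch under assumptions, not libguac code: demo_write_with_timeout is 
a hypothetical helper, it assumes the descriptor is a socket on Linux (it uses 
send() with MSG_DONTWAIT and MSG_NOSIGNAL), and the timeout applies to each 
wait for writability rather than to the call as a whole.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of libguac. Writes all len bytes to the
 * socket fd, waiting at most timeout_ms each time the socket is not
 * writable. Returns len on success, or -1 with errno set (ETIMEDOUT if
 * the peer did not drain its buffer in time). */
ssize_t demo_write_with_timeout(int fd, const void* buf, size_t len,
        int timeout_ms) {

    const char* current = buf;
    size_t remaining = len;

    while (remaining > 0) {

        /* Wait until the socket can accept data, or give up */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
            errno = EPIPE;
            return -1;
        }

        /* Non-blocking send: may accept only part of the data */
        ssize_t written = send(fd, current, remaining,
                MSG_DONTWAIT | MSG_NOSIGNAL);
        if (written < 0) {
            if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
                continue;
            return -1;
        }

        current += written;
        remaining -= (size_t) written;
    }

    return (ssize_t) len;
}
```

With something like this, a user whose socket stops draining would cause an 
error return instead of blocking the thread that holds __pending_users_lock 
forever. Whether that is acceptable behaviour for guacd is exactly the open 
question from the linked discussion.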

Best Regards,
Markus
