On Tue, Mar 12, 2024 at 03:48:24PM -0400, Jeremy Bícha wrote:
> Debian Policy does not say that it is a severity: serious bug because
> you are unable to compile gcr4 on your particular AWS instance. I
> understand that this bug appears serious to you. I also agree that
> there is a real bug in gcr. Perhaps the bug is a race condition.
> Fixing the issue that causes gcr build tests to fail 100% in your test
> case may also fix the flakiness issue seen on the official buildds.

I've been attempting to debug this on an AWS instance provided by
Santiago.  So far I'm afraid I can only report some partial progress,
but I might as well write down what I've got so far.

Whatever the bug is, it is highly sensitive to small perturbations.  For
instance, I found that commenting out non-failing g_test_add calls from
gck/test-gck-object.c:main (even those that run _after_ the tests that
typically fail) was enough to make it fail significantly less often.  I
suspect that this is just the effect of tweaking the state of hash
tables or a random number generator or something.

More unfortunately, attaching almost any kind of debugging tool seems to
perturb timing such that the problem is no longer reproducible; in
particular I was unable to reproduce failures under gdb.  The best I
could do was to capture a core dump and load it into gdb afterwards:

  $ gdb gck/test-gck-object core
  GNU gdb (Debian 13.2-1+b1) 13.2
  Copyright (C) 2023 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.
  Type "show copying" and "show warranty" for details.
  This GDB was configured as "x86_64-linux-gnu".
  Type "show configuration" for configuration details.
  For bug reporting instructions, please see:
  <https://www.gnu.org/software/gdb/bugs/>.
  Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.
  
  For help, type "help".
  Type "apropos word" to search for commands related to "word"...
  Reading symbols from gck/test-gck-object...
  [New LWP 31755]
  [New LWP 31753]
  [New LWP 31754]
  [New LWP 31751]
  [New LWP 31756]
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
  Core was generated by `/home/cjwatson/gcr4-4.2.0/obj-x86_64-linux-gnu/gck/test-gck-object'.
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  0x00007f4fef3a5633 in find_attribute (attr_type=3, n_attrs=12008468691120727718, attrs=0x55bef952d90a) at ../gck/gck-attributes.c:336
  336                     if (attrs[i].type == attr_type)
  [Current thread is 1 (Thread 0x7f4fed97a6c0 (LWP 31755))]
  (gdb) thread apply all bt
  
  Thread 5 (Thread 0x7f4fed1796c0 (LWP 31756)):
  #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  #1  0x00007f4fef2ffc90 in g_cond_wait_until () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #2  0x00007f4fef26e143 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #3  0x00007f4fef2d24ba in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #4  0x00007f4fef2d1ab1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #5  0x00007f4fef08f45c in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
  #6  0x00007f4fef10fbbc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  
  Thread 4 (Thread 0x7f4fee9c4a00 (LWP 31751)):
  #0  0x00007f4fef102abf in __GI___poll (fds=0x55bba2e90cb0, nfds=1, timeout=500) at ../sysdeps/unix/sysv/linux/poll.c:29
  #1  0x00007f4fef2a4277 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #2  0x00007f4fef2a4c1f in g_main_loop_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #3  0x000055bba2564664 in loop_wait_until (timeout=<optimized out>) at ../egg/egg-testing.c:310
  #4  0x000055bba2562a4d in test_find_objects (test=0x55bba2e8f5c0, unused=<optimized out>) at ../gck/test-gck-object.c:403
  #5  0x00007f4fef2cf71e in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #6  0x00007f4fef2cf513 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #7  0x00007f4fef2cf513 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #8  0x00007f4fef2cfc32 in g_test_run_suite () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #9  0x00007f4fef2cfcb8 in g_test_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #10 0x000055bba2564b96 in egg_tests_run_with_loop () at ../egg/egg-testing.c:326
  #11 0x000055bba256268e in main (argc=<optimized out>, argv=<optimized out>) at ../gck/test-gck-object.c:426
  
  Thread 3 (Thread 0x7f4fee17b6c0 (LWP 31754)):
  #0  0x00007f4fef102abf in __GI___poll (fds=0x55bba2e904b0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  #1  0x00007f4fef2a4277 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #2  0x00007f4fef2a4930 in g_main_context_iteration () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #3  0x00007f4fef2a4981 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #4  0x00007f4fef2d1ab1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #5  0x00007f4fef08f45c in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
  #6  0x00007f4fef10fbbc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  
  Thread 2 (Thread 0x7f4fee97c6c0 (LWP 31753)):
  #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  #1  0x00007f4fef2ffac4 in g_cond_wait () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #2  0x00007f4fef26e16b in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #3  0x00007f4fef2d213a in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #4  0x00007f4fef2d1ab1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #5  0x00007f4fef08f45c in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
  #6  0x00007f4fef10fbbc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  
  Thread 1 (Thread 0x7f4fed97a6c0 (LWP 31755)):
  #0  0x00007f4fef3a5633 in find_attribute (attr_type=3, n_attrs=12008468691120727718, attrs=0x55bef952d90a) at ../gck/gck-attributes.c:336
  #1  gck_attributes_find (attrs=attrs@entry=0x55bba2e8a980, attr_type=3) at ../gck/gck-attributes.c:2077
  #2  0x00007f4fee9b616c in enumerate_and_find_objects (object=2, attrs=0x55bba2e8a980, user_data=0x7f4fed979ac0) at ../gck/gck-mock.c:1081
  #3  0x00007f4fee9b8237 in gck_mock_module_enumerate_objects (handle=handle@entry=113, func=func@entry=0x7f4fee9b6110 <enumerate_and_find_objects>, user_data=user_data@entry=0x7f4fed979ac0) at ../gck/gck-mock.c:203
  #4  0x00007f4fee9b8380 in gck_mock_C_FindObjectsInit (hSession=113, pTemplate=0x7f4fe4001770, ulCount=1) at ../gck/gck-mock.c:1114
  #5  0x00007f4fef3b13e9 in perform_find_objects (args=0x55bba2e95980) at ../gck/gck-session.c:1527
  #6  0x00007f4fef3b91d6 in perform_call (args=<optimized out>, cancellable=<optimized out>, cancellable@entry=0x55bba2e95980, func=<optimized out>) at ../gck/gck-call.c:67
  #7  perform_call_chain (perform=0x7f4fef3b1370 <perform_find_objects>, complete=0x0, cancellable=cancellable@entry=0x0, args=0x55bba2e95980) at ../gck/gck-call.c:97
  #8  0x00007f4fef3b936b in _gck_call_thread_func (task=0x55bba2e959b0, source_object=<optimized out>, task_data=0x55bba2e93250, cancellable=0x0) at ../gck/gck-call.c:132
  #9  0x00007f4feeed0ca7 in  () at /lib/x86_64-linux-gnu/libgio-2.0.so.0
  #10 0x00007f4fef2d2462 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #11 0x00007f4fef2d1ab1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #12 0x00007f4fef08f45c in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
  #13 0x00007f4fef10fbbc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

But the failures show up in different ways from run to run (SIGSEGV,
SIGTRAP, etc.); this is just one of them.  My suspicion is that this is a
thread-safety issue of some kind, perhaps something like accesses to a
hash table not being locked properly, but that's just a guess.  I'm not
even sure whether it's in gcr4 or in some other layer of the stack.

I'd be happy to try some other things if anyone has pointers for where
might be good places to look, or things I might be able to tweak to make
the bug more reliably reproducible (e.g. places to insert artificial
delays).
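
To illustrate the kind of artificial delay meant here, a minimal sketch
in plain POSIX C follows.  Nothing in it comes from gcr or GLib:
debug_jitter is a hypothetical helper name, the seed handling is an
assumption, and where to call it is exactly the open question above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical debugging helper (not part of gcr): sleep for a
 * pseudo-random time between 0 and max_us microseconds, inclusive,
 * and return the delay that was chosen.  Each thread should pass its
 * own seed so that the threads do not share generator state.
 * max_us is assumed to be well below UINT_MAX.
 */
static unsigned int
debug_jitter (unsigned int *seed, unsigned int max_us)
{
	struct timespec ts;
	unsigned int us;

	assert (seed != NULL);
	if (max_us == 0)
		return 0;

	/* rand_r keeps its state in *seed, so it is safe across threads. */
	us = (unsigned int) rand_r (seed) % (max_us + 1);

	ts.tv_sec = us / 1000000;
	ts.tv_nsec = (long) (us % 1000000) * 1000;

	/* An early wake-up caused by a signal is harmless here. */
	nanosleep (&ts, NULL);
	return us;
}
```

A call such as debug_jitter (&seed, 500) placed just before a suspected
unsynchronised access would widen any race window there; varying the
call site and the upper bound may help narrow down where the race is.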

I remain entirely unable to reproduce this bug in any form on my laptop.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]
