Hi Rick,

It's not clear to me what is going on there :-/

Are you able to reproduce this issue with SoftHSM2? Or would it be possible
to give me access to the HSM? Otherwise, I have no idea how to debug the
crash.

Daniel

On 2018-08-27 16:45, Rick van Rein wrote:
Hi,

Roland and I ran into a crashing condition for knotd 2.6.[689],
presumably caused by a race condition in the threaded use of PKCS #11
sessions. We use a commercial, replicated, networked HSM and not SoftHSM2.

WORKAROUND:

We do have a work-around with "conf-set server.background-workers 1", so
this is not a blocking condition for us, but we would like to bring back
concurrent handling of our ~1700 zones later.
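
For completeness, the persistent form of this work-around in knot.conf
would, if I read the documentation correctly, be the following minimal
excerpt (assuming the option maps one-to-one to the conf-set key above):

server:
    background-workers: 1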

PROBLEM DESCRIPTION:

Without this work-around, we see crashes quite reliably under a load that
issues a number of zone-set/zone-unset commands, fired by sequentialised
knotc processes at a knotd that continues zone signing concurrently.
The commands are generated with the knot-aware option -k of ldns-zonediff,
https://github.com/SURFnet/ldns-zonediff

ANALYSIS:

Our HSM reports errors that look like a session handle being reused and
then repeatedly logged into, though not consistently, so it looks like a
race condition on a session variable:

27.08.2018 11:48:59 | [00006AE9:00006AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:02 | [00006AE9:00006AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:03 | [00006AE9:00006AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:55:50 | [0000744C:0000744E] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.

These errors stopped being reported once the work-around was configured.
Until then, we had crashes; the following backtrace shows one of them:

Thread 4 "knotd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff696902a in __GI_abort () at abort.c:89
#2  0x00007ffff69a97ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7ffff6ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff69b237a in malloc_printerr (ar_ptr=<optimized out>,
    ptr=<optimized out>,
    str=0x7ffff6ac2fe8 "double free or corruption (out)", action=3) at
malloc.c:5006
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at
malloc.c:3867
#5  0x00007ffff69b653c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#6  0x0000555555597ed3 in ?? ()
#7  0x00005555555987c2 in ?? ()
#8  0x000055555559ba01 in ?? ()
#9 0x00007ffff7120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4
#10 0x00007ffff6d036ba in start_thread (arg=0x7fffcd1bd700) at
pthread_create.c:333
#11 0x00007ffff6a3941d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

DEBUGGING HINTS:

Our suspicion is that you may not have set the mutex callbacks when
invoking C_Initialize() on PKCS #11, possibly because the intermediate
layers of abstraction hide this from view.  This is a common oversight.
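
To illustrate what we mean, here is a rough sketch against the raw
Cryptoki API; this is only an assumption about where the fix would go,
since in Knot the actual C_Initialize() call is presumably buried in
those intermediate layers:

#include <string.h>
#include <p11-kit/pkcs11.h>  /* any Cryptoki header with the standard
                              * declarations will do; p11-kit's is
                              * self-contained */

/* Sketch: declare the application multi-threaded before any sessions are
 * opened, either by letting the module use native OS locks (as below) or
 * by supplying CreateMutex/DestroyMutex/LockMutex/UnlockMutex callbacks.
 * Without this, a PKCS #11 module may assume single-threaded use, and
 * concurrent session handling can race, which would match what we see. */
static CK_RV initialize_cryptoki_for_threads(void)
{
    CK_C_INITIALIZE_ARGS args;
    memset(&args, 0, sizeof(args));
    args.flags = CKF_OS_LOCKING_OK;  /* let the module create OS mutexes */
    return C_Initialize(&args);
}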

Then again, the double free might be another hint.

This is on our soon-to-go-live platform, so I'm afraid it will be very
difficult to do much more testing; I hope this suffices for your debugging!


I hope this helps Knot DNS to move forward!
 -Rick
--
https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users
