Hi. I reviewed a kdump generated by a NULL pointer dereference during
termination of an iSCSI session. In this instance, the session was
terminated because of a 'Target-Not-Found' error from the target during login.
The system is running SLES15 SP4 (v5.14.21).
crash> bt
PID: 61755 TASK: ffff88ae57e4c380 CPU: 6 COMMAND: "kworker/u40:3"
#0 [ffffc90006b6fae8] machine_kexec at ffffffff8106af4e
#1 [ffffc90006b6fb38] __crash_kexec at ffffffff81168dce
#2 [ffffc90006b6fc00] panic at ffffffff8191aa0f
#3 [ffffc90006b6fc88] oops_end at ffffffff8102e3dd
#4 [ffffc90006b6fca8] page_fault_oops at ffffffff8107b6fb
#5 [ffffc90006b6fd28] exc_page_fault at ffffffff81923610
#6 [ffffc90006b6fd50] asm_exc_page_fault at ffffffff81a00f39
[exception RIP: iscsi_sw_tcp_release_conn+111]
RIP: ffffffffc0c8243f RSP: ffffc90006b6fe08 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8881cb225388 RCX: 0000000000000001
RDX: ffff88adbf660900 RSI: ffffffff81f7cb84 RDI: ffff88adbf660980
RBP: ffff888ad68cd140 R8: 0000000000000001 R9: 0000000000000001
R10: 0000000000000000 R11: 00000000000001d2 R12: ffff8881cb225388
R13: ffff8881cb2256a8 R14: ffff8881cb2256a8 R15: ffff888105d8ca05
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffc90006b6fe38] iscsi_sw_tcp_conn_stop at ffffffffc0c825fd [iscsi_tcp]
#8 [ffffc90006b6fe58] iscsi_stop_conn at ffffffffc0f276f3 [scsi_transport_iscsi]
#9 [ffffc90006b6fe78] iscsi_cleanup_conn_work_fn at ffffffffc0f277f8 [scsi_transport_iscsi]
#10 [ffffc90006b6fea0] process_one_work at ffffffff810b5766
#11 [ffffc90006b6fed8] worker_thread at ffffffff810b595d
#12 [ffffc90006b6ff10] kthread at ffffffff810bdb63
#13 [ffffc90006b6ff50] ret_from_fork at ffffffff8100204f
Based on code review and journal logs, iscsid detects the login error and
initiates a TERM stop from user space. In parallel, the kernel driver
detects a socket error and initiates a RECOVERY stop on the connection.
*Initiated by iscsid*
iscsi_recv_login_rsp ->
iscsi_login_eh ->
session_conn_shutdown ->
kstop_conn ->
iscsi_if_transport_conn ->
iscsi_if_stop_conn ->
iscsi_stop_conn(conn, STOP_CONN_TERM)
*Initiated by error on TCP socket*
iscsi_sw_sk_state_check ->
iscsi_conn_failure ->
iscsi_conn_error_event ->
queue_work(iscsi_conn_cleanup_workq, &conn->cleanup_work);
.
.
iscsi_cleanup_conn_work_fn ->
iscsi_stop_conn(conn, STOP_CONN_RECOVER);
The NULL pointer dereference occurred in the *iscsi_stop_conn* call initiated
from the cleanup worker thread. Both *iscsi_sw_tcp_conn_stop* and
*iscsi_sw_tcp_release_conn* check for a NULL sock pointer in the connection,
but the call to *iscsi_sw_tcp_conn_restore_callbacks* within
*iscsi_sw_tcp_release_conn* does not. This leaves a small window in which the
connection's socket pointer can be set to NULL by the other
*iscsi_stop_conn* call running in parallel, resulting in this exception.
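
To make the window concrete, here is a minimal user-space model of the check-then-use pattern described above. It is only a sketch: the struct names, the result enum, and the hook mechanism are hypothetical stand-ins and not the kernel's definitions, and the interleaving is forced deterministically in a single thread rather than reproduced with a real race. Where the kernel would fault, the model returns RELEASE_WOULD_OOPS instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's struct definitions. */
struct model_sock {
    int callbacks_restored;
};

struct model_conn {
    struct model_sock *sock;   /* cleared by whichever stop path finishes first */
};

/* Outcome of one modelled release attempt. */
enum release_result {
    RELEASE_SKIPPED,     /* socket already gone at the entry check */
    RELEASE_OK,          /* callbacks restored, socket pointer cleared */
    RELEASE_WOULD_OOPS   /* socket vanished after the entry check */
};

/* Hook standing in for the other stop path running inside the window. */
typedef void (*window_hook)(struct model_conn *conn);

/* Models the competing stop call clearing the socket pointer. */
void other_path_clears_sock(struct model_conn *conn)
{
    conn->sock = NULL;
}

enum release_result model_release_conn(struct model_conn *conn,
                                       window_hook in_window)
{
    assert(conn != NULL);

    if (conn->sock == NULL)      /* entry check, as in the description above */
        return RELEASE_SKIPPED;

    if (in_window != NULL)       /* the other path gets to run here */
        in_window(conn);

    /*
     * The restore-callbacks step re-reads conn->sock without checking it.
     * The model reports the condition instead of dereferencing NULL.
     */
    if (conn->sock == NULL)
        return RELEASE_WOULD_OOPS;

    conn->sock->callbacks_restored = 1;
    conn->sock = NULL;
    return RELEASE_OK;
}
```

Calling model_release_conn() with other_path_clears_sock as the hook models the failing interleaving; passing NULL as the hook models the uncontended case.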
It would be simple enough to add a check for a NULL socket pointer in
*iscsi_sw_tcp_conn_restore_callbacks*, but I'm not convinced that is the
correct solution. It looks to me that
the resulting state of the session and connections would be different
depending on which of the two calls executes first. If the cleanup thread
successfully stops the connection with STOP_CONN_RECOVER, it will set the
socket pointer in the connection to NULL, and this will short-circuit the
iscsid STOP_CONN_TERM stop and keep it from modifying the connection/session
states.
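
The ordering concern can be illustrated with a small model. This is a sketch under assumptions: the enum values, the has_sock flag, and the single state field are hypothetical simplifications of the real connection/session state, chosen only to show that a stop call which returns early on a missing socket makes the final state depend on which call ran first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; simplified stand-ins for the real flags and states. */
enum stop_flag { FLAG_RECOVER, FLAG_TERM };
enum sess_state { SESS_LOGGED_IN, SESS_IN_RECOVERY, SESS_TERMINATED };

struct ord_conn {
    int has_sock;            /* nonzero while the socket pointer is set */
    enum sess_state state;
};

/* Models a stop call that short-circuits once the socket is gone. */
void ord_stop(struct ord_conn *c, enum stop_flag flag)
{
    assert(c != NULL);

    if (!c->has_sock)        /* second caller returns here, state untouched */
        return;

    c->has_sock = 0;
    c->state = (flag == FLAG_TERM) ? SESS_TERMINATED : SESS_IN_RECOVERY;
}

/* Runs the two stop calls back to back and returns the final state. */
enum sess_state ord_run(enum stop_flag first, enum stop_flag second)
{
    struct ord_conn c = { 1, SESS_LOGGED_IN };

    ord_stop(&c, first);
    ord_stop(&c, second);
    return c.state;
}
```

In this model, ord_run(FLAG_RECOVER, FLAG_TERM) ends in SESS_IN_RECOVERY while ord_run(FLAG_TERM, FLAG_RECOVER) ends in SESS_TERMINATED, so a NULL check alone would hide the crash without making the outcome deterministic.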
Also, I noticed that the cleanup thread's call to iscsi_stop_conn is made
while holding the ep_mutex, whereas the call made from iscsid is not.
Should the call from iscsid to iscsi_stop_conn be made while holding the
ep_mutex?
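
For reference, this is roughly what serializing both stop paths on one lock would buy, shown as a user-space sketch with a POSIX mutex. The names (lk_conn, stop_lock, lk_other_path) are hypothetical, and the sketch does not claim that ep_mutex is the right lock in the driver. To keep it deterministic, the competing path is run from inside the critical section and uses pthread_mutex_trylock, so it is simply locked out; a real second thread would block on the lock instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct lk_sock {
    int callbacks_restored;
};

struct lk_conn {
    pthread_mutex_t stop_lock;   /* hypothetical lock shared by both stop paths */
    struct lk_sock *sock;
};

/*
 * The competing stop path: it only clears conn->sock if it can take the
 * lock. Returns 1 if it cleared the pointer, 0 if it was locked out.
 */
int lk_other_path(struct lk_conn *c)
{
    if (pthread_mutex_trylock(&c->stop_lock) != 0)
        return 0;
    c->sock = NULL;
    pthread_mutex_unlock(&c->stop_lock);
    return 1;
}

/*
 * Release under the lock. Returns 1 if callbacks were restored, 0 if the
 * socket was already gone. *intruder_ran records whether the competing
 * path managed to run inside the check-then-use window.
 */
int lk_release_conn(struct lk_conn *c, int *intruder_ran)
{
    int done = 0;

    assert(c != NULL && intruder_ran != NULL);

    pthread_mutex_lock(&c->stop_lock);
    if (c->sock != NULL) {
        *intruder_ran = lk_other_path(c);   /* attempt inside the window */
        c->sock->callbacks_restored = 1;    /* sock cannot have changed */
        c->sock = NULL;
        done = 1;
    }
    pthread_mutex_unlock(&c->stop_lock);
    return done;
}
```

With the lock held across both the check and the use, the competing path cannot clear the pointer in between. Which path wins the lock still decides the final state, so this addresses the crash but not the ordering question.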
Thanks in advance,
Adam