On 10/27/2010 05:23 PM, Eddie Wai wrote:
On Wed, 2010-10-27 at 14:57 -0700, Mike Christie wrote:
On 10/25/2010 07:52 PM, Eddie Wai wrote:
The race condition can be observed with the following sequence of events:
- For every active session, issue an ISCSI_ERR_CONN_FAILED nl msg
(This will eventually put all active connections into the actor_list ready
to execute the session_conn_reopen procedure)
- Asynchronously after a few seconds, call the iscsi_host_remove procedure
(This will notify iscsid with the ISCSI_ERR_INVALID_HOST nl message)
The current code actually handles this by moving the INVALID_HOST
session_conn_error actor to the head of the schedule via the
actor_schedule_head procedure. However, if the current actor was called back
from the actor_poll loop, then this head scheduling will actually put the
INVALID_HOST actor onto the poll_list instead of the actor_list.
If there are subsequent elements in the actor_list which trigger the reopen
path, then the conn_context for the INVALID_HOST for that particular
connection will get flushed (+ actor_delete). This will then lock up
libiscsi's iscsi_host_remove call, as it will wait indefinitely for all
sessions to be removed.
The fix is to always put this head-scheduled actor at the head of the
actor_list, regardless of whether we are in poll mode or not. This will allow
all INVALID_HOST actors to have a chance to get executed before any
subsequent reopen actors.
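
To make the ordering problem concrete, here is a minimal sketch in C. It is a
simplified model of the behaviour described above, not the iscsid actor code:
the struct layout, the poll_in_progress flag, and every function name in it
are hypothetical and exist only for illustration. It compares the head
scheduling as described (diverted to the poll_list during a poll pass) with
the fixed behaviour (always the head of the actor_list).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model only; none of these names come from the iscsid sources. */
enum actor_kind { ACTOR_REOPEN, ACTOR_INVALID_HOST };

#define MAX_ACTORS 16

struct actor_queue {
    enum actor_kind items[MAX_ACTORS];
    size_t len;
};

struct scheduler {
    struct actor_queue actor_list; /* actors that run next, in order */
    struct actor_queue poll_list;  /* actors deferred during a poll pass */
    int poll_in_progress;          /* nonzero while the poll loop runs callbacks */
};

static void queue_push_head(struct actor_queue *q, enum actor_kind k)
{
    assert(q->len < MAX_ACTORS);
    /* Shift existing entries right by one to free slot 0. */
    for (size_t i = q->len; i > 0; i--)
        q->items[i] = q->items[i - 1];
    q->items[0] = k;
    q->len++;
}

static void queue_push_tail(struct actor_queue *q, enum actor_kind k)
{
    assert(q->len < MAX_ACTORS);
    q->items[q->len++] = k;
}

/* Behaviour described in the report: during a poll pass the "head"
 * request lands on the poll_list, behind everything in the actor_list. */
static void schedule_head_old(struct scheduler *s, enum actor_kind k)
{
    if (s->poll_in_progress)
        queue_push_head(&s->poll_list, k);
    else
        queue_push_head(&s->actor_list, k);
}

/* Behaviour after the fix: always the head of the actor_list. */
static void schedule_head_fixed(struct scheduler *s, enum actor_kind k)
{
    queue_push_head(&s->actor_list, k);
}

/* Two reopen actors are pending, a poll pass is running, and an
 * INVALID_HOST event arrives. Returns the actor that would run next. */
enum actor_kind next_actor_after_invalid_host(int use_fix)
{
    struct scheduler s = { 0 };

    queue_push_tail(&s.actor_list, ACTOR_REOPEN);
    queue_push_tail(&s.actor_list, ACTOR_REOPEN);
    s.poll_in_progress = 1;

    if (use_fix)
        schedule_head_fixed(&s, ACTOR_INVALID_HOST);
    else
        schedule_head_old(&s, ACTOR_INVALID_HOST);

    return s.actor_list.items[0];
}
```

In this model, next_actor_after_invalid_host(0) yields a reopen actor, which
is the window in which the INVALID_HOST conn_context can be flushed, while
next_actor_after_invalid_host(1) yields the INVALID_HOST actor.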
Patch and explanation make sense. I just cannot figure out the code path
that causes us to get the INVALID_HOST actor scheduled from an actor callback.
I think when I wrote the code, and when I just re-tested it, we went through
event_loop->ctldev_handle->iscsi_sched_conn_context for the INVALID_HOST
handling. So we never came through actor_poll.
We normally see this when we test using the max number of offload
connections (128) running for several hours. The test usually involves
some kind of invasive operation, such as an MTU change or an L2 driver
load/unload, which will put all 128 of the active sessions into the
reopen path and then eventually the INVALID_HOST path.
When the problem happens, what we observed is the following:
- the scheduler shows that it is in the actor_poll path executing the
reopen actors
- while it is polling for the corresponding nl messages from various
operations in the reopen path, an nl message for the INVALID_HOST creeps
in
- this INVALID_HOST message will get processed and actor_schedule_head will
get called.
- after that, the current reopen actor will continue.
Doh, I see it. The reopen's __kipc_call gets an error and calls ctldev_handle.
Thanks, patch merged in commit e362dd2f1ddbb718f06489d0017cf2250079908a