On Wed, 2010-10-27 at 14:57 -0700, Mike Christie wrote:
> On 10/25/2010 07:52 PM, Eddie Wai wrote:
> > The race condition can be observed with the following sequence of events:
> > - For every active session, issue an ISCSI_ERR_CONN_FAILED nl msg
> >   (This will eventually put all active connections onto the actor_list,
> >   ready to execute the session_conn_reopen procedure)
> > - Asynchronously, after a few seconds, call the iscsi_host_remove procedure
> >   (This will notify iscsid with the ISCSI_ERR_INVALID_HOST nl message)
> >
> > The current code actually handles this by advancing the INVALID_HOST
> > session_conn_error actor scheduling to the head via the actor_schedule_head
> > procedure. However, if the current actor thread was called back from the
> > actor_poll loop, then this head scheduling will actually put the
> > INVALID_HOST actor onto the poll_list instead of the actor_list.
> >
> > If there are subsequent elements in the actor_list which trigger the reopen
> > path, then the conn_context for the INVALID_HOST for that particular
> > connection will get flushed (+ actor_delete). This will then lock up
> > libiscsi's iscsi_host_remove call, as it will wait indefinitely for all
> > sessions to be removed.
> >
> > The fix is to put this head scheduling at the head of the actor_list
> > regardless of whether we are in poll mode or not. This will allow all
> > INVALID_HOST actors a chance to get executed before any subsequent
> > reopen actors.
>
> Patch and explanation make sense. I just cannot figure out the code path
> that causes us to get the INVALID_HOST actor scheduled from an actor
> callback.
>
> I think when I wrote the code, and when I just re-tested it, we went from
> event_loop->ctldev_handle->iscsi_sched_conn_context for the INVALID_HOST
> handling. So we never came through actor_poll.
>
We normally see this when we test using the maximum number of offload
connections (128) running for several hours.
The test usually involves some kind of invasive operation, such as an MTU change or an L2 driver load/unload, which puts all 128 active sessions into the reopen path and then eventually into the INVALID_HOST path.
When the problem happens, what we observed is the following:
- The scheduler shows that it is in the actor_poll path, executing the reopen actors.
- While it is polling for the corresponding nl messages from the various operations in the reopen path, an nl message for the INVALID_HOST creeps in.
- This INVALID_HOST message gets processed, and actor_schedule_head gets called.
- After that, the current reopen actor continues. Since the original reopen actor came from the actor_poll path, the actor_schedule_head call for the INVALID_HOST nl message schedules it onto the poll_list instead. Any subsequent reopen actor in the actor_list will then flush all contexts and actor_delete the INVALID_HOST request.

> Is this happening with the discovery offload passthrough patches by any
> chance?
>
No, we see this with the regular open-iscsi tree doing normal offload
connections, but undergoing the test described above.

Thanks,
Eddie

--
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
