Mike Christie wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
>>> Mike Christie wrote:
>>>> Mike Christie wrote:
>>>>> Hannes Reinecke wrote:
>>>>>> Mike Christie wrote:
>>>>>>>> The second patch is the more important one, as it
>>>>>>>> fixes an error during LUN Reset handling in the
>>>>>>>> initiator. When sending a LUN Reset during an
>>>>>>>> ongoing R2T transfer, we're suspending Tx and
>>>>>>>> aborting all _SCSI_ tasks. However, once we're
>>>>>>>> done there we're resuming Tx and the R2T transfer
>>>>>>>> will happily continue. So we should rather be
>>>>>>> This should not be happening. When iscsi_suspend_tx returns the tx
>>>>>>> thread has stopped so we know there are no users accessing the task
>>>>>>> (well, there could be if a target is sending a tmf response then a r2t,
>>>>>>> but if the target is following the rfc there should not be).
>>>>>>>
>>>>>>> So when fail_scsi_tasks calls
>>>>>>>
>>>>>>> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if
>>>>>>> this is the same task) -> __iscsi_put_task
>>>>>>>
>>>>>>> this should be the last put on the task and that should release it
>>>>>>> calling iscsi_free_task which should call cleanup_task to kill any
>>>>>>> pending r2t handling and it would remove it from the requeue list.
>>>>>>>
>>>>>>> If we are sending a data-out for a task that has had fail_scsi_task
>>>>>>> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in
>>>>>>> bigger trouble because the last put should have been called on it and
>>>>>>> we are accessing a bad task.
>>>>>>>
>>>>>> This is the log I'm getting:
>>>>>>
>>>>>>
>>>>>> Jul 29 10:34:48 tyne kernel: session1: iscsi_eh_device_reset LU Reset [sc ffff88007b94d080 lun 6]
>>>>>> Jul 29 10:34:48 tyne kernel: session1: iscsi_exec_task_mgmt_fn tmf set timeout
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: task itt 0x3a lun 6 abort transfer
>>>>>> Jul 29 10:34:48 tyne kernel: session1: mgmtpdu [op 0x2 hdr->itt 0x5d datalen 0]
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: mgmtpdu [itt 0x5d task ffff88007a01fc00] xmit
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: tmf rsp [itt 0x5d] response 0 state 1
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: task itt 0x72 lun 6 abort transfer
>>>>>> Jul 29 10:34:48 tyne kernel: session1: iscsi_suspend_tx suspend Tx
>>>>>> Jul 29 10:34:48 tyne kernel: session1: iscsi_complete_task task itt 0x72 sc ffff88007b5bc580 still active
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: task itt 0x57 lun 6 abort transfer
>>>>>> Jul 29 10:34:48 tyne kernel: connection1:0: task itt 0x59 lun 6 abort transfer
>>>>>> Jul 29 10:34:48 tyne kernel: session1: Tx suspended!
>>>>>>
>>>>>> So we indeed would have continued the R2T task (itt 0x57 and itt
>>>>>> 0x59) even though we've already received a valid TMF response.
>>>>>> So I'm afraid it's us ...
>>>>> Ah, I misunderstood you. I do not think it has to do with the cleanup
>>>>> still leaving r2ts. I am not sure where you are putting printks, but I
>>>>> think it is this:
>>>>>
>>>>>         while (!list_empty(&conn->requeue)) {
>>>>>                 if (conn->session->fast_abort && conn->tmf_state != TMF_INITIAL)
>>>>>                         break;
>>>>>
>>>>> Once the tmf completes, we will start sending data again.
>>>>>
>>>> Ooops. I am too sleepy. Ignore that. I am wrong there.
>>>>
>>> I guess if fast_abort is 0 though, we will hit this problem. And we will
>>> send data-outs when getting tmf responses as well as when we are sending
>>> the tmf.
>>
>>
>> I think the problem is wording like in 10.5.1:
>>
>>    For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST
>>    continue to respond to all valid target transfer tags (received via
>>    R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the
>>    affected task set, even after issuing the task management request.
>>
>> I think in some other doc (probably the one Mathew and Ulrich mentioned)
>> there is wording about doing similar for abort and lu resets.
>>
>> The thing is that I think half of targets want us to respond to r2ts
>> and half do not. This is where the fast_abort comes from. If set then we
>> reply to r2ts and if not set we do not. I think once we get a successful
>
> Fudge. I am really going to bed now. I mean if it is set we do not reply
> to r2ts. If not set then we reply.
>

Actually, I think it's a race condition:

drivers/scsi/libiscsi.c:iscsi_eh_device_reset()
        rc = SUCCESS;
        spin_unlock_bh(&session->lock);

        iscsi_suspend_tx(conn);

So the workqueue thread could wedge in after we've unlocked the session
lock and start sending data even though we're meant to suspend
transmitting here.

Will be trying it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
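
To make the suspected window concrete, here is a minimal single-threaded sketch of the ordering described above. It is a toy model only: struct toy_conn, toy_xmit_worker() and toy_device_reset() are hypothetical names invented for illustration, not libiscsi code, and the flag merely stands in for whatever iscsi_suspend_tx() does internally. The worker_in_window argument is an assumption that lets the interleaving be replayed deterministically instead of relying on real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; these are NOT the libiscsi structures. */
struct toy_conn {
	bool tx_suspended;	/* stands in for the effect of iscsi_suspend_tx() */
	int dataouts_sent;	/* Data-Out PDUs sent after the reset started */
};

/* Stand-in for the xmit worker: sends one Data-Out unless Tx is suspended. */
void toy_xmit_worker(struct toy_conn *conn)
{
	assert(conn != NULL);
	if (!conn->tx_suspended)
		conn->dataouts_sent++;
}

/*
 * Replays the quoted ordering: the session lock is dropped first and Tx is
 * suspended afterwards. worker_in_window selects whether the worker gets
 * scheduled between those two steps.
 */
int toy_device_reset(struct toy_conn *conn, bool worker_in_window)
{
	assert(conn != NULL);
	/* spin_unlock_bh(&session->lock) would happen here */
	if (worker_in_window)
		toy_xmit_worker(conn);	/* slips in before the suspend */
	conn->tx_suspended = true;	/* iscsi_suspend_tx(conn) would happen here */
	toy_xmit_worker(conn);		/* any later run is blocked */
	return conn->dataouts_sent;
}
```

Starting from a zeroed toy_conn, toy_device_reset(&conn, true) returns 1, i.e. one stray Data-Out went out in the gap, while toy_device_reset(&conn, false) returns 0. Whether the real workqueue can actually be scheduled in that gap is exactly what still needs to be confirmed by testing.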