I think the point is that the sg path is not equal to the real IO path. You 
are (currently) never going to get correct error handling from the sg path, 
which is considered out of band.

So what are you trying to do here? Are you really developing software that 
uses the sg path, and you need this to work, or are you testing the error 
handling and decided to use sg? In other words, is this a real bug, or just 
something you think should work but does not?

On Tuesday, April 9, 2019 at 5:36:42 PM UTC-4, Jaesoo Lee wrote:
>
> Hi, 
>
> Thanks for the comments. 
>
> I ran sg_write_same against the sg device and I am seeing the same 
> problem. 
>
> The result is as follows: 
>
> 0. Kernel configs 
> Version: 5.1-rc1 
> Boot parameter: dm_mod.use_blk_mq=Y scsi_mod.use_blk_mq=Y 
>
> 1. Normal state 
> : (As expected) The command succeeded 
>
> $ sg_write_same --lba=100 --xferlen=512 /dev/sg5 
> $ 
>
> 2. Immediately after bringing down the iSCSI interface at the target 
> : (As expected) Failed with DID_TRANSPORT_DISRUPTED after a few seconds 
>
> $ sg_write_same --lba=100 --xferlen=512 /dev/sg5 
> Write same: transport: Host_status=0x0e [DID_TRANSPORT_DISRUPTED] 
> Driver_status=0x00 [DRIVER_OK, SUGGEST_OK] 
>
> Write same(10) command failed 
>
> 3. Immediately after the DID_TRANSPORT_DISRUPTED error 
> : (BUG) The command succeeded after a few seconds 
>
> $ sg_write_same --lba=100 --xferlen=512 /dev/sg5 
> $ 
>
> : Kernel logs 
> Apr  8 18:28:03 init21-16 kernel: session1: session recovery timed out 
> after 10 secs 
> Apr  8 18:28:03 init21-16 kernel: sd 8:0:0:5: rejecting I/O to offline 
> device 
>
> 4. Issued IO again 
> : (As expected) The command failed 
>
> $ sg_write_same --lba=100 --xferlen=512 /dev/sg5 
> Write same: pass through os error: No such device or address 
> Write same(10) command failed 
>
> Let me upload my fix for this; please give me your comments on it. 
>
> Thanks, 
>
> Jaesoo Lee. 
>
> ---------- Forwarded message --------- 
> From: Douglas Gilbert 
> Date: Wed, Apr 3, 2019 at 2:06 PM 
> Subject: Re: [bug report][scsi blk_mq] data corruption due to a bug in 
> SCSI LLD error handling 
> To: <...>
>
>
> On 2019-04-03 4:18 p.m., Jaesoo Lee wrote: 
> > Hello All, 
> > 
> > I encountered this bug while trying to enable dm_blk_mq for our 
> > iSCSI/FCP targets. 
> > 
> > The bug is that SG_IO ioctls issued to scsi_blk_mq succeed even when 
> > the LLD wants to error out those requests. 
> > 
> > Let me explain the scenario in more detail. 
> > 
> > Setup: 
> > 0. Host kernel configuration 
> > - 4.19.9, 4.20.16 
> > - boot parameter: dm_mod.use_blk_mq=Y scsi_mod.use_blk_mq=Y 
> > scsi_transport_iscsi.debug_session=1 scsi_transport_iscsi.debug_conn=1 
> > 
> > Scenario: 
> > 1. Connect the host to iSCSI target via four paths 
> > : A dm device is created for those target devices 
> > 2. Start an application on the host that issues SG_IO ioctls for 
> > XCOPY and WSAME to the dm device at a ratio of around 50% 
> > (pread/pwrite for the rest). 
> > 3. Perform system crash (sysrq-trigger) in the iSCSI target 
> > 
> > Expected result: 
> > - Any outstanding IOs should fail with errors 
> > 
> > Actual results: 
> > - Normal read/write IOs get failed as expected 
> > - SG_IO ioctls SUCCEEDED!! 
>
> Not all ioctl(SG_IO)s are created equal! 
>
> If you are using the sg v3 interface (i.e. struct sg_io_hdr) then I would 
> expect DRIVER_TIMEOUT in sg_io_obj.driver_status or DID_TIME_OUT in 
> sg_io_obj.host_status to be set on completion. [BTW You will _not_ see 
> an ETIMEDOUT errno; only errors prior to submission yield errno-style 
> errors.] 
>
> If you don't see that with ioctl(SG_IO) on a block device, then try again 
> on an sg device. If neither reports it, then the mid-level error 
> processing is broken. 
>
> Doug Gilbert 
>
>
> > - log message: 
> > [Tue Apr  2 11:26:34 2019]  session3: session recovery timed out after 
> 11 secs 
> > [Tue Apr  2 11:26:34 2019]  session3: session_recovery_timedout: 
> > Unblocking SCSI target 
> > .. 
> > [Tue Apr  2 11:26:34 2019] sd 8:0:0:8: scsi_prep_state_check: 
> > rejecting I/O to offline device 
> > [Tue Apr  2 11:26:34 2019] sd 8:0:0:8: scsi_prep_state_check: 
> > rejecting I/O to offline device 
> > [Tue Apr  2 11:26:34 2019] sd 8:0:0:8: scsi_prep_state_check: 
> > rejecting I/O to offline device 
> > [Tue Apr  2 11:26:34 2019] print_req_error: I/O error, dev sdi, sector 
> 30677580 
> > [Tue Apr  2 11:26:34 2019] device-mapper: multipath: Failing path 8:128. 
> > [Tue Apr  2 11:26:34 2019] SG_IO disk=sdi, result=0x0 
> > 
> > - This causes the DATA corruption for the application 
> > 
> > Relevant call stack: (SG_IO issue path) 
> > [Tue Apr  2 11:26:33 2019] sd 8:0:0:8: [sdi] sd_ioctl: disk=sdi, 
> cmd=0x2285 
> > [Tue Apr  2 11:26:33 2019] SG_IO disk=sdi, retried 1 cmd 93 
> > [Tue Apr  2 11:26:33 2019] CPU: 30 PID: 16080 Comm: iostress Not 
> > tainted 4.19.9-purekernel_dbg.x86_64+ #30 
> > [Tue Apr  2 11:26:33 2019] Hardware name:  /0JP31P, BIOS 2.0.19 
> 08/29/2013 
> > [Tue Apr  2 11:26:33 2019] Call Trace: 
> > [Tue Apr  2 11:26:33 2019]  dump_stack+0x63/0x85 
> > [Tue Apr  2 11:26:33 2019]  sg_io+0x41e/0x480 
> > [Tue Apr  2 11:26:33 2019]  scsi_cmd_ioctl+0x297/0x420 
> > [Tue Apr  2 11:26:33 2019]  ? sdev_prefix_printk+0xe9/0x120 
> > [Tue Apr  2 11:26:33 2019]  ? _cond_resched+0x1a/0x50 
> > [Tue Apr  2 11:26:33 2019]  scsi_cmd_blk_ioctl+0x51/0x70 
> > [Tue Apr  2 11:26:33 2019]  sd_ioctl+0x95/0x1a0 [sd_mod] 
> > [Tue Apr  2 11:26:33 2019]  __blkdev_driver_ioctl+0x25/0x30 
> > [Tue Apr  2 11:26:33 2019]  dm_blk_ioctl+0x79/0x100 [dm_mod] 
> > [Tue Apr  2 11:26:33 2019]  blkdev_ioctl+0x89a/0x940 
> > [Tue Apr  2 11:26:33 2019]  ? do_nanosleep+0xae/0x180 
> > [Tue Apr  2 11:26:33 2019]  block_ioctl+0x3d/0x50 
> > [Tue Apr  2 11:26:33 2019]  do_vfs_ioctl+0xa6/0x600 
> > [Tue Apr  2 11:26:33 2019]  ? __audit_syscall_entry+0xe9/0x130 
> > [Tue Apr  2 11:26:33 2019]  ksys_ioctl+0x6d/0x80 
> > [Tue Apr  2 11:26:33 2019]  __x64_sys_ioctl+0x1a/0x20 
> > [Tue Apr  2 11:26:33 2019]  do_syscall_64+0x60/0x180 
> > [Tue Apr  2 11:26:33 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> > 
> > According to my analysis, there is a bug in propagating errors on 
> > iSCSI session timeout (i.e., session_recovery_timedout). By contrast, 
> > the legacy SCSI queue handles these errors in scsi_request_fn by 
> > killing the request, as can be seen below. 
> > 
> > static void scsi_request_fn(struct request_queue *q) 
> >         __releases(q->queue_lock) 
> >         __acquires(q->queue_lock) 
> > { 
> > .. 
> >         for (;;) { 
> > .. 
> >                 req = blk_peek_request(q); 
> >                 if (!req) 
> >                         break; 
> > 
> >                 if (unlikely(!scsi_device_online(sdev))) { 
> >                         sdev_printk(KERN_ERR, sdev, 
> >                                     "scsi_request_fn: rejecting I/O to offline device\n"); 
> >                         scsi_kill_request(req, q); 
> >                         continue; 
> >                 } 
> > 
> > I am not sure in which layer we have to fix this, LLD or 
> scsi_queue_rq(). 
> > 
> > Could someone please take a look? Or, could someone guide me on how to 
> > fix this bug? 
> > 
> > Thanks, 
> > 
> > Jaesoo Lee. 
> > 
>

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
Visit this group at https://groups.google.com/group/open-iscsi.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/open-iscsi/68eeea59-fbe9-413d-bf24-e3523b69d75f%40googlegroups.com.