On Mon, Sep 7, 2009 at 1:58 PM, Vladislav Bolkhovitin<[email protected]> wrote: > Chris Worley, on 09/06/2009 05:41 PM wrote: >> >> On Sun, Sep 6, 2009 at 3:36 PM, Chris Worley<[email protected]> wrote: >>> >>> On Sun, Sep 6, 2009 at 3:17 PM, Bart Van Assche<[email protected]> >>> wrote: >>>> >>>> On Fri, Sep 4, 2009 at 1:20 AM, Chris Worley <[email protected]> wrote: >>>>> >>>>> On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<[email protected]> wrote: >>>>>> >>>>>> I've used a couple of initiators (different systems) w/ different >>>>>> OSes, w/ different IB cards (all QDR) and different IB stacks >>>>>> (built-in vs. OFED) and can repeat the problem in all but the >>>>>> RHEL5.2/OFED 1.4.1 target and initiator (but, if the initiator is >>>>>> WinOF and the target is RHEL5.2/OFED1.4.1, then the problem does >>>>>> repeat). >>>>> >>>>> Here's a twist: I used the Ubuntu initiator w/ one of the RHEL >>>>> targets, and the RHEL initiator (same machine as was running WinOF >>>>> from the beginning of this thread) w/ one of the Ubuntu targets: in >>>>> both cases, the problem does not repeat. >>>>> >>>>> That makes it sound like OFED is the cure on either side of the >>>>> connection, but does not explain the issue w/ WinOF (which does fail >>>>> w/ either Ununtu or RHEL targets). >>>> >>>> These results are strange. Regarding the Linux-only tests, I was >>>> assuming failure of a single component (Ubuntu SRP initiator, OFED SRP >>>> initiator, Ubuntu IB driver, OFED IB driver or SRP target), but for >>>> each of these components there is at least one test that passes and at >>>> least one test that fails. So either my assumption is wrong or one of >>>> the above test results is not repeatable. Do you have the time to >>>> repeat the Linux-only tests ? >>> >>> Last night I was rerunning the RHEL5.2 initiator w/ Ubuntu client, and >>> the problem repeated; now, I can't repeat the case where it didn't >>> fail. Still, no errors, other than the eventual timeouts previously >>> shown; the target thinks all is fine, the initiator is stuck. >> >> ... and I haven't had any success w/ Ubuntu target and initiator, 8.10 or >> 9.04. > > 1. Try with kernel parameter maxcpus=1. It will somehow relax possible races > you have, although not completely.
I'll give it a try if/when I can use my systems again. I've already done the lock testing as suggested, with no problems. It really seems like a protocol issue... something is getting dropped (maybe by the IB hardware), and either no retries or the target rejects the retries. > > 2. Try with another hardware, including motherboard. You can have something > like http://lkml.org/lkml/2007/7/31/558 (not exactly it, of course) I'm wondering why it's so easily repeatable by me, and those I work with, and nobody else? I have another completely different configuration w/ the same issue... Today I was running on a dual-socket NHM and an eight-socket Opteron boxes. Both were running RHEL 5.3 w/ 2.6.18-128.7.1.0.1.el5, with all the SCST recommended modifications (which I haven't been running on my systems) and OFED 1.4.1. The Opteron was the target. The apps on the initiator don't need great speed, just very low latency. The app run time is ~4 hours... 2.5hours in, the hang occurs... same symptom, although this isn't raw disks running fio benchmarks... these are LVM striped (4MB) partitions w/ ext3 fs'es on top. The average block sizes as seen from the target are between 64KB and 128KB... not the 8KB and smaller I've been talking about in other tests, although all I can tell is what iostat shows me, and it just displays averages. But, the same issue occurs... the apps on the initiator hang, and the target thinks all is well. An app will hang in one of the file systems... the others seem to be working well (even though they are comprised of the same drives as the hung fs/app), for example: you can do a "find ." from their root w/o hanging "find", but if you try that in the fs where the app is hung, "find" will hang. Lvscan/pvscan will hang too. Strangely, restarting the target (removing ib_srpd and scst_vdisk modules, then re-registering the disks with scst_vdisk and re-modprobing ib_srpt from scratch) causes the apps on the initiator to un-hang and make progress again... (but eventually hang again... seemingly more readily than before). While nothing other than the messages you'd expect (from re-registering the drives to the initiator logging in) occur on the target, the initiator has much to say during this re-registration period, starting w/ the time-out (that has been shown previously): Sep 8 22:04:07 nameme kernel: sd 30:0:0:3: timing out command, waited 360s Sep 8 22:04:07 nameme kernel: sd 30:0:0:3: SCSI error: return code = 0x06000000 Sep 8 22:04:07 nameme kernel: end_request: I/O error, dev sdo, sector 45304704 Sep 8 22:04:16 nameme kernel: sd 29:0:0:8: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 29:0:0:8: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdj, sector 133535344 Sep 8 22:04:16 nameme kernel: sd 29:0:0:1: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 29:0:0:1: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdc, sector 55150000 Sep 8 22:04:16 nameme kernel: sd 29:0:0:5: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 29:0:0:5: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdg, sector 110218720 Sep 8 22:04:16 nameme kernel: sd 29:0:0:6: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 29:0:0:6: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdh, sector 133521792 Sep 8 22:04:16 nameme kernel: sd 30:0:0:9: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 30:0:0:9: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdu, sector 123498408 Sep 8 22:04:16 nameme kernel: sd 29:0:0:7: timing out command, waited 360s Sep 8 22:04:16 nameme kernel: sd 29:0:0:7: SCSI error: return code = 0x06000000 Sep 8 22:04:16 nameme kernel: end_request: I/O error, dev sdi, sector 61892872 Sep 8 22:04:26 nameme kernel: sd 29:0:0:0: timing out command, waited 360s Sep 8 22:04:26 nameme kernel: sd 29:0:0:0: SCSI error: return code = 0x06000000 Sep 8 22:04:26 nameme kernel: end_request: I/O error, dev sdb, sector 111165032 Sep 8 22:04:35 nameme kernel: sd 29:0:0:3: timing out command, waited 360s Sep 8 22:04:35 nameme kernel: sd 29:0:0:3: SCSI error: return code = 0x06000000 Sep 8 22:04:35 nameme kernel: end_request: I/O error, dev sde, sector 110192456 Sep 8 22:04:35 nameme kernel: sd 29:0:0:4: timing out command, waited 360s Sep 8 22:04:35 nameme kernel: sd 29:0:0:4: SCSI error: return code = 0x06000000 Sep 8 22:04:35 nameme kernel: end_request: I/O error, dev sdf, sector 81712464 Sep 8 22:04:45 nameme kernel: sd 29:0:0:9: timing out command, waited 360s Sep 8 22:04:45 nameme kernel: sd 29:0:0:9: SCSI error: return code = 0x06000000 Sep 8 22:04:45 nameme kernel: end_request: I/O error, dev sdk, sector 133530608 Sep 8 22:04:45 nameme kernel: sd 30:0:0:4: timing out command, waited 360s Sep 8 22:04:45 nameme kernel: sd 30:0:0:4: SCSI error: return code = 0x06000000 Sep 8 22:04:45 nameme kernel: end_request: I/O error, dev sdp, sector 133527104 Sep 8 22:04:45 nameme kernel: sd 30:0:0:8: timing out command, waited 360s Sep 8 22:04:45 nameme kernel: sd 30:0:0:8: SCSI error: return code = 0x06000000 Sep 8 22:04:45 nameme kernel: end_request: I/O error, dev sdt, sector 133534080 Sep 8 22:04:45 nameme kernel: sd 30:0:0:5: timing out command, waited 360s Sep 8 22:04:45 nameme kernel: sd 30:0:0:5: SCSI error: return code = 0x06000000 Sep 8 22:04:45 nameme kernel: end_request: I/O error, dev sdq, sector 133525384 Sep 8 22:04:45 nameme kernel: sd 30:0:0:6: timing out command, waited 360s Sep 8 22:04:45 nameme kernel: sd 30:0:0:6: SCSI error: return code = 0x06000000 Sep 8 22:04:45 nameme kernel: end_request: I/O error, dev sdr, sector 136997696 Sep 8 22:04:50 nameme kernel: sd 30:0:0:0: timing out command, waited 360s Sep 8 22:04:50 nameme kernel: sd 30:0:0:0: SCSI error: return code = 0x06000000 Sep 8 22:04:50 nameme kernel: end_request: I/O error, dev sdl, sector 133540752 Sep 8 22:05:24 nameme kernel: sd 29:0:0:2: timing out command, waited 360s Sep 8 22:05:24 nameme kernel: sd 29:0:0:2: SCSI error: return code = 0x06000000 Sep 8 22:05:24 nameme kernel: end_request: I/O error, dev sdd, sector 138572160 Sep 8 22:06:37 nameme kernel: host27: ib_srp: DREQ received - connection closed Sep 8 22:06:37 nameme kernel: host28: ib_srp: DREQ received - connection closed Sep 8 22:06:37 nameme kernel: host29: ib_srp: DREQ received - connection closed Sep 8 22:06:37 nameme kernel: host30: ib_srp: DREQ received - connection closed Sep 8 22:06:37 nameme kernel: host31: ib_srp: DREQ received - connection closed Sep 8 22:06:37 nameme kernel: host32: ib_srp: DREQ received - connection closed Sep 8 22:06:39 nameme kernel: host27: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host27: add qp_in_err timer Sep 8 22:06:39 nameme kernel: host28: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host28: add qp_in_err timer Sep 8 22:06:39 nameme kernel: host29: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host29: add qp_in_err timer Sep 8 22:06:39 nameme kernel: host30: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host30: add qp_in_err timer Sep 8 22:06:39 nameme kernel: host31: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host31: add qp_in_err timer Sep 8 22:06:39 nameme kernel: host32: ib_srp: connection closed Sep 8 22:06:39 nameme kernel: ib_srp: host32: add qp_in_err timer Sep 8 22:07:04 nameme kernel: host27: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host27: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:07:04 nameme kernel: host28: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host28: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:07:04 nameme kernel: host29: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host29: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:07:04 nameme kernel: host30: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host30: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:07:04 nameme kernel: host31: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host31: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:07:04 nameme kernel: host32: ib_srp: srp_qp_in_err_timer called Sep 8 22:07:04 nameme kernel: host32: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:10 nameme kernel: host27: ib_srp: DREQ received - connection closed Sep 8 22:09:10 nameme kernel: host28: ib_srp: DREQ received - connection closed Sep 8 22:09:10 nameme kernel: host29: ib_srp: DREQ received - connection closed Sep 8 22:09:10 nameme kernel: host30: ib_srp: DREQ received - connection closed Sep 8 22:09:10 nameme kernel: host31: ib_srp: DREQ received - connection closed Sep 8 22:09:10 nameme kernel: host32: ib_srp: DREQ received - connection closed Sep 8 22:09:12 nameme kernel: host27: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host27: add qp_in_err timer Sep 8 22:09:12 nameme kernel: host28: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host28: add qp_in_err timer Sep 8 22:09:12 nameme kernel: host29: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host29: add qp_in_err timer Sep 8 22:09:12 nameme kernel: host30: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host30: add qp_in_err timer Sep 8 22:09:12 nameme kernel: host31: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host31: add qp_in_err timer Sep 8 22:09:12 nameme kernel: host32: ib_srp: connection closed Sep 8 22:09:12 nameme kernel: ib_srp: host32: add qp_in_err timer Sep 8 22:09:37 nameme kernel: host27: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host27: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:37 nameme kernel: host28: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host28: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:37 nameme kernel: host29: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host29: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:37 nameme kernel: host30: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host30: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:37 nameme kernel: host31: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host31: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:09:37 nameme kernel: host32: ib_srp: srp_qp_in_err_timer called Sep 8 22:09:37 nameme kernel: host32: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:10:36 nameme kernel: host27: ib_srp: DREQ received - connection closed Sep 8 22:10:36 nameme kernel: host28: ib_srp: DREQ received - connection closed Sep 8 22:10:36 nameme kernel: host29: ib_srp: DREQ received - connection closed Sep 8 22:10:36 nameme kernel: host30: ib_srp: DREQ received - connection closed Sep 8 22:10:36 nameme kernel: host31: ib_srp: DREQ received - connection closed Sep 8 22:10:36 nameme kernel: host32: ib_srp: DREQ received - connection closed Sep 8 22:10:39 nameme kernel: host27: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host27: add qp_in_err timer Sep 8 22:10:39 nameme kernel: host28: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host28: add qp_in_err timer Sep 8 22:10:39 nameme kernel: host29: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host29: add qp_in_err timer Sep 8 22:10:39 nameme kernel: host30: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host30: add qp_in_err timer Sep 8 22:10:39 nameme kernel: host31: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host31: add qp_in_err timer Sep 8 22:10:39 nameme kernel: host32: ib_srp: connection closed Sep 8 22:10:39 nameme kernel: ib_srp: host32: add qp_in_err timer Sep 8 22:11:04 nameme kernel: host27: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host27: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:11:04 nameme kernel: host28: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host28: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:11:04 nameme kernel: host29: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host29: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:11:04 nameme kernel: host30: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host30: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:11:04 nameme kernel: host31: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host31: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:11:04 nameme kernel: host32: ib_srp: srp_qp_in_err_timer called Sep 8 22:11:04 nameme kernel: host32: ib_srp: srp_qp_in_err_timer flushed reset - done Here is the start of another hang, un-hung by restarting the target, Sep 8 22:32:33 nameme kernel: sd 29:0:0:7: timing out command, waited 360s Sep 8 22:32:33 nameme kernel: sd 29:0:0:7: SCSI error: return code = 0x06000000 Sep 8 22:32:33 nameme kernel: end_request: I/O error, dev sdi, sector 123549288 Sep 8 22:32:33 nameme kernel: device-mapper: multipath: Failing path 8:128. Sep 8 22:32:33 nameme kernel: sd 29:0:0:8: timing out command, waited 360s Sep 8 22:32:33 nameme kernel: sd 29:0:0:8: SCSI error: return code = 0x06000000 Sep 8 22:32:33 nameme kernel: end_request: I/O error, dev sdj, sector 123538312 Sep 8 22:32:33 nameme kernel: device-mapper: multipath: Failing path 8:144. Sep 8 22:38:33 nameme kernel: sd 29:0:0:8: timing out command, waited 360s Sep 8 22:38:33 nameme kernel: sd 29:0:0:8: SCSI error: return code = 0x06000000 Sep 8 22:38:33 nameme kernel: end_request: I/O error, dev sdj, sector 140607872 Sep 8 22:38:33 nameme kernel: sd 29:0:0:7: timing out command, waited 360s Sep 8 22:38:33 nameme kernel: sd 29:0:0:7: SCSI error: return code = 0x06000000 Sep 8 22:38:33 nameme kernel: end_request: I/O error, dev sdi, sector 123549416 Sep 8 22:38:33 nameme kernel: sd 29:0:0:0: timing out command, waited 360s Sep 8 22:38:33 nameme kernel: sd 29:0:0:0: SCSI error: return code = 0x06000000 Sep 8 22:38:33 nameme kernel: end_request: I/O error, dev sdb, sector 123535744 Sep 8 22:38:33 nameme kernel: device-mapper: multipath: Failing path 8:16. Sep 8 22:38:33 nameme kernel: sd 29:0:0:2: timing out command, waited 360s Sep 8 22:38:33 nameme kernel: sd 29:0:0:2: SCSI error: return code = 0x06000000 Sep 8 22:38:33 nameme kernel: end_request: I/O error, dev sdd, sector 123535744 Sep 8 22:38:33 nameme kernel: device-mapper: multipath: Failing path 8:48. Sep 8 22:38:33 nameme kernel: sd 29:0:0:6: timing out command, waited 360s Sep 8 22:38:33 nameme kernel: sd 29:0:0:6: SCSI error: return code = 0x06000000 Sep 8 22:38:33 nameme kernel: end_request: I/O error, dev sdh, sector 123534960 Sep 8 22:38:33 nameme kernel: device-mapper: multipath: Failing path 8:112. Sep 8 22:40:39 nameme kernel: sd 29:0:0:3: timing out command, waited 360s Sep 8 22:40:39 nameme kernel: sd 29:0:0:3: SCSI error: return code = 0x06000000 Sep 8 22:40:39 nameme kernel: end_request: I/O error, dev sde, sector 123559464 Sep 8 22:40:39 nameme kernel: device-mapper: multipath: Failing path 8:64. Sep 8 22:41:32 nameme kernel: host27: ib_srp: DREQ received - connection closed Sep 8 22:41:32 nameme kernel: host28: ib_srp: DREQ received - connection closed Sep 8 22:41:32 nameme kernel: host29: ib_srp: DREQ received - connection closed Sep 8 22:41:32 nameme kernel: host30: ib_srp: DREQ received - connection closed Sep 8 22:41:32 nameme kernel: host31: ib_srp: DREQ received - connection closed Sep 8 22:41:32 nameme kernel: host32: ib_srp: DREQ received - connection closed Sep 8 22:41:34 nameme kernel: host27: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host27: add qp_in_err timer Sep 8 22:41:34 nameme kernel: host28: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host28: add qp_in_err timer Sep 8 22:41:34 nameme kernel: host29: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host29: add qp_in_err timer Sep 8 22:41:34 nameme kernel: host30: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host30: add qp_in_err timer Sep 8 22:41:34 nameme kernel: host31: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host31: add qp_in_err timer Sep 8 22:41:34 nameme kernel: host32: ib_srp: connection closed Sep 8 22:41:34 nameme kernel: ib_srp: host32: add qp_in_err timer Sep 8 22:41:59 nameme kernel: host27: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host27: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:41:59 nameme kernel: host28: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host28: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:41:59 nameme kernel: host29: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host29: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:41:59 nameme kernel: host30: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host30: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:41:59 nameme kernel: host31: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host31: ib_srp: srp_qp_in_err_timer flushed reset - done Sep 8 22:41:59 nameme kernel: host32: ib_srp: srp_qp_in_err_timer called Sep 8 22:41:59 nameme kernel: host32: ib_srp: srp_qp_in_err_timer flushed reset - done If I get a chance to work w/ my systems again, I'll see if this re-registration helps there too. Chris > >> Chris >>> >>> Chris >>>> >>>> Bart. >>>> > > _______________________________________________ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
