On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<[email protected]> wrote:
> On Thu, Sep 3, 2009 at 5:32 AM, Vladislav Bolkhovitin<[email protected]> wrote:
>> Chris Worley, on 09/03/2009 08:08 AM wrote:
>>>
>>> On Wed, Sep 2, 2009 at 2:58 PM, Chris Worley<[email protected]> wrote:
>>>>
>>>> On Wed, Sep 2, 2009 at 2:00 PM, Bart Van Assche<[email protected]>
>>>> wrote:
>>>>>
>>>>> On Wed, Sep 2, 2009 at 9:53 PM, Chris Worley<[email protected]> wrote:
>>>>>>
>>>>>> On Wed, Sep 2, 2009 at 1:31 PM, Bart Van
>>>>>> Assche<[email protected]> wrote:
>>>>>>>
>>>>>>> On Tue, Sep 1, 2009 at 1:04 AM, Chris Worley<[email protected]> wrote:
>>>>>>>>
>>>>>>>> [ ... ]
>>>>>>>> I've found a good kernel/scst mix to easily repeat this; I can get it
>>>>>>>> to repeatedly hang w/ 8K block transfers running Ubuntu 9.04 w/ the
>>>>>>>> 2.6.27-14-server kernel on _both_ target and initiator (i.e. no WinOF
>>>>>>>> or OFED at all) and SCST rev 1062 on the target using one drive
>>>>>>>> (performance is >600MB/s, >80K IOPS, on the 8KB block sizes being
>>>>>>>> used).
>>>>>>>> [ ... ]
>>>>>>>
>>>>>>> Is there a special reason why you are using the 2.6.27-14-server
>>>>>>> kernel ? AFAIK the latest Ubuntu 9.04 kernel is 2.6.28-15-server.
>>>>>>
>>>>>> No special reason other than it didn't get upgraded w/ the rest of the
>>>>>> distro... started w/ 8.10.
>>>>
>>>> I'm upgrading too, to 9.04.
>>>
>>> I tried the 2.6.28-15-server kernel (along w/ the 9.04 upgrade), and
>>> it does repeat the issue.
>>>
>>> In trying to build a kernel w/ lockdep support as Vlad requested, my
>>> lack of Debian knowledge shone through, and, although I believe I
>>> followed all the instructions correctly, I'm not sure if I have a
>>> 2.6.28-15 or 2.6.28-10 kernel. Anyway, the issue is still repeatable.
>>>
>>> Whatever kernel that is, I have SRP hung currently. What should I
>>> look for in /proc/lockd*?
>>>
>>> I don't think it's a kernel lock... I think it's a protocol lock, as I
>>> can rmmod the target kernel modules (scst_vdisk, scst, and ib_srpt)
>>> when the initiator gets in this state.
>>
>> Since you can rmmod SCST modules, then this shouldn't be SCST or backstorage
>> SW/HW issue, because that means there are no stuck or lost SCSI commands.
>
> At least on the target side. The initiator could think there are
> outstanding commands, when they were actually lost on the target (or
> the target completed them, and the initiator is in error not thinking
> they are completed).
>
>> So, it should be issue of either SRP target/initiator, or OFED on the target
>> or initiator, or your IB hardware on any node.
>
> I've used a couple of initiators (different systems) w/ different
> OSes, w/ different IB cards (all QDR) and different IB stacks
> (built-in vs. OFED) and can repeat the problem in all but the
> RHEL5.2/OFED 1.4.1 target and initiator (but, if the initiator is
> WinOF and the target is RHEL5.2/OFED1.4.1, then the problem does
> repeat).

Here's a twist: I used the Ubuntu initiator w/ one of the RHEL
targets, and the RHEL initiator (same machine as was running WinOF
from the beginning of this thread) w/ one of the Ubuntu targets: in
both cases, the problem does not repeat. That makes it sound like
OFED is the cure on either side of the connection, but that does not
explain the issue w/ WinOF (which does fail w/ either Ubuntu or RHEL
targets).

Chris

>
>>
>> You should enable lockdep on both target and initiator (better with other
>> kernel debug facilities enabled, see the attached file as a sample) and
>> reproduce the issue.
>
> That's done and reported in another response; it doesn't seem to be a
> lock issue.
>
>> There is a big chance that those facilities will spot
>> what's going on wrong there.
>
> I applied the .config changes you suggested, and the kernel was
> certainly more verbose, but I don't think added any information. When
> the drives are attached over SRP, I see the following message:
>
> [ 454.317328] sd 4:0:0:3: [sde] Attached SCSI disk
> [ 454.317340] kobject: 'scsi_device' (ffff8804234a3aa0): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
> [ 454.317350] kobject: '4:0:0:3' (ffff880423cd2780): kobject_add_internal: parent: 'scsi_device', set: 'devices'
> [ 454.317378] kobject: '4:0:0:3' (ffff880423cd2780): kobject_uevent_env
> [ 454.317390] kobject: '4:0:0:3' (ffff880423cd2780): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_device/4:0:0:3'
> [ 454.317437] kobject: 'scsi_generic' (ffff8804234a3c38): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
> [ 454.317447] kobject: 'sg5' (ffff88042ac4ecb8): kobject_add_internal: parent: 'scsi_generic', set: 'devices'
> [ 454.317489] kobject: 'sg5' (ffff88042ac4ecb8): kobject_uevent_env
> [ 454.317500] kobject: 'sg5' (ffff88042ac4ecb8): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_generic/sg5'
> [ 454.317523] sd 4:0:0:3: Attached scsi generic sg5 type 0
>
> Is there somewhere else to look for problems?
>
> Thanks,
>
> Chris
>>
>> Vlad
>>
>>> Thanks,
>>>
>>> Chris
>>>>
>>>> Chris
>>>>>>
>>>>>> Do you think that kernel is better?
>>>>>
>>>>> I noticed this while trying to reproduce this issue. I have no opinion
>>>>> yet about which of these two kernels is better. I'll downgrade the
>>>>> Ubuntu kernel in my setup.
>>>>>
>>>>> Bart.
>>>>>
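
To keep the combinations straight, the results reported in this thread can be tabulated and checked against the "OFED on either side is the cure" idea. This is only a rough sketch: the labels and the helper names (OBSERVED, predicts_hang, counterexamples) are made up for illustration, and it assumes the Ubuntu nodes run the built-in IB stack while the RHEL nodes run OFED 1.4.1.

```python
# Observations as reported in this thread: (initiator, target) -> hang seen?
# Assumption: "Ubuntu" nodes use the built-in IB stack and "RHEL" nodes
# use OFED 1.4.1; the labels themselves are only illustrative.
OBSERVED = {
    ("Ubuntu/built-in", "Ubuntu/built-in"): True,
    ("RHEL/OFED", "RHEL/OFED"): False,
    ("Ubuntu/built-in", "RHEL/OFED"): False,
    ("RHEL/OFED", "Ubuntu/built-in"): False,
    ("WinOF", "Ubuntu/built-in"): True,
    ("WinOF", "RHEL/OFED"): True,
}


def predicts_hang(initiator, target):
    """Hypothesis: the hang only appears when neither side runs OFED."""
    return "OFED" not in initiator and "OFED" not in target


def counterexamples(observed):
    """Return the (initiator, target) pairs the hypothesis gets wrong."""
    return sorted(
        pair for pair, hung in observed.items()
        if predicts_hang(*pair) != hung
    )


if __name__ == "__main__":
    for pair in counterexamples(OBSERVED):
        print("hypothesis fails for initiator=%s target=%s" % pair)
```

With the results as reported, the only pair the hypothesis gets wrong is the WinOF initiator against the RHEL/OFED target, which is the case that still needs explaining.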
