Re: Intel NVMe troubles?

2016-09-13 Thread Jim Harris
On Monday, September 12, 2016, Borja Marcos <bor...@sarenet.es> wrote:

>
> > On 12 Sep 2016, at 17:23, Jim Harris <jim.har...@gmail.com> wrote:
> >
> > There is an updated DCT 3.0.2 at:
> > https://downloadcenter.intel.com/download/26221/Intel-SSD-Data-Center-Tool
> > which has a fix for this issue.
> >
> > Borja has already downloaded this update and confirmed it looks good so
> > far.  Posting the update and results here so it is archived on the STABLE
> > mailing list.
>
> Is it just my imagination or has trim performance improved dramatically?
> I am unable to reproduce the I/O stalls that I observed after running some
> simultaneous Bonnie++ benchmarks with large files.


Yes - Trim performance should be significantly faster with the FW included
in the 3.0.2 DCT release.

Jim


>
> Thanks,
>
>
>
> Borja.
>
>
>
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: Intel NVMe troubles?

2016-09-12 Thread Jim Harris
On Mon, Aug 1, 2016 at 11:49 AM, Jim Harris <jim.har...@gmail.com> wrote:

>
>
> On Mon, Aug 1, 2016 at 7:38 AM, Borja Marcos <bor...@sarenet.es> wrote:
>
>>
>> >
>> > It looks like all of the TRIM commands are formatted properly.  The
>> failures do not happen until about 10 seconds after the last TRIM to each
>> drive was submitted, and immediately before TRIMs start to the next drive,
>> so I'm assuming the failures are for the last few TRIM commands but
>> cannot say for sure.  Could you apply patch v2 (attached) which will dump
>> the TRIM payload contents inline with the failure messages?
>>
>> Sure, this is the complete /var/log/messages starting with the system
>> boot. Before booting I destroyed the pool
>> so that you could capture what happens when booting, zpool create, etc.
>>
>> Remember that the drives are in LBA format #3 (4 KB blocks). As far as I
>> know that’s preferred to the old 512 byte blocks.
>>
>> Thank you very much and sorry about the belated response.
>
>
> Hi Borja,
>
> Thanks for the additional testing.  This has all of the detail that I need
> for now.
>
> -Jim
>
>
>
There is an updated DCT 3.0.2 at:
https://downloadcenter.intel.com/download/26221/Intel-SSD-Data-Center-Tool
which has a fix for this issue.

Borja has already downloaded this update and confirmed it looks good so
far.  Posting the update and results here so it is archived on the STABLE
mailing list.

Thanks,

-Jim

Re: Intel NVMe troubles?

2016-08-01 Thread Jim Harris
On Mon, Aug 1, 2016 at 7:38 AM, Borja Marcos <bor...@sarenet.es> wrote:

>
> > On 29 Jul 2016, at 17:44, Jim Harris <jim.har...@gmail.com> wrote:
> >
> >
> >
> > On Fri, Jul 29, 2016 at 1:10 AM, Borja Marcos <bor...@sarenet.es> wrote:
> >
> > > On 28 Jul 2016, at 19:25, Jim Harris <jim.har...@gmail.com> wrote:
> > >
> > > Yes, you should worry.
> > >
> > > Normally we could use the dump_debug sysctls to help debug this - these
> > > sysctls will dump the NVMe I/O submission and completion queues.  But
> in
> > > this case the LBA data is in the payload, not the NVMe submission
> entries,
> > > so dump_debug will not help as much as dumping the NVMe DSM payload
> > > directly.
> > >
> > > Could you try the attached patch and send output after recreating your
> pool?
> >
> > Just in case the evil anti-spam ate my answer, sent the results to your
> Gmail account.
> >
> >
> > Thanks Borja.
> >
> > It looks like all of the TRIM commands are formatted properly.  The
> failures do not happen until about 10 seconds after the last TRIM to each
> drive was submitted, and immediately before TRIMs start to the next drive,
> so I'm assuming the failures are for the last few TRIM commands but
> cannot say for sure.  Could you apply patch v2 (attached) which will dump
> the TRIM payload contents inline with the failure messages?
>
> Sure, this is the complete /var/log/messages starting with the system
> boot. Before booting I destroyed the pool
> so that you could capture what happens when booting, zpool create, etc.
>
> Remember that the drives are in LBA format #3 (4 KB blocks). As far as I
> know that’s preferred to the old 512 byte blocks.
>
> Thank you very much and sorry about the belated response.


Hi Borja,

Thanks for the additional testing.  This has all of the detail that I need
for now.

-Jim

Re: Intel NVMe troubles?

2016-07-29 Thread Jim Harris
On Fri, Jul 29, 2016 at 1:10 AM, Borja Marcos <bor...@sarenet.es> wrote:

>
> > On 28 Jul 2016, at 19:25, Jim Harris <jim.har...@gmail.com> wrote:
> >
> > Yes, you should worry.
> >
> > Normally we could use the dump_debug sysctls to help debug this - these
> > sysctls will dump the NVMe I/O submission and completion queues.  But in
> > this case the LBA data is in the payload, not the NVMe submission
> entries,
> > so dump_debug will not help as much as dumping the NVMe DSM payload
> > directly.
> >
> > Could you try the attached patch and send output after recreating your
> pool?
>
> Just in case the evil anti-spam ate my answer, sent the results to your
> Gmail account.
>
>
Thanks Borja.

It looks like all of the TRIM commands are formatted properly.  The
failures do not happen until about 10 seconds after the last TRIM to each
drive was submitted, and immediately before TRIMs start to the next drive,
so I'm assuming the failures are for the last few TRIM commands but
cannot say for sure.  Could you apply patch v2 (attached) which will dump
the TRIM payload contents inline with the failure messages?

Thanks,

-Jim


delete_debug_v2.patch
Description: Binary data

Re: Intel NVMe troubles?

2016-07-28 Thread Jim Harris
On Thu, Jul 28, 2016 at 3:29 AM, Borja Marcos  wrote:

> Hi :)
>
> Still experimenting with NVMe drives and FreeBSD, and I have run into
> problems, I think.
>
> I’ve got a server with 10 Intel DC P3500 NVMe drives. Right now, running
> 11-BETA2.
>
> I have updated the firmware in the drives to the latest version (8DV10174)
> using the Data Center Tools.
> And I’ve formatted them for 4 KB blocks (LBA format #3)
>
> nvmecontrol identify nvme0ns1
> Size (in LBAs):  488378646 (465M)
> Capacity (in LBAs):  488378646 (465M)
> Utilization (in LBAs):   488378646 (465M)
> Thin Provisioning:   Not Supported
> Number of LBA Formats:   7
> Current LBA Format:  LBA Format #03
> LBA Format #00: Data Size:   512  Metadata Size: 0
> LBA Format #01: Data Size:   512  Metadata Size: 8
> LBA Format #02: Data Size:   512  Metadata Size:16
> LBA Format #03: Data Size:  4096  Metadata Size: 0
> LBA Format #04: Data Size:  4096  Metadata Size: 8
> LBA Format #05: Data Size:  4096  Metadata Size:64
> LBA Format #06: Data Size:  4096  Metadata Size:   128
>
>
> ZFS properly detects the 4 KB block size and sets the correct ashift (12).
> But I’ve found these error messages
> generated while I created a pool (zpool create tank raidz2 /dev/nvd[0-8]
> spare /dev/nvd9)
>
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:63
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:63 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:62
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:62 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:61
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:61 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:60
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:60 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:59
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:59 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:58
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:58 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:57
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:57 cdw0:0
> Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:56
> nsid:1
> Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
> cid:56 cdw0:0
>
> And the same for the rest of the drives [0-9].
>
> Should I worry?
>

Yes, you should worry.

Normally we could use the dump_debug sysctls to help debug this - these
sysctls will dump the NVMe I/O submission and completion queues.  But in
this case the LBA data is in the payload, not the NVMe submission entries,
so dump_debug will not help as much as dumping the NVMe DSM payload
directly.
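
As a rough illustration of what dumping the DSM payload involves (a Python
sketch, not the attached patch): per the NVMe specification, each Dataset
Management range entry is 16 bytes, holding 4 bytes of context attributes, a
4-byte length in logical blocks and an 8-byte starting LBA, all little-endian.
The helper names and the two sample ranges below are hypothetical; the
namespace size is the "Size (in LBAs)" value from the identify output quoted
above.

```python
import struct

# One NVMe Dataset Management range entry: context attributes (u32),
# length in logical blocks (u32), starting LBA (u64), little-endian.
DSM_RANGE = struct.Struct("<IIQ")


def decode_dsm_payload(payload: bytes, num_ranges: int):
    """Return a list of (starting_lba, length_in_blocks) tuples."""
    ranges = []
    for i in range(num_ranges):
        _attrs, length, start = DSM_RANGE.unpack_from(payload, i * DSM_RANGE.size)
        ranges.append((start, length))
    return ranges


def out_of_range(ranges, namespace_lbas: int):
    """Return the ranges whose last block would fall past the namespace end."""
    return [(s, n) for (s, n) in ranges if s + n > namespace_lbas]


if __name__ == "__main__":
    NAMESPACE_LBAS = 488378646  # "Size (in LBAs)" from nvmecontrol identify
    # Two made-up ranges: one valid, one running 8 blocks past the end.
    payload = DSM_RANGE.pack(0, 1024, 0) + DSM_RANGE.pack(0, 16, NAMESPACE_LBAS - 8)
    decoded = decode_dsm_payload(payload, 2)
    for start, length in decoded:
        print(f"start LBA {start}, length {length}")
    print("out of range:", out_of_range(decoded, NAMESPACE_LBAS))
```

A range is flagged only when its starting LBA plus its length exceeds the
namespace size, which is the condition an LBA OUT OF RANGE status would be
expected to report.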

Could you try the attached patch and send output after recreating your pool?

-Jim

Thanks!
>
>
>
>
> Borja.
>
>
>


delete_debug.patch
Description: Binary data

Re: FreeBSD 10.3 - nvme regression

2016-03-07 Thread Jim Harris
On Mon, Mar 7, 2016 at 5:33 AM, Borja Marcos  wrote:

>
> Hello,
>
> I am trying a SuperMicro server with NVME disks. The system boots FreeBSD
> 10.2, panics when booting FreeBSD 10.3.
>
> It was compiled on March 7th and Revision 296191 is included.
>
> On 10.3 it’s crashing right after this line:
>
> nvme9:  mem 0xfba1-0xfba13fff irq 59 at
> device 0.0 on pci134
>
> with a panic.
>
> panic: couldn’t find an APIC vector for IRQ 59.
>
> cpuid = 0
> The backtrace is (sorry, copying from a screen video)
>
> #0 kdb_backtrace=0x60
> #1 vpanic+0x126
> #2 panic+0x43
> #3 ioapic_disable_intr+0
> #4 intr_add_handler+0xfb
> #5 nexus_setup_inter+0x8a
> #6 pci_setup_intr+0x33
> #7 pci_setup_intr+0x33
> #8 bus_setup_intr+0xac
> #9 nvme_ctrlr_configure_intx+0x88
> #10 nvme_ctrlr_construct+0x407
> #11 nvme_attach+0x20
> #12 device_attach+0x43d
> #13 bus_generic_attach+0x2d
> #14 acpi_pci_attach+0x15c
> #15 device_attach+0x43d
> #16 bus_generic_attach+0x2d
> #17 acpi_pcib_attach+0x22c
>
> It said “Uptime 1s” and did a cold reboot.
>

Hi,

(Moving to freebsd-stable.  NVMe is not associated with the SCSI stack at
all.)

Can you please file a bug report on this?

Also, can you try setting the following loader variable before install?

hw.nvme.min_cpus_per_ioq=4

I am fairly certain you are hitting bug 199321, and since you have so many
devices in your system (NVMe + NICs) allocating per-CPU MSIx vectors, that
this last NVMe device cannot even allocate one APIC vector entry for an
INTx interrupt.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321
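
A back-of-the-envelope sketch of why the loader variable helps (Python; the
per-controller formula is a simplifying assumption for illustration, not the
driver's actual logic, and it assumes nvme0 through nvme9 are the only NVMe
controllers): if each controller asks for one vector per I/O queue plus one
for the admin queue, raising min_cpus_per_ioq shrinks the request.

```python
def vectors_per_controller(ncpus: int, min_cpus_per_ioq: int = 1) -> int:
    """Assumed model: one I/O queue per group of CPUs, plus one admin queue."""
    io_queues = max(1, ncpus // min_cpus_per_ioq)
    return io_queues + 1


def total_vectors(ncontrollers: int, ncpus: int, min_cpus_per_ioq: int = 1) -> int:
    """Vectors requested by all controllers together under the same model."""
    return ncontrollers * vectors_per_controller(ncpus, min_cpus_per_ioq)


if __name__ == "__main__":
    NCPUS = 32         # "Multiprocessor System Detected: 32 CPUs" in the dmesg
    NCONTROLLERS = 10  # nvme0..nvme9 (assumed to be all of them)
    print("default:", total_vectors(NCONTROLLERS, NCPUS))
    print("min_cpus_per_ioq=4:", total_vectors(NCONTROLLERS, NCPUS, 4))
```

Under this model the request drops from 330 vectors to 90, before counting
whatever the NICs allocate.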

-Jim


>
>
>
>
> dmesg.boot from 10.2 (the system is installed on a memory stick).
>
> root@ssd9:/usr/src/sys/dev/nvme # cat /var/run/dmesg.boot
> Copyright (c) 1992-2015 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 10.2-RELEASE #0 r28: Wed Aug 12 15:26:37 UTC 2015
> r...@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
> CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2400.04-MHz K8-class CPU)
>   Origin="GenuineIntel"  Id=0x306f2  Family=0x6  Model=0x3f  Stepping=2
>
> Features=0xbfebfbff
>
> Features2=0x7ffefbff
>   AMD Features=0x2c100800
>   AMD Features2=0x21
>   Structured Extended
> Features=0x37ab
>   XSAVE Features=0x1
>   VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
>   TSC: P-state invariant, performance statistics
> real memory  = 137438953472 (131072 MB)
> avail memory = 133409718272 (127229 MB)
> Event timer "LAPIC" quality 600
> ACPI APIC Table: 
> FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
> FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 SMT threads
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP): APIC ID:  1
>  cpu2 (AP): APIC ID:  2
>  cpu3 (AP): APIC ID:  3
>  cpu4 (AP): APIC ID:  4
>  cpu5 (AP): APIC ID:  5
>  cpu6 (AP): APIC ID:  6
>  cpu7 (AP): APIC ID:  7
>  cpu8 (AP): APIC ID:  8
>  cpu9 (AP): APIC ID:  9
>  cpu10 (AP): APIC ID: 10
>  cpu11 (AP): APIC ID: 11
>  cpu12 (AP): APIC ID: 12
>  cpu13 (AP): APIC ID: 13
>  cpu14 (AP): APIC ID: 14
>  cpu15 (AP): APIC ID: 15
>  cpu16 (AP): APIC ID: 16
>  cpu17 (AP): APIC ID: 17
>  cpu18 (AP): APIC ID: 18
>  cpu19 (AP): APIC ID: 19
>  cpu20 (AP): APIC ID: 20
>  cpu21 (AP): APIC ID: 21
>  cpu22 (AP): APIC ID: 22
>  cpu23 (AP): APIC ID: 23
>  cpu24 (AP): APIC ID: 24
>  cpu25 (AP): APIC ID: 25
>  cpu26 (AP): APIC ID: 26
>  cpu27 (AP): APIC ID: 27
>  cpu28 (AP): APIC ID: 28
>  cpu29 (AP): APIC ID: 29
>  cpu30 (AP): APIC ID: 30
>  cpu31 (AP): APIC ID: 31
> ioapic0  irqs 0-23 on motherboard
> ioapic1  irqs 24-47 on motherboard
> ioapic2  irqs 48-71 on motherboard
> random:  initialized
> module_register_init: MOD_LOAD (vesa, 0x80db8eb0, 0) error 19
> kbd1 at kbdmux0
> acpi0:  on motherboard
> acpi0: Power Button (fixed)
> cpu0:  on acpi0
> cpu1:  on acpi0
> cpu2:  on acpi0
> cpu3:  on acpi0
> cpu4:  on acpi0
> cpu5:  on acpi0
> cpu6:  on acpi0
> cpu7:  on acpi0
> cpu8:  on acpi0
> cpu9:  on acpi0
> cpu10:  on acpi0
> cpu11:  on acpi0
> cpu12:  on acpi0
> cpu13:  on acpi0
> cpu14:  on acpi0
> cpu15:  on acpi0
> cpu16:  on acpi0
> cpu17:  on acpi0
> cpu18:  on acpi0
> cpu19:  on acpi0
> cpu20:  on acpi0
> cpu21:  on acpi0
> cpu22:  on acpi0
> cpu23:  on acpi0
> cpu24:  on acpi0
> cpu25:  on acpi0
> cpu26:  on acpi0
> cpu27:  on acpi0
> cpu28:  on acpi0
> cpu29:  on acpi0
> cpu30:  on acpi0
> cpu31:  on 

Re: Dell NVMe issues

2015-10-06 Thread Jim Harris
On Tue, Oct 6, 2015 at 9:42 AM, Steven Hartland <kill...@multiplay.co.uk>
wrote:

> Also looks like nvme exposes a timeout_period sysctl you could try
> increasing that as it could be too small for a full disk TRIM.
>

> Under CAM SCSI da support we have a delete_max which limits the max single
> request size for a delete it may be we need something similar for nvme as
> well to prevent this as it should still be chunking the deletes to ensure
> this sort of thing doesn't happen.


See attached.  Sean - can you try this patch with TRIM re-enabled in ZFS?

I would be curious if TRIM passes without this patch if you increase the
timeout_period as suggested.
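
For reference, the kind of chunking Steven describes can be sketched like this
(Python for brevity; `max_blocks` stands in for a delete_max-style limit and
is a hypothetical parameter, and this is not the contents of nvd.patch):

```python
def chunk_delete(start_lba: int, num_blocks: int, max_blocks: int):
    """Split one large delete into (start_lba, length) pieces of at most max_blocks."""
    if max_blocks <= 0:
        raise ValueError("max_blocks must be positive")
    chunks = []
    remaining = num_blocks
    lba = start_lba
    while remaining > 0:
        length = min(remaining, max_blocks)
        chunks.append((lba, length))
        lba += length
        remaining -= length
    return chunks


if __name__ == "__main__":
    # A 10-block delete with a 4-block limit becomes three requests.
    print(chunk_delete(100, 10, 4))
```

Each piece can then be issued as its own request, so no single command has to
cover the whole disk.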

-Jim




>
>
> On 06/10/2015 16:18, Sean Kelly wrote:
>
>> Back in May, I posted about issues I was having with a Dell PE R630 with
>> 4x800GB NVMe SSDs. I would get kernel panics due to the inability to assign
>> all the interrupts because of
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321. Jim Harris
>> helped fix this issue so I bought several more of these servers, including
>> ones with 4x1.6TB drives…
>>
>> While the new servers with 4x800GB drives still work, the ones with
>> 4x1.6TB drives do not. When I do a
>> zpool create tank mirror nvd0 nvd1 mirror nvd2 nvd3
>> the command never returns and the kernel logs:
>> nvme0: resetting controller
>> nvme0: controller ready did not become 0 within 2000 ms
>>
>> I’ve tried several different things trying to understand where the actual
>> problem is.
>> WORKS: dd if=/dev/nvd0 of=/dev/null bs=1m
>> WORKS: dd if=/dev/zero of=/dev/nvd0 bs=1m
>> WORKS: newfs /dev/nvd0
>> FAILS: zpool create tank mirror nvd[01]
>> FAILS: gpart add -t freebsd-zfs nvd[01] && zpool create tank mirror
>> nvd[01]p1
>> FAILS: gpart add -t freebsd-zfs -s 1400g nvd[01] && zpool create tank
>> nvd[01]p1
>> WORKS: gpart add -t freebsd-zfs -s 800g nvd[01] && zpool create tank
>> nvd[01]p1
>>
>> NOTE: The above commands are more about getting the point across, not
>> validity. I wiped the disk clean between gpart attempts and used GPT.
>>
>> So it seems like zpool works if I don’t cross past ~800GB. But other
>> things like dd and newfs work.
>>
>> When I get the kernel messages about the controller resetting and then
>> not responding, the NVMe subsystem hangs entirely. Since my boot disks are
>> not NVMe, the system continues to work but no more NVMe stuff can be done.
>> Further, attempting to reboot hangs and I have to do a power cycle.
>>
>> Any thoughts on what the deal may be here?
>>
>> 10.2-RELEASE-p5
>>
>> nvme0@pci0:132:0:0: class=0x010802 card=0x1f971028 chip=0xa820144d
>> rev=0x03 hdr=0x00
>>  vendor = 'Samsung Electronics Co Ltd'
>>  class  = mass storage
>>  subclass   = NVM
>>
>>
>


nvd.patch
Description: Binary data

Re: Dell NVMe issues

2015-10-06 Thread Jim Harris
On Tue, Oct 6, 2015 at 11:46 AM, Steven Hartland <kill...@multiplay.co.uk>
wrote:

> On 06/10/2015 19:03, Jim Harris wrote:
>
>
>
> On Tue, Oct 6, 2015 at 9:42 AM, Steven Hartland <kill...@multiplay.co.uk>
> wrote:
>
>> Also looks like nvme exposes a timeout_period sysctl you could try
>> increasing that as it could be too small for a full disk TRIM.
>>
>
>> Under CAM SCSI da support we have a delete_max which limits the max
>> single request size for a delete it may be we need something similar for
>> nvme as well to prevent this as it should still be chunking the deletes to
>> ensure this sort of thing doesn't happen.
>
>
> See attached.  Sean - can you try this patch with TRIM re-enabled in ZFS?
>
> I would be curious if TRIM passes without this patch if you increase the
> timeout_period as suggested.
>
> -Jim
>
>
> Interesting. Does the NVMe spec not provide information from the device as
> to what its optimal / max deallocate request size should be, like the ATA
> spec exposes?
>

Correct - there is no way for devices to specify a max/optimal deallocate
size in NVMe.  There is an implicit limit from the 32-bit LBA length in the
DSM Range data structure defined by the spec.  So this patch is needed
anyways to make sure we don't overflow the 32-bit LBA length.  Sean's
drives are 1.6TB which fits in an unsigned 32-bit value on a 512-byte
sector formatted controller, so I don't think that is the problem in Sean's
case.
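
Spelling out the arithmetic (a quick Python check; 1.6TB is taken as
1.6 * 10^12 bytes, which is an assumption about how the drive capacity is
quoted):

```python
MAX_DSM_RANGE_BLOCKS = 2**32 - 1  # largest value of a 32-bit LBA length field


def blocks_needed(capacity_bytes: int, sector_size: int) -> int:
    """Number of logical blocks covering the whole capacity (rounded up)."""
    return -(-capacity_bytes // sector_size)


def fits_in_one_range(capacity_bytes: int, sector_size: int) -> bool:
    """True if a single DSM range can describe the entire device."""
    return blocks_needed(capacity_bytes, sector_size) <= MAX_DSM_RANGE_BLOCKS


if __name__ == "__main__":
    capacity = 1_600_000_000_000  # 1.6 TB, decimal (assumed)
    print(blocks_needed(capacity, 512))      # 3125000000
    print(MAX_DSM_RANGE_BLOCKS)              # 4294967295
    print(fits_in_one_range(capacity, 512))  # True
```

With 512-byte sectors a single range covers just under 2 TiB, so a larger
device would need the delete split into several ranges.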

-Jim



> Regards
> Steve
>


Re: ISCI bus_alloc_resource failed

2015-09-08 Thread Jim Harris
On Mon, Sep 7, 2015 at 10:37 PM, Bradley W. Dutton <
brad-fbsd-sta...@duttonbros.com> wrote:

> Quoting Jim Harris <jim.har...@gmail.com>:
>
> On Mon, Sep 7, 2015 at 7:29 PM, Bradley W. Dutton <
>> brad-fbsd-sta...@duttonbros.com> wrote:
>>
>> There are 2 devices in the same group so I passed both of them:
>>> http://duttonbrosllc.com/misc/vmware_esxi_passthrough_config.png
>>>
>>> At the time I wasn't sure if this was necessary but I just tried the
>>> Centos 7 VM and it worked without the SMBus device being passed through.
>>> I
>>> then tried the FreeBSD VM without SMBus and saw the same allocation error
>>> as before. Looks like the SMBus device is a red herring?
>>>
>>>
>>> Looks like on ESXi we are using Xen HVM init ops, which do not enable
>> MSI.
>> And the isci driver is not reverting to INTx resource allocation when MSIx
>> vector allocation fails.  I've added reverting to INTx in the attached
>> patch - can you try once more?
>>
>> Thanks,
>>
>> -Jim
>>
>
> That patch worked. No allocation errors and the drives work as expected.
>
> Thanks again,
> Brad
>
>
Thanks Brad.  Committed as r287563 (pci_enable_busmaster) and r287564
(pci_alloc_msix check).


Re: ISCI bus_alloc_resource failed

2015-09-07 Thread Jim Harris
On Mon, Sep 7, 2015 at 7:29 PM, Bradley W. Dutton <
brad-fbsd-sta...@duttonbros.com> wrote:

> There are 2 devices in the same group so I passed both of them:
> http://duttonbrosllc.com/misc/vmware_esxi_passthrough_config.png
>
> At the time I wasn't sure if this was necessary but I just tried the
> Centos 7 VM and it worked without the SMBus device being passed through. I
> then tried the FreeBSD VM without SMBus and saw the same allocation error
> as before. Looks like the SMBus device is a red herring?
>
>
Looks like on ESXi we are using Xen HVM init ops, which do not enable MSI.
And the isci driver is not reverting to INTx resource allocation when MSIx
vector allocation fails.  I've added reverting to INTx in the attached
patch - can you try once more?
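
The fallback logic amounts to something like the following (a Python model,
not the driver code or the attached patch; `alloc_msix` and `alloc_intx` are
hypothetical callables standing in for the bus resource allocation calls):

```python
def setup_interrupts(alloc_msix, alloc_intx, wanted_vectors: int):
    """Try MSI-X first; fall back to a single legacy INTx interrupt on failure.

    alloc_msix(n) returns the number of vectors granted (0 on failure);
    alloc_intx() returns True if the legacy interrupt could be allocated.
    """
    granted = alloc_msix(wanted_vectors)
    if granted >= wanted_vectors:
        return ("msix", granted)
    if alloc_intx():
        return ("intx", 1)
    raise RuntimeError("no interrupt resources available")


if __name__ == "__main__":
    # Model a platform where MSI-X allocation fails, as in the ESXi case.
    print(setup_interrupts(lambda n: 0, lambda: True, 2))
```

The point is only that a failed MSI-X allocation should lead to an INTx
attempt rather than an immediate attach failure.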

Thanks,

-Jim


isci.patch
Description: Binary data

Re: ISCI bus_alloc_resource failed

2015-09-07 Thread Jim Harris
On Mon, Sep 7, 2015 at 10:34 AM, Bradley W. Dutton <
brad-fbsd-sta...@duttonbros.com> wrote:

> Hi,
>
> I'm having trouble with the isci driver in both stable and current. I see
> the following dmesg in stable:
>
> isci0:  port
> 0x5000-0x50ff mem 0xe7afc000-0xe7af,0xe740-0xe77f irq 19 at
> device 0.0 on pci11
> isci: 1:51 ISCI bus_alloc_resource failed
>
>
> I'm running FreeBSD on VMWare ESXi 6 with vt-d passthrough of the isci
> devices, here is the relevant pciconf output:
>
> none2@pci0:3:0:0:   class=0x0c0500 card=0x062815d9 chip=0x1d708086
> rev=0x06 hdr=0x00
> vendor = 'Intel Corporation'
> device = 'C600/X79 series chipset SMBus Controller 0'
> class  = serial bus
> subclass   = SMBus
> cap 10[90] = PCI-Express 2 endpoint max data 128(128) link x32(x32)
>  speed 5.0(5.0) ASPM disabled(L0s)
> cap 01[cc] = powerspec 3  supports D0 D3  current D0
> cap 05[d4] = MSI supports 1 message
> ecap 000e[100] = ARI 1
> isci0@pci0:11:0:0:  class=0x010700 card=0x062815d9 chip=0x1d6b8086
> rev=0x06 hdr=0x00
> vendor = 'Intel Corporation'
> device = 'C602 chipset 4-Port SATA Storage Control Unit'
> class  = mass storage
> subclass   = SAS
> cap 01[98] = powerspec 3  supports D0 D3  current D0
> cap 10[c4] = PCI-Express 2 endpoint max data 128(128) link x32(x32)
>  speed 5.0(5.0) ASPM disabled(L0s)
> cap 11[a0] = MSI-X supports 2 messages
>  Table in map 0x10[0x2000], PBA in map 0x10[0x3000]
> ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
> ecap 000e[138] = ARI 1
> ecap 0017[180] = TPH Requester 1
> ecap 0010[140] = SRIOV 1
>
>
> I haven't tried booting on bare metal but running a linux distro (centos
> 7) in the same VM works without issue. Is it possible the SRIOV option is
> causing trouble? I don't see a BIOS option to disable that setting on this
> server like I have on some others. Any other ideas to get this working?
>

I do not think the SRIOV is the problem here.  I do notice that isci(4)
does not explicitly enable PCI busmaster, which will cause problems with
PCI passthrough.  I've attached a patch that rectifies that issue.  I'm not
certain that is the root cause of the interrupt resource allocation failure
though.

Could you:

1) Apply the attached patch and retest.
2) If you still see the resource allocation failure, reboot in verbose mode
and provide the resulting dmesg output.

Thanks,

-Jim


> Thanks,
> Brad
>
>


isci_busmaster.patch
Description: Binary data

Re: Problems adding Intel 750 to zfs pool

2015-07-21 Thread Jim Harris
On Fri, Jul 17, 2015 at 9:35 PM, dy...@techtangents.com wrote:

 Hi,

 I've installed an Intel 750 400GB NVMe PCIe SSD in a Dell R320 running
 FreeBSD 10.2-beta-1... not STABLE, but not far behind, I think. Apologies
 if this is the wrong mailing list, or if this has been fixed in STABLE
 since the beta.

 Anyway, I've gparted it into 2 partitions - 16GB for slog/zil and 357GB
 for l2arc. Adding the slog partition to the pool takes about 2 minutes -
 machine seems hung during that time. Ping works, but I can't open another
 ssh session.

 Adding the l2arc doesn't seem to complete - it's been going 10 minutes now
 and nothing. Ping works, but I can't log in to the local console or another
 ssh session.

 I'm adding the partitions using their gpt names. i.e.
 zpool add zroot log gpt/slog
 zpool add zroot cache gpt/l2arc

 The system BIOS is up-to-date. The OS was a fresh 10.1 install, then
 freebsd-update to 10.2-beta2. 10.1 exhibited the same symptoms.

 Root is on zfs.

 Device was tested to be working on Windows 8.1 on a Dell T1700 workstation.

 Any ideas?


Hi Dylan,

I just committed SVN r285767 which should fix this issue.  I will request
MFC to stable/10 after the 3 day waiting period.

Thanks,

-Jim


 Cheers,

 Dylan Just



Re: 10.1 NVMe kernel panic

2015-06-02 Thread Jim Harris
On Thu, May 21, 2015 at 8:33 AM, Sean Kelly smke...@smkelly.org wrote:

 Greetings.

 I have a Dell R630 server with four of Dell’s 800GB NVMe SSDs running
 FreeBSD 10.1-p10. According to the PCI vendor, they are some sort of
 rebranded Samsung drive. If I boot the system and then load nvme.ko and
 nvd.ko from a command line, the drives show up okay. If I put
 nvme_load="YES"
 nvd_load="YES"
 in /boot/loader.conf, the box panics on boot:
 panic: nexus_setup_intr: NULL irq resource!

 If I boot the system with “Safe Mode: ON” from the loader menu, it also
 boots successfully and the drives show up.

 You can see a full ‘boot -v’ here:
 http://smkelly.org/stuff/nvme-panic.txt

 Anyone have any insight into what the issue may be here? Ideally I need to
 get this working in the next few days or return this thing to Dell.


Hi Sean,

Can you try adding hw.nvme.force_intx=1 to /boot/loader.conf?

I suspect you are able to load the drivers successfully after boot because
interrupt assignments are not restricted to CPU0 at that point - see
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321 for a related
issue.  Your logs clearly show that vectors were allocated for the first 2
NVMe SSDs, but the third could not get its full allocation.  There is a bug
in the INTx fallback code that needs to be fixed - you do not hit this bug
when loading after boot because bug #199321 only affects interrupt
allocation during boot.

If the force_intx test works, would you be able to upgrade your nvme drivers
to the latest on stable/10?  There are several patches (one related to
interrupt vector allocation) that have been pushed to stable/10 since 10.1
was released, and I will be pushing another patch for the issue you have
reported shortly.

Thanks,

-Jim





 Thanks!

 --
 Sean Kelly
 smke...@smkelly.org
 http://smkelly.org


Re: AHCI Patsburg SATA controller and slow transfer speed

2013-07-03 Thread Jim Harris
On Thu, Jun 27, 2013 at 6:38 PM, Jeremy Chadwick j...@koitsu.org wrote:


 Intel Patsburg is otherwise known as Intel X79.  The X79
 chipset/southbridge offers 6 SATA ports, 2 of which are SATA600, and the
 remaining 4 are SATA300:

 http://en.wikipedia.org/wiki/Intel_X79


While Wikipedia correctly says Intel X79 is codenamed Patsburg, most
Patsburgs are actually known as the C60x family of chipsets.  X79 pairs
with a Core i7 while the C60x pairs with a Xeon E5.

ark.intel.com is a very good decoder ring for all of the Intel code names.

http://ark.intel.com/products/codename/29968/Patsburg

-Jim


Re: Strange CAM errors

2012-12-17 Thread Jim Harris
On Mon, Dec 17, 2012 at 9:26 AM, Willem Jan Withagen w...@digiware.nl wrote:

 On 2012-12-17 15:38, Steven Hartland wrote:
  Check the smart results of each disk in the array you may have a failing
  disk.
  - Original Message - From: Willem Jan Withagen 
 w...@digiware.nl
  To: FreeBSD Stable Users freebsd-stable@freebsd.org
  Sent: Monday, December 17, 2012 10:58 AM
  Subject: Strange CAM errors
 
 
  Hi,
 
  I have not noticed this before, but my system rebooted this morning and
  in the following security report I found a lot of messages in the
  dmesg-part like:
 
  +(probe0:arcmsr0:0:16:1): INQUIRY. CDB: 12 20 0 0 24 0
  +(probe0:arcmsr0:0:16:1): CAM status: Command timeout
  +(probe0:arcmsr0:0:16:1): Retrying command
  +(probe0:arcmsr0:0:16:1): INQUIRY. CDB: 12 20 0 0 24 0
  +(probe0:arcmsr0:0:16:1): CAM status: Command timeout
  +(probe0:arcmsr0:0:16:1): Retrying command
 
  And it seems that bus 16 is:
  +pass6 at arcmsr0 bus 0 scbus0 target 16 lun 0
  +pass6: Areca RAID controller R001 Fixed Processor SCSI-0 device
 
  The system has been running
  FreeBSD zfs.digiware.nl 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #3: Wed
  Nov 14 13:25:55 CET 2012
  r...@zfs.digiware.nl:/usr/obj/usr/srcs/src9/src/sys/ZFS  amd64
  for already a while.
 
  Any suggestions as to why I have these messages?
 
  They are during the boot sequence, so no smartd talking to the disks at
  that moment.
 
  --WjW
 
  ps: dmesg, config, etc at:

  http://www.tegenbosch28.nl/FreeBSD/Systems/ZFS
  ps2: upgrading to the most recent 9.1

 'mmm,

 Smartd seems to think otherwise...

 'camcontrol rescan all' actually delivers the same pack of errors.

 --WjW


The timeouts are occurring on inquiry commands to non-zero LUNs.  arcmsr(4)
is returning CAM_SEL_TIMEOUT instead of CAM_DEV_NOT_THERE for inquiry
commands to this device and LUN > 0.  CAM_DEV_NOT_THERE is preferred to
remove these types of warnings, and similar patches have gone in for
other SCSI drivers recently.

Can you try this patch?

Index: sys/dev/arcmsr/arcmsr.c
===================================================================
--- sys/dev/arcmsr/arcmsr.c (revision 244190)
+++ sys/dev/arcmsr/arcmsr.c (working copy)
@@ -2439,7 +2439,7 @@
         char *buffer=pccb->csio.data_ptr;

         if (pccb->ccb_h.target_lun) {
-            pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
+            pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
             xpt_done(pccb);
             return;
         }
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Strange CAM errors

2012-12-17 Thread Jim Harris
On Mon, Dec 17, 2012 at 2:45 PM, Willem Jan Withagen <w...@digiware.nl> wrote:

 On 17-12-2012 20:16, Jim Harris wrote:
  The timeouts are occurring on inquiry commands to non-zero LUNs.
  arcmsr(4) is returning CAM_SEL_TIMEOUT instead of CAM_DEV_NOT_THERE for
  inquiry commands to this device and LUN > 0.  CAM_DEV_NOT_THERE is
  preferred to remove these types of warnings, and similar patches have
  gone in for other SCSI drivers recently.
 
  Can you try this patch?
 
  Index: sys/dev/arcmsr/arcmsr.c
  ===
  --- sys/dev/arcmsr/arcmsr.c (revision 244190)
  +++ sys/dev/arcmsr/arcmsr.c (working copy)
  @@ -2439,7 +2439,7 @@
           char *buffer=pccb->csio.data_ptr;
   
           if (pccb->ccb_h.target_lun) {
  -            pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
  +            pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
               xpt_done(pccb);
               return;
           }
 

 Hi Jim,

 The noise has gone down by a factor of 5, now I get:

 (probe6:arcmsr0:0:16:1): INQUIRY. CDB: 12 20 0 0 24 0
 (probe6:arcmsr0:0:16:1): CAM status: Unable to terminate I/O CCB request
 (probe6:arcmsr0:0:16:1): Error 5, Unretryable error
 (probe6:arcmsr0:0:16:2): INQUIRY. CDB: 12 40 0 0 24 0

 That message is defined in sys/cam/cam.c as CAM_UA_TERMIO, but that
 status is not set anywhere in the arcmsr code.


There is something out of sync on your system.  I just noticed this, but
your original error messages were showing "Command timeout"
(CAM_CMD_TIMEOUT) even though the driver was returning CAM_SEL_TIMEOUT.
Now in this case, the driver is returning CAM_DEV_NOT_THERE, but CAM is
printing the error message for CAM_UA_TERMIO.  In both cases, the driver is
returning value X, but CAM is interpreting it as X+1.  So CAM and arcmsr(4)
seem to have a different idea of the values of the cam_status enumeration.

Can you provide details on your build environment?  Are you building arcmsr
as a loadable module or do you specify "device arcmsr" in your kernel
config to link it statically?  I suspect a loadable module, although I
have no idea how these values would get out of sync since this enumeration
hasn't changed in probably 10+ years.

-Jim


Re: Strange CAM errors

2012-12-17 Thread Jim Harris
On Mon, Dec 17, 2012 at 3:21 PM, Willem Jan Withagen <w...@digiware.nl> wrote:

 On 17-12-2012 23:10, Jim Harris wrote:
 
 
  On Mon, Dec 17, 2012 at 2:45 PM, Willem Jan Withagen <w...@digiware.nl> wrote:
 
  On 17-12-2012 20:16, Jim Harris wrote:
   The timeouts are occurring on inquiry commands to non-zero LUNs.
   arcmsr(4) is returning CAM_SEL_TIMEOUT instead of CAM_DEV_NOT_THERE for
   inquiry commands to this device and LUN > 0.  CAM_DEV_NOT_THERE is
   preferred to remove these types of warnings, and similar patches have
   gone in for other SCSI drivers recently.
  
   Can you try this patch?
  
   Index: sys/dev/arcmsr/arcmsr.c
   ===
   --- sys/dev/arcmsr/arcmsr.c (revision 244190)
   +++ sys/dev/arcmsr/arcmsr.c (working copy)
   @@ -2439,7 +2439,7 @@
            char *buffer=pccb->csio.data_ptr;
    
            if (pccb->ccb_h.target_lun) {
   -            pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
   +            pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
                xpt_done(pccb);
                return;
            }
  
 
  Hi Jim,
 
  The noise has gone down by a factor of 5, now I get:
 
  (probe6:arcmsr0:0:16:1): INQUIRY. CDB: 12 20 0 0 24 0
  (probe6:arcmsr0:0:16:1): CAM status: Unable to terminate I/O CCB request
  (probe6:arcmsr0:0:16:1): Error 5, Unretryable error
  (probe6:arcmsr0:0:16:2): INQUIRY. CDB: 12 40 0 0 24 0
 
  That message is defined in sys/cam/cam.c as CAM_UA_TERMIO, but that
  status is not set anywhere in the arcmsr code.
 
 
  There is something out of sync on your system.  I just noticed this, but
  your original error messages were showing "Command timeout"
  (CAM_CMD_TIMEOUT) even though the driver was returning CAM_SEL_TIMEOUT.
  Now in this case, the driver is returning CAM_DEV_NOT_THERE, but CAM is
  printing the error message for CAM_UA_TERMIO.  In both cases, the driver
  is returning value X, but CAM is interpreting it as X+1.  So CAM and
  arcmsr(4) seem to have a different idea of the values of the cam_status
  enumeration.
 
  Can you provide details on your build environment?  Are you building
  arcmsr as a loadable module or do you specify "device arcmsr" in your
  kernel config to link it statically?  I suspect a loadable module,
  although I have no idea how these values would get out of sync since
  this enumeration hasn't changed in probably 10+ years.

 arcmsr is built into the kernel:

 [/usr/src] w...@zfs.digiware.nl kldstat
 Id Refs Address    Size     Name
  1   28 0x8020     b55be0   kernel
  2    1 0x80d56000 6138     nullfs.ko
  3    1 0x80d5d000 2153b0   zfs.ko
  4    2 0x80f73000 5e38     opensolaris.ko
  5    1 0x80f79000 f510     aio.ko
  6    1 0x80f89000 2a20     coretemp.ko
  7    1 0x81012000 316d4    nfscl.ko
  8    2 0x81044000 10827    nfscommon.ko

 And I just refetched 9.1-PRERELEASE this afternoon over svn.

 Could this have something to do with Clang vs. gcc?
 Not that I did anything to change this.

 Note that I have changed nothing other than the kernel config file.

 Both kernel and world were built at the same time this afternoon.
 With your patch applied, I rebuilt only the kernel and modules.


Never mind my earlier comment on out-of-sync.  It's another bug in
arcmsr(4) - CAM_REQ_CMP == 0x1, and in the LUN > 0 case here it ORs the
status values together, causing the off-by-one issue we were seeing.
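
To make the off-by-one concrete, here is a minimal userland sketch of the
effect.  It is not driver code: the enum is a hand-copied subset whose values
are assumed to mirror enum cam_status in sys/cam/cam.h, the sketch_* names
are hypothetical, and the status field is assumed to start out as zero.
Since 0x01 | 0x0a is 0x0b and 0x01 | 0x08 is 0x09, OR-ing an error on top of
CAM_REQ_CMP lands on the next enumerator.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Subset of the status codes named in this thread.  Assumption: the
 * values mirror enum cam_status in sys/cam/cam.h; check your own tree.
 */
enum cam_status_subset {
    CAM_REQ_CMP       = 0x01,
    CAM_DEV_NOT_THERE = 0x08,
    CAM_UA_TERMIO     = 0x09,
    CAM_SEL_TIMEOUT   = 0x0a,
    CAM_CMD_TIMEOUT   = 0x0b
};

/* Hypothetical stand-in for the status field of a CCB header. */
struct sketch_ccb_hdr {
    unsigned int status;
};

/* Old flow: CAM_REQ_CMP is OR'ed in first, then the error on top of it. */
unsigned int
sketch_old_flow(unsigned int error)
{
    struct sketch_ccb_hdr h = { 0 };

    h.status |= CAM_REQ_CMP;
    h.status |= error;
    return (h.status);
}

/* Patched flow: only the error is OR'ed into a clear status field. */
unsigned int
sketch_new_flow(unsigned int error)
{
    struct sketch_ccb_hdr h = { 0 };

    h.status |= error;
    return (h.status);
}

/* Print the status CAM would see in each case. */
void
sketch_report(void)
{
    /* The patched flow reports exactly the status that was requested. */
    assert(sketch_new_flow(CAM_DEV_NOT_THERE) == CAM_DEV_NOT_THERE);

    printf("old, CAM_SEL_TIMEOUT:   0x%02x\n",
        sketch_old_flow(CAM_SEL_TIMEOUT));
    printf("old, CAM_DEV_NOT_THERE: 0x%02x\n",
        sketch_old_flow(CAM_DEV_NOT_THERE));
    printf("new, CAM_DEV_NOT_THERE: 0x%02x\n",
        sketch_new_flow(CAM_DEV_NOT_THERE));
}
```

That matches both observations in this thread: "Command timeout" before the
first patch, and "Unable to terminate I/O CCB request" after it.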

Please try the following patch instead (reverting earlier patch):

Index: sys/dev/arcmsr/arcmsr.c
===
--- sys/dev/arcmsr/arcmsr.c (revision 244190)
+++ sys/dev/arcmsr/arcmsr.c (working copy)
@@ -2432,14 +2432,13 @@
 static void arcmsr_handle_virtual_command(struct AdapterControlBlock *acb,
         union ccb * pccb)
 {
-    pccb->ccb_h.status |= CAM_REQ_CMP;
     switch (pccb->csio.cdb_io.cdb_bytes[0]) {
     case INQUIRY: {
         unsigned char inqdata[36];
         char *buffer=pccb->csio.data_ptr;
 
         if (pccb->ccb_h.target_lun) {
-            pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
+            pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
             xpt_done(pccb);
             return;
         }
@@ -2455,6 +2454,7 @@
         strncpy(&inqdata[16], "RAID controller ", 16);  /* Product Identification */
         strncpy(&inqdata[32], "R001", 4); /* Product Revision */
         memcpy(buffer, inqdata, sizeof(inqdata));
+        pccb->ccb_h.status |= CAM_REQ_CMP;
         xpt_done(pccb);
     }
     break;
@@ -2464,10 +2464,12 @@
         pccb->ccb_h.status |= CAM_SCSI_STATUS_ERROR

Re: Strange CAM errors

2012-12-17 Thread Jim Harris
On Mon, Dec 17, 2012 at 4:52 PM, Willem Jan Withagen <w...@digiware.nl> wrote:

 Right,

 That did the trick.
 Thanx for the code.

 --WjW



Patch committed as r244369. It will get MFC'd but obviously won't be in 9.1.

Thanks,

-Jim


Re: tws bug ? (LSI SAS 9750)

2012-10-26 Thread Jim Harris
On Fri, Oct 26, 2012 at 1:18 PM, John-Mark Gurney <j...@funkthat.com> wrote:

 I'm seeing similar stuff on the hpt27xx driver:
 (probe18:hpt27xx0:0:18:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe18:hpt27xx0:0:18:0): CAM status: Invalid Target ID
 (probe18:hpt27xx0:0:18:0): Error 22, Unretryable error

 Should I make a similar change in sys/dev/hpt27xx/osm_bsd.c?  Looks like
 there are two CAM_TID_INVALID lines, but from reading the comments, only
 the second one should change...

 Correct?  If so, I'll try making the change and make sure everything
 works well.


Yes - I agree that a similar change is needed, and only to the second
one in that file.


Re: tws bug ? (LSI SAS 9750)

2012-09-22 Thread Jim Harris
On Sat, Sep 22, 2012 at 1:32 AM, Thomas Mueller <muelle...@insightbb.com> wrote:

 The specific subject of this thread is not my issue, but I did notice
 problems apparently related to CAM on a SATA hard drive.


I would suggest starting a new thread if you have a different issue.

 I use one UFS partition, with FreeBSD 9.0-BETA1 installed (subsequently
 updated on another partition, using GPT as opposed to MBR), for ports tree
 and also NetBSD pkgsrc and NetBSD source code.  I built NetBSD 5.1_STABLE i386
 from FreeBSD and also built xorg-modular on the new NetBSD installation from
 pkgsrc.  Going into and out of the newly installed Xorg resulted in some
 crashes with the FreeBSD 9.0-BETA1 partition mounted and not cleanly
 unmounted.  File system was damaged, and FreeBSD fsck_ffs wouldn't fix it,
 went into a loop:


 Script started on Wed Sep 19 04:15:02 2012
 fsck_ffs /dev/ada0p9
 ** /dev/ada0p9
 ** Last Mounted on /BETA1
 ** Phase 1 - Check Blocks and Sizes

 CANNOT READ BLK: 7584192
 CONTINUE? [yn] y

 THE FOLLOWING DISK SECTORS COULD NOT BE READ: 7584318, 7584319,
 ** Phase 2 - Check Pathnames
 ** Phase 3 - Check Connectivity
 ** Phase 4 - Check Reference Counts
 ** Phase 5 - Check Cyl groups
 1475900 files, 4638292 used, 21162419 free (61643 frags, 2637597 blocks, 0.2% 
 fragmentation)

 ***** FILE SYSTEM STILL DIRTY *****

 ***** PLEASE RERUN FSCK *****

 Script done on Wed Sep 19 04:17:27 2012


 This happened repeatedly, meaning an impasse.

 I didn't get to record preceding error messages relating to ATA and CAM but,
 seeing this last message, wonder if there are some bugs in the CAM.

 I booted that new NetBSD 5.1_STABLE i386 installation, on a USB stick, was
 able to mount that partition and see it wasn't trashed though there was a
 message about the dirty flag.  I then umounted and ran NetBSD fsck_ffs
 successfully, just a few files were lost, and FreeBSD can access that
 partition again.

 I still intend to be more cautious when in NetBSD, not mounting a FreeBSD
 partition unnecessarily when doing something crash-prone on my system in
 NetBSD, such as going into and out of X.

 Tom


Re: tws bug ? (LSI SAS 9750)

2012-09-21 Thread Jim Harris
On Fri, Sep 21, 2012 at 1:07 PM, Mike Tancsa <m...@sentex.net> wrote:
 Hi,
 I have been trying out a nice new tws controller and decided to enable
 debugging in the kernel and run some stress tests.  With a regular
 GENERIC kernel, it boots up fine.  But with debugging, it panics on
 boot. Does anyone know what's up? Is this something that should be sent
 directly to LSI?

Through a code inspection, this mutex is being recursed whether or not
debugging is enabled.  There is no code path here specific to
INVARIANTS.  And the main IO path in this driver is always recursing
on this lock - it is not specific to the initialization callstack you
listed below.

The best course of action seems to be initializing the lock with
MTX_RECURSE, since the driver seems to expect to be able to recurse on
the io_lock.  Can you try the following patch?

diff --git a/sys/dev/tws/tws.c b/sys/dev/tws/tws.c
index b1615db..d156d40 100644
--- a/sys/dev/tws/tws.c
+++ b/sys/dev/tws/tws.c
@@ -197,7 +197,7 @@ tws_attach(device_t dev)
     mtx_init( &sc->q_lock, "tws_q_lock", NULL, MTX_DEF);
     mtx_init( &sc->sim_lock,  "tws_sim_lock", NULL, MTX_DEF);
     mtx_init( &sc->gen_lock,  "tws_gen_lock", NULL, MTX_DEF);
-    mtx_init( &sc->io_lock,  "tws_io_lock", NULL, MTX_DEF);
+    mtx_init( &sc->io_lock,  "tws_io_lock", NULL, MTX_DEF | MTX_RECURSE);
 
     if ( tws_init_trace_q(sc) == FAILURE )
         printf("trace init failure\n");
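
For readers unfamiliar with recursive locks, here is a userland analogy
using POSIX threads.  It is only a sketch and does not use the kernel
mutex(9) API; second_lock_result() is a hypothetical helper.  An
error-checking mutex stands in for a non-recursive mutex under a debugging
kernel (it reports the recursion instead of panicking), and a recursive
mutex stands in for one initialized with MTX_RECURSE.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: create a mutex of the given type, lock it twice
 * from the same thread, and return the result of the second attempt.
 */
int
second_lock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, type);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    rc = pthread_mutex_lock(&m);    /* same thread locks again */
    if (rc == 0)
        pthread_mutex_unlock(&m);   /* undo the nested acquisition */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return (rc);
}
```

With PTHREAD_MUTEX_RECURSIVE the second lock succeeds and returns 0; with
PTHREAD_MUTEX_ERRORCHECK it fails with EDEADLK, which is roughly the
condition the panic below is reporting for tws_io_lock.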




 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
 pci0: ACPI PCI bus on pcib0
 pcib1: ACPI PCI-PCI bridge irq 16 at device 1.0 on pci0
 pci1: ACPI PCI bus on pcib1
 pcib2: ACPI PCI-PCI bridge irq 17 at device 1.1 on pci0
 pci2: ACPI PCI bus on pcib2
 LSI 3ware device driver for SAS/SATA storage controllers, version:
 10.80.00.003
 tws0: LSI 3ware SAS/SATA Storage Controller port 0x4000-0x40ff mem
 0xc246-0xc2463fff,0xc240-0xc243 irq 17 at device 0.0 on pci2
 tws0: Using legacy INTx
 panic: _mtx_lock_sleep: recursed on non-recursive mutex tws_io_lock @
 /usr/HEAD/src/sys/dev/tws/tws_hdm.c:287

 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
 kdb_backtrace() at kdb_backtrace+0x37
 panic() at panic+0x1d8
 _mtx_lock_sleep() at _mtx_lock_sleep+0x27f
 _mtx_lock_flags() at _mtx_lock_flags+0xf1
 tws_submit_command() at tws_submit_command+0x3f
 tws_dmamap_data_load_cbfn() at tws_dmamap_data_load_cbfn+0xb7
 bus_dmamap_load() at bus_dmamap_load+0x16c
 tws_map_request() at tws_map_request+0x78
 tws_get_param() at tws_get_param+0xe1
 tws_display_ctlr_info() at tws_display_ctlr_info+0x4c
 tws_init_ctlr() at tws_init_ctlr+0x6d
 tws_attach() at tws_attach+0x68c
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_pci_attach() at acpi_pci_attach+0x164
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_pcib_attach() at acpi_pcib_attach+0x1a7
 acpi_pcib_pci_attach() at acpi_pcib_pci_attach+0x9b
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_pci_attach() at acpi_pci_attach+0x164
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_pcib_attach() at acpi_pcib_attach+0x1a7
 acpi_pcib_acpi_attach() at acpi_pcib_acpi_attach+0x1f6
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_attach() at acpi_attach+0xbc1
 device_attach() at device_attach+0x72
 bus_generic_attach() at bus_generic_attach+0x1a
 nexus_acpi_attach() at nexus_acpi_attach+0x69
 device_attach() at device_attach+0x72
 bus_generic_new_pass() at bus_generic_new_pass+0xd6
 bus_set_pass() at bus_set_pass+0x7a
 configure() at configure+0xa
 mi_startup() at mi_startup+0x77
 btext() at btext+0x2c
 KDB: enter: panic
 [ thread pid 0 tid 10 ]
 Stopped at      kdb_enter+0x3b: movq    $0,0x993262(%rip)
 db>


 int
 tws_submit_command(struct tws_softc *sc, struct tws_request *req)
 {
     u_int32_t regl, regh;
     u_int64_t mfa=0;

     /*
      * mfa register read and write must be in order.
      * Get the io_lock to protect against simultaneous
      * passthru calls
      */
     mtx_lock(&sc->io_lock);

     if ( sc->obfl_q_overrun ) {
         tws_init_obfl_q(sc);
     }



 With no debugging in the kernel, it boots up fine

 pcib2: ACPI PCI-PCI bridge irq 17 at device 1.1 on pci0
 pci2: ACPI PCI bus on pcib2
 LSI 3ware device driver for SAS/SATA storage controllers, version:
 10.80.00.003
 tws0: LSI 3ware SAS/SATA Storage Controller port 0x4000-0x40ff mem
 0xc246-0xc2463fff,0xc240-0xc243 irq 17 at device 0.0 on pci2
 tws0: Using legacy INTx
 tws0: Controller details: Model 9750-4i, 8 Phys, Firmware FH9X
 5.12.00.007, BIOS BE9X 5.11.00.006
 em0: Intel(R) PRO/1000 Network Connection 7.3.2 port 0x5040-0x505f mem
 0xc250-0xc251,0xc257-0xc2570fff irq 19 at device 25.0 on pci0
 em0: Using an MSI interrupt
 em0: Ethernet address: 00:1e:67:45:b6:29
 ehci0: EHCI (generic) USB 2.0 controller mem 

Re: tws bug ? (LSI SAS 9750)

2012-09-21 Thread Jim Harris
On Fri, Sep 21, 2012 at 3:11 PM, Mike Tancsa <m...@sentex.net> wrote:
 On 9/21/2012 4:59 PM, Jim Harris wrote:

<snip>

 Thanks, that allows it to boot up now!

 pci2: ACPI PCI bus on pcib2
 LSI 3ware device driver for SAS/SATA storage controllers, version:
 10.80.00.003
 tws0: LSI 3ware SAS/SATA Storage Controller port 0x4000-0x40ff mem
 0xc246-0xc2463fff,0xc240-0xc243 irq 17 at device 0.0 on pci2
 tws0: Using MSI
 tws0: Controller details: Model 9750-4i, 8 Phys, Firmware FH9X
 5.12.00.007, BIOS BE9X 5.11.00.006
 em0: Intel(R) PRO/1000 Network Connection 7.3.2 port 0x5040-0x505f mem
 0xc250-0xc251,0xc257-0xc2570fff irq 19 at device 25.0 on pci0
 .
 then a lot of
 .
 (probe65:tws0:0:65:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe65:tws0:0:65:0): CAM status: Invalid Target ID
 (probe65:tws0:0:65:0): Error 22, Unretryable error
 (probe1:tws0:0:1:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe1:tws0:0:1:0): CAM status: Invalid Target ID
 (probe1:tws0:0:1:0): Error 22, Unretryable error
 (probe2:tws0:0:2:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe2:tws0:0:2:0): CAM status: Invalid Target ID
 .
 .
 .
 (probe63:tws0:0:63:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe63:tws0:0:63:0): CAM status: Invalid Target ID
 (probe63:tws0:0:63:0): Error 22, Unretryable error
 (probe64:tws0:0:64:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe64:tws0:0:64:0): CAM status: Invalid Target ID
 (probe64:tws0:0:64:0): Error 22, Unretryable error

These can be ignored.  CAM is just telling you that there are no
devices attached at these target IDs.

 da0 at tws0 bus 0 scbus0 target 0 lun 0
 da0: LSI 9750-4iDISK 5.12 Fixed Direct Access SCSI-5 device
 da0: 6000.000MB/s transfers
 da0: 953654MB (1953083392 512 byte sectors: 255H 63S/T 121573C)
 SMP: AP CPU #1 Launched!
 SMP: AP CPU #4 Launched!


<snip>


 Any thoughts on MSI vs. no MSI?  Time to run some stress tests.  It's
 certainly a fast little controller for the money!


Typically MSI is preferred to INTx for performance reasons.  I can't
speak for why the original author made INTx the default though.

Regards,

-Jim


Re: tws bug ? (LSI SAS 9750)

2012-09-21 Thread Jim Harris
On Fri, Sep 21, 2012 at 5:37 PM, Mike Tancsa <m...@sentex.net> wrote:
 On 9/21/2012 8:03 PM, Jim Harris wrote:
 .
 then a lot of
 .
 (probe65:tws0:0:65:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe65:tws0:0:65:0): CAM status: Invalid Target ID
 (probe65:tws0:0:65:0): Error 22, Unretryable error
 (probe1:tws0:0:1:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe1:tws0:0:1:0): CAM status: Invalid Target ID
 (probe1:tws0:0:1:0): Error 22, Unretryable error
 (probe2:tws0:0:2:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe2:tws0:0:2:0): CAM status: Invalid Target ID
 .
 .
 .
 (probe63:tws0:0:63:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe63:tws0:0:63:0): CAM status: Invalid Target ID
 (probe63:tws0:0:63:0): Error 22, Unretryable error
 (probe64:tws0:0:64:0): INQUIRY. CDB: 12 0 0 0 24 0
 (probe64:tws0:0:64:0): CAM status: Invalid Target ID
 (probe64:tws0:0:64:0): Error 22, Unretryable error

 These can be ignored.  CAM is just telling you that there are no
 devices attached at these target IDs.

 What about a change similar to what Alexander Motin did in

 http://lists.freebsd.org/pipermail/svn-src-head/2012-June/038196.html

Ah, yes.  I was thinking you had CAM_DEBUG enabled which is why you
were seeing this spew - but that's not the case.  This indeed should
be fixed and not just ignored.

Seeing the attributions on Alexander's commit, you certainly seem to
have a monopoly on controllers that exhibit this problem on FreeBSD.
:)

I believe the CAM_LUN_INVALID here should be fixed as well, similar to
the twa commit.  If you send me a revised patch I will commit it.
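
As a rough sketch of the status selection being discussed, assuming a
missing target should complete with CAM_SEL_TIMEOUT (as in Mike's diff
below) and a missing LUN on an existing target with CAM_DEV_NOT_THERE (as
in the arcmsr patch from the other thread): the sketch_* names and the
device table are hypothetical, and the enum values are assumed to mirror
sys/cam/cam.h.  This is not the tws code.

```c
#include <assert.h>

/* Subset of cam_status; values assumed to mirror sys/cam/cam.h. */
enum cam_status_subset {
    CAM_REQ_CMP       = 0x01,
    CAM_DEV_NOT_THERE = 0x08,
    CAM_SEL_TIMEOUT   = 0x0a
};

#define SKETCH_MAX_TARGETS 4    /* hypothetical controller limit */

/* Hypothetical table: number of LUNs behind each target (0 = no device). */
static const int sketch_luns[SKETCH_MAX_TARGETS] = { 1, 0, 2, 0 };

/* Pick the completion status for a probe of (target, lun). */
int
sketch_probe_status(int target, int lun)
{
    assert(target >= 0 && lun >= 0);

    if (target >= SKETCH_MAX_TARGETS || sketch_luns[target] == 0)
        return (CAM_SEL_TIMEOUT);       /* nothing answers at this target */
    if (lun >= sketch_luns[target])
        return (CAM_DEV_NOT_THERE);     /* target exists, LUN does not */
    return (CAM_REQ_CMP);               /* device is present */
}
```

The function returns a single status rather than OR-ing values into a field
that may already hold one, which sidesteps the kind of mix-up seen with
arcmsr(4).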

Thanks,

-Jim



 0(ich10)# diff -u tws_cam.c.orig tws_cam.c
 --- tws_cam.c.orig  2012-09-21 20:10:43.0 -0400
 +++ tws_cam.c   2012-09-21 20:11:11.0 -0400
 @@ -532,7 +532,7 @@
              ccb->ccb_h.status |= CAM_LUN_INVALID;
          } else {
              TWS_TRACE_DEBUG(sc, "invalid target error",0,0);
 -            ccb->ccb_h.status |= CAM_TID_INVALID;
 +            ccb->ccb_h.status |= CAM_SEL_TIMEOUT;
          }
 
      } else {
 1(ich10)#

 ---Mike


 --
 ---
 Mike Tancsa, tel +1 519 651 3400
 Sentex Communications, m...@sentex.net
 Providing Internet services since 1994 www.sentex.net
 Cambridge, Ontario Canada   http://www.tancsa.com/


Re: geom_virstor with kernel panic in FreeBSD 9.x

2012-08-03 Thread Jim Harris
On Fri, Aug 3, 2012 at 5:06 AM, Marcelo Gondim <gon...@bsdinfo.com.br> wrote:
 Hi all,

 I sent a PR [1] but I decided to also send the problem here.
 If you try to destroy a geom_virstor that does not exist, this causes a
 kernel panic immediately.

 Just try:

 gvirstor load
 gvirstor destroy tatata

 # uname -a
 FreeBSD zeus..xxx.br 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #27: Mon Jul
 16 01:41:24 BRT 2012 r...@zeus..xxx.br:/usr/obj/usr/src/sys/GONDIM
 amd64

 [1] http://www.freebsd.org/cgi/query-pr.cgi?pr=170199

 Best regards,
 Gondim


Hi Gondim,

Can you test the following patch?

Index: sys/geom/virstor/g_virstor.c
===
--- sys/geom/virstor/g_virstor.c(revision 238909)
+++ sys/geom/virstor/g_virstor.c(working copy)
@@ -235,6 +235,12 @@
         return;
     }
     sc = virstor_find_geom(cp, name);
+    if (sc == NULL) {
+        gctl_error(req, "Don't know anything about '%s'", name);
+        g_topology_unlock();
+        return;
+    }
+
     LOG_MSG(LVL_INFO, "Stopping %s by the userland command",
         sc->geom->name);
     update_metadata(sc);
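
The pattern in the patch, checking a lookup result before dereferencing it,
can be tried outside the kernel with a small sketch like the one below.  All
sketch_* names and the device table are hypothetical; the real code uses
virstor_find_geom() and gctl_error() as shown in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-device softc. */
struct sketch_softc {
    const char *name;
};

static struct sketch_softc sketch_devices[] = { { "data0" }, { "data1" } };

/* Look a device up by name; returns NULL when there is no such device. */
struct sketch_softc *
sketch_find(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(sketch_devices) / sizeof(sketch_devices[0]); i++)
        if (strcmp(sketch_devices[i].name, name) == 0)
            return (&sketch_devices[i]);
    return (NULL);
}

/* Returns 0 on success, -1 with a message in errbuf when name is unknown. */
int
sketch_destroy(const char *name, char *errbuf, size_t errlen)
{
    struct sketch_softc *sc = sketch_find(name);

    if (sc == NULL) {
        /* Bail out before anything dereferences sc. */
        snprintf(errbuf, errlen, "Don't know anything about '%s'", name);
        return (-1);
    }
    /* Only past this point is it safe to use sc->name. */
    snprintf(errbuf, errlen, "Stopping %s", sc->name);
    return (0);
}
```

Without the NULL check, the sc->name access is the userland counterpart of
the sc->geom->name dereference that the unpatched code reached with a NULL
sc.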

Thanks,

-Jim





Re: geom_virstor with kernel panic in FreeBSD 9.x

2012-08-03 Thread Jim Harris
On Fri, Aug 3, 2012 at 10:04 AM, Marcelo Gondim <gon...@bsdinfo.com.br> wrote:
 Hi Jim,

 When I applied the patch, it gave this error:

 # patch < /root/patch.diff
 Hmm...  Looks like a unified diff to me...
 The text leading up to this was:
 --

 |--- sys/geom/virstor/g_virstor.c(revision 238909)
 |+++ sys/geom/virstor/g_virstor.c(working copy)
 --
 Patching file sys/geom/virstor/g_virstor.c using Plan A...
 Hunk #1 failed at 235.
 1 out of 1 hunks failed--saving rejects to sys/geom/virstor/g_virstor.c.rej
 done




Strange.  It applies with no issues on my checkout.

Let's try an attachment instead.  If this doesn't work, could you
kindly apply the patch by hand?

Thanks,

-Jim


virstor.diff
Description: Binary data

Re: geom_virstor with kernel panic in FreeBSD 9.x

2012-08-03 Thread Jim Harris
On Fri, Aug 3, 2012 at 1:12 PM, Marcelo Gondim <gon...@bsdinfo.com.br> wrote:

 Hi Jim,

 Perfect!!!

 # gvirstor destroy tudo
 gvirstor: Don't know anything about 'tudo'




Patch applied to head as r239021.  I have requested approval from re@
to merge to stable/9.

Thank you for confirming the patch on your end.

Regards,

-Jim