Re: hpsa driver bug crack kernel down!
Hey David,

On Mon, Apr 14, 2014 at 05:03:51PM +, Woodhouse, David wrote:
> Jiang, if you can then let me have a copy with a signed-off-by I'll
> shepherd it upstream along with your other patch which is already in
> my iommu-2.6.git tree.

What is the state of these fixes? I plan to send out a pull-request
before easter and hoped to include these fixes as well.

Thanks,

	Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: hpsa driver bug crack kernel down!
On Wed, 2014-04-16 at 15:37 +0200, j...@8bytes.org wrote:
> Hey David,
>
> On Mon, Apr 14, 2014 at 05:03:51PM +, Woodhouse, David wrote:
>> Jiang, if you can then let me have a copy with a signed-off-by I'll
>> shepherd it upstream along with your other patch which is already in
>> my iommu-2.6.git tree.
>
> What is the state of these fixes? I plan to send out a pull-request
> before easter and hoped to include these fixes as well.

I'm travelling and was going to do some final testing and send out a
pull request after I got home tomorrow. But since you ask...

Please pull from git://git.infradead.org/iommu-2.6.git

David Woodhouse (1):
      iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges

Jiang Liu (2):
      iommu/vt-d: fix memory leakage caused by commit ea8ea46
      iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors

 drivers/iommu/dmar.c        |  3 ++-
 drivers/iommu/intel-iommu.c | 10 +++---
 2 files changed, 9 insertions(+), 4 deletions(-)

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
Re: hpsa driver bug crack kernel down!
On Wed, Apr 16, 2014 at 01:58:44PM +, Woodhouse, David wrote:
> On Wed, 2014-04-16 at 15:37 +0200, j...@8bytes.org wrote:
>> What is the state of these fixes? I plan to send out a pull-request
>> before easter and hoped to include these fixes as well.
>
> I'm travelling and was going to do some final testing and send out a
> pull request after I got home tomorrow. But since you ask...
>
> Please pull from git://git.infradead.org/iommu-2.6.git
>
> David Woodhouse (1):
>       iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges
>
> Jiang Liu (2):
>       iommu/vt-d: fix memory leakage caused by commit ea8ea46
>       iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors
>
>  drivers/iommu/dmar.c        |  3 ++-
>  drivers/iommu/intel-iommu.c | 10 +++---
>  2 files changed, 9 insertions(+), 4 deletions(-)

Pulled, thanks David. I will also do some additional testing before
sending it upstream.

	Joerg
Re: hpsa driver bug crack kernel down!
Hi Davidlohr,
	Thanks for the information!

According to lspci output, device :02:00.2 is HP ILO controller,
device :03:00.0 is RAID controller. Both ILO and RAID controllers need
to access reserved memory range [0x7f61e000 - 0x7f61] in physical mode.

According to dmesg output, BIOS has reserved memory and IOMMU has setup
1:1 mapping for ILO and RAID controller to access this range. Related
log messages as below:

BIOS-e820: [mem 0x7f61d000-0x8fff] reserved
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device :03:00.0 [0x7f61e000 - 0x7f61]
IOMMU: Setting identity map for device :02:00.0 [0x7f61e000 - 0x7f61]
IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]

From the screenshot, device :02:00.2 fails to access memory address
0x7f61e000. That indicates IOMMU driver fails to setup 1:1 mapping for
Reserved Memory Range for ILO controller.

So could you please help to check whether you could observe boot
messages like

IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]

with the failure kernel image? It would be great if boot messages could
be saved when failing to boot, so we could get more information from
log.

BTW, I have double checked related code, and still can't find a
reliable explanation for the regression:(

Thanks!
Gerry

On 2014/4/11 0:19, Davidlohr Bueso wrote:
> On Thu, 2014-04-10 at 08:46 +, Woodhouse, David wrote:
>> On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
>>> [+ David, VT-d maintainer ]
>>>
>>> Jiang, David, can you please have a look into this issue?
>>>
>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>> dmar: DRHD: handling fault status reg 602
>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>
>> That "Present bit in context entry is clear" fault means that we have
>> not set up *any* mappings for this PCI device… on this IOMMU.
>>
>>>> Yes, specifically (finally done bisecting):
>>>>
>>>> commit 2e45528930388658603ea24d49cf52867b928d3e
>>>> Author: Jiang Liu jiang@linux.intel.com
>>>> Date:   Wed Feb 19 14:07:36 2014 +0800
>>>>
>>>>     iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> This commit is about how we decide which IOMMU a given PCI device is
>> attached to.
>>
>> Thus, my first guess would be that we are quite happily setting up the
>> requested DMA maps on the *wrong* IOMMU, and then taking faults when
>> the device actually tries to do DMA.
>>
>> However, I'm not 100% convinced of that. The fault address looks
>> suspiciously like a true physical address, not a virtual bus address
>> of the type that we'd normally allocate for a dma_map_* operation.
>> Those would start at 0xf000 and work downwards, typically.
>>
>> Do you have 'iommu=pt' on the kernel command line?
>
> No.
>
>> Can I see the full dmesg as this system boots, and also a copy of the
>> DMAR table?
>
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... is there any debug config
> option I can enable that might give you more data?
Re: hpsa driver bug crack kernel down!
Hi all,
	I guess I found the root cause. It's a bug in matching device
scope, variable 'level' should be decreased when walking up PCI
topology. Could you please help to test following patch?

Thanks!
Gerry

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f445c10..1f8308c 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -152,7 +152,7 @@ dmar_alloc_pci_notify_info(struct pci_dev *dev, unsigned long event)
 	info->seg = pci_domain_nr(dev->bus);
 	info->level = level;
 	if (event == BUS_NOTIFY_ADD_DEVICE) {
-		for (tmp = dev, level--; tmp; tmp = tmp->bus->self) {
+		for (tmp = dev, level--; tmp; level--, tmp = tmp->bus->self) {
 			info->path[level].device = PCI_SLOT(tmp->devfn);
 			info->path[level].function = PCI_FUNC(tmp->devfn);
 			if (pci_is_root_bus(tmp->bus))

On 2014/4/11 0:19, Davidlohr Bueso wrote:
> On Thu, 2014-04-10 at 08:46 +, Woodhouse, David wrote:
>> On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
>>> [+ David, VT-d maintainer ]
>>>
>>> Jiang, David, can you please have a look into this issue?
>>>
>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>> dmar: DRHD: handling fault status reg 602
>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>
>> That "Present bit in context entry is clear" fault means that we have
>> not set up *any* mappings for this PCI device… on this IOMMU.
>>
>>>> Yes, specifically (finally done bisecting):
>>>>
>>>> commit 2e45528930388658603ea24d49cf52867b928d3e
>>>> Author: Jiang Liu jiang@linux.intel.com
>>>> Date:   Wed Feb 19 14:07:36 2014 +0800
>>>>
>>>>     iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> This commit is about how we decide which IOMMU a given PCI device is
>> attached to.
>>
>> Thus, my first guess would be that we are quite happily setting up the
>> requested DMA maps on the *wrong* IOMMU, and then taking faults when
>> the device actually tries to do DMA.
>>
>> However, I'm not 100% convinced of that. The fault address looks
>> suspiciously like a true physical address, not a virtual bus address
>> of the type that we'd normally allocate for a dma_map_* operation.
>> Those would start at 0xf000 and work downwards, typically.
>>
>> Do you have 'iommu=pt' on the kernel command line?
>
> No.
>
>> Can I see the full dmesg as this system boots, and also a copy of the
>> DMAR table?
>
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... is there any debug config
> option I can enable that might give you more data?
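To make the effect of the one-line change easier to see, here is a
minimal userspace sketch of the path-filling loop. It is not the kernel
code: the toy_* structures and functions are hypothetical stand-ins for
struct pci_dev and the notify-info path array, and the "fixed" flag only
exists so both behaviours can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only what the loop needs. */
struct toy_dev {
	unsigned char slot, func;
	struct toy_dev *parent;	/* upstream bridge, NULL on the root bus */
};

/* Hypothetical stand-in for one (device, function) hop of the path. */
struct toy_path { unsigned char device, function; };

/* Number of hops from dev up to the root bus, counting dev itself. */
static int toy_depth(const struct toy_dev *dev)
{
	int n = 0;

	for (; dev; dev = dev->parent)
		n++;
	return n;
}

/*
 * Fill path[0..level-1] so that path[0] is the hop nearest the root.
 * With fixed == 0, 'level' is decremented only once before the loop,
 * so every hop overwrites the last slot and the earlier slots are
 * never written. With fixed != 0, 'level' moves down one slot per hop,
 * which is what the patch does in the for-loop increment.
 */
static void toy_fill_path(const struct toy_dev *dev, struct toy_path *path,
			  int level, int fixed)
{
	const struct toy_dev *tmp;

	assert(level == toy_depth(dev));
	for (tmp = dev, level--; tmp; tmp = tmp->parent) {
		path[level].device = tmp->slot;
		path[level].function = tmp->func;
		if (fixed)
			level--;
	}
}
```

For a function 00.2 sitting behind a bridge at 1c.4, the fixed variant
yields the path (1c,04)(00,02), while the unfixed variant leaves the
first slot untouched and stores the bridge in the last slot, so a
comparison against a two-hop device scope entry cannot match.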
Re: hpsa driver bug crack kernel down!
Sorry for the delay, I've been having to take turns for this box.

On Fri, 2014-04-11 at 09:18 +, Woodhouse, David wrote:
> On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
>> Attaching a dmesg from one of the kernels that boots. It doesn't
>> appear to have much of the related information... is there any debug
>> config option I can enable that might give you more data?
>
> I'd like the contents of /sys/firmware/acpi/tables/DMAR please.

Attached is the disassembly of the raw output.

> And please could you also apply this patch to both the last-working
> and first-failing kernels and show me the output in both cases?

So I still cannot get around getting the info for the first failing
kernel, but below is for the last working.

Thanks.

Device 0:03:00.0 on IOMMU at a800
Device 0:03:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.0 [0x7f61e000 - 0x7f61]
Device 0:02:00.0 on IOMMU at a800
Device 0:02:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]
Device 0:02:00.2 on IOMMU at a800
Device 0:02:00.2 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.0 on IOMMU at a800
Device 0:00:1d.0 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.1 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.1 on IOMMU at a800
Device 0:00:1d.1 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.2 on IOMMU at a800
Device 0:00:1d.2 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.3 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.3 on IOMMU at a800
Device 0:00:1d.3 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.2 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.4 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.4 on IOMMU at a800
Device 0:02:00.4 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.7 [0x7f7ee000 - 0x7f7e]
Device 0:00:1d.7 on IOMMU at a800
Device 0:00:1d.7 on IOMMU at a800
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device :00:1f.0 [0x0 - 0xff]
Device 0:00:1f.0 on IOMMU at a800
Device 0:00:1f.0 on IOMMU at a800
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
Device 0:00:00.0 on IOMMU at a800
Device 0:00:01.0 on IOMMU at a800
Device 0:00:02.0 on IOMMU at a800
Device 0:00:03.0 on IOMMU at a800
Device 0:00:04.0 on IOMMU at a800
Device 0:00:05.0 on IOMMU at a800
Device 0:00:06.0 on IOMMU at a800
Device 0:00:07.0 on IOMMU at a800
Device 0:00:08.0 on IOMMU at a800
Device 0:00:09.0 on IOMMU at a800
Device 0:00:0a.0 on IOMMU at a800
Device 0:00:14.0 on IOMMU at a800
Device 0:00:1c.0 on IOMMU at a800
Device 0:00:1c.4 on IOMMU at a800
Device 0:00:1d.0 on IOMMU at a800
Device 0:00:1d.1 on IOMMU at a800
Device 0:00:1d.2 on IOMMU at a800
Device 0:00:1d.3 on IOMMU at a800
Device 0:00:1d.7 on IOMMU at a800
Device 0:00:1e.0 on IOMMU at a800
Device 0:00:1f.0 on IOMMU at a800
Device 0:04:00.0 on IOMMU at a800
Device 0:04:00.1 on IOMMU at a800
Device 0:04:00.2 on IOMMU at a800
Device 0:04:00.3 on IOMMU at a800
Device 0:03:00.0 on IOMMU at a800
Device 0:02:00.0 on IOMMU at a800
Device 0:02:00.2 on IOMMU at a800
Device 0:02:00.4 on IOMMU at a800
Device 0:01:03.0 on IOMMU at a800
Device 0:50:00.0 on IOMMU at ac00
Device 0:50:01.0 on IOMMU at ac00
Device 0:50:02.0 on IOMMU at ac00
Device 0:50:03.0 on IOMMU at ac00
Device 0:50:04.0 on IOMMU at ac00
Device 0:50:05.0 on IOMMU at ac00
Device 0:50:06.0 on IOMMU at ac00
Device 0:50:07.0 on IOMMU at ac00
Device 0:50:08.0 on IOMMU at ac00
Device 0:50:09.0 on IOMMU at ac00
Device 0:50:0a.0 on IOMMU at ac00
Device 0:50:14.0 on IOMMU at a800
Device 0:a0:00.0 on IOMMU at b000
Device 0:a0:01.0 on IOMMU at b000
Device 0:a0:02.0 on IOMMU at b000
Device 0:a0:03.0 on IOMMU at b000
Device 0:a0:04.0 on IOMMU at b000
Device 0:a0:05.0 on IOMMU at b000
Device 0:a0:06.0 on IOMMU at b000
Device 0:a0:07.0 on IOMMU at b000
Device 0:a0:08.0 on IOMMU at b000
Device 0:a0:09.0 on IOMMU at b000
Device 0:a0:0a.0 on IOMMU at b000
Device 0:a0:14.0 on IOMMU at a800
Device 0:7c:00.0 on IOMMU at a800
Device 0:7c:08.0 on IOMMU at a800
Device 0:82:00.0 on IOMMU at a800
Device 0:82:08.0 on IOMMU at a800

/*
 * Intel ACPI Component Architecture
 * AML Disassembler version 20140325-64 [Apr 11 2014]
 * Copyright (c) 2000 - 2014 Intel Corporation
 *
 * Disassembly of DMAR.raw, Fri Apr 11 09:10:10 2014
 *
 * ACPI Data Table [DMAR]
 *
 * Format:
Re: hpsa driver bug crack kernel down!
Hi Davidlohr,
	Thanks for providing the DMAR table. According to the DMAR
table, one bug in the iommu driver fails to handle this entry:

[1D2h 0466   1]      Device Scope Entry Type : 01
[1D3h 0467   1]                 Entry Length : 0A
[1D4h 0468   2]                     Reserved :
[1D6h 0470   1]               Enumeration ID : 00
[1D7h 0471   1]               PCI Bus Number : 00
[1D8h 0472   2]                     PCI Path : 1C,04
[1DAh 0474   2]                     PCI Path : 00,02

And the patch sent out by me should fix this bug. Could you please help
to have a try?

Thanks!
Gerry

On 2014/4/14 23:45, Davidlohr Bueso wrote:
> Sorry for the delay, I've been having to take turns for this box.
>
> On Fri, 2014-04-11 at 09:18 +, Woodhouse, David wrote:
>> On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
>>> Attaching a dmesg from one of the kernels that boots. It doesn't
>>> appear to have much of the related information... is there any debug
>>> config option I can enable that might give you more data?
>>
>> I'd like the contents of /sys/firmware/acpi/tables/DMAR please.
>
> Attached is the disassembly of the raw output.
>
>> And please could you also apply this patch to both the last-working
>> and first-failing kernels and show me the output in both cases?
>
> So I still cannot get around getting the info for the first failing
> kernel, but below is for the last working.
>
> Thanks.
>
> [...]
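As a reading aid for the device scope entry quoted in this message, the
following is a minimal sketch of how the Entry Length field determines
the number of PCI Path hops. It assumes the VT-d device scope layout of
a 6-byte header (type, length, 2 reserved bytes, enumeration ID, start
bus) followed by 2-byte (device, function) pairs. The function and type
names are hypothetical, and the reserved bytes, which are blank in the
dump above, are simply assumed to be zero in any test data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* type(1) + length(1) + reserved(2) + enumeration id(1) + start bus(1) */
#define SCOPE_HDR_LEN 6

/* One hop of the PCI path: a (device, function) byte pair. */
struct scope_hop { uint8_t device, function; };

/*
 * Return the number of path hops encoded in a device scope entry,
 * or -1 if the entry is truncated or its length field is inconsistent.
 * 'avail' is how many bytes of the table remain at 'entry'.
 */
static int scope_path_depth(const uint8_t *entry, size_t avail)
{
	size_t len;

	if (avail < SCOPE_HDR_LEN)
		return -1;
	len = entry[1];
	if (len < SCOPE_HDR_LEN || len > avail || (len - SCOPE_HDR_LEN) % 2)
		return -1;
	return (int)((len - SCOPE_HDR_LEN) / 2);
}

/* Return hop i of the path; hop 0 is the one on the start bus. */
static struct scope_hop scope_hop_at(const uint8_t *entry, int i)
{
	struct scope_hop hop;

	assert(i >= 0 && i < scope_path_depth(entry, entry[1]));
	hop.device = entry[SCOPE_HDR_LEN + 2 * i];
	hop.function = entry[SCOPE_HDR_LEN + 2 * i + 1];
	return hop;
}
```

With Entry Length 0A the depth is (10 - 6) / 2 = 2, i.e. a bridge hop
(1C,04) on bus 00 followed by the endpoint hop (00,02), which is the
multi-level case the matching code has to walk correctly.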
Re: hpsa driver bug crack kernel down!
On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
> Hi Davidlohr,
> 	Thanks for providing the DMAR table. According to the DMAR
> table, one bug in the iommu driver fails to handle this entry:
>
> [1D2h 0466   1]      Device Scope Entry Type : 01
> [1D3h 0467   1]                 Entry Length : 0A
> [1D4h 0468   2]                     Reserved :
> [1D6h 0470   1]               Enumeration ID : 00
> [1D7h 0471   1]               PCI Bus Number : 00
> [1D8h 0472   2]                     PCI Path : 1C,04
> [1DAh 0474   2]                     PCI Path : 00,02
>
> And the patch sent out by me should fix this bug. Could you please
> help to have a try?

Sorry, I am unable to find any patches from you regarding this issue...
I must be missing something. Could you please point me to the lkml link?

Thanks.
Re: hpsa driver bug crack kernel down!
On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
> On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
>> Hi Davidlohr,
>> 	Thanks for providing the DMAR table. According to the DMAR
>> table, one bug in the iommu driver fails to handle this entry:
>>
>> [1D2h 0466   1]      Device Scope Entry Type : 01
>> [1D3h 0467   1]                 Entry Length : 0A
>> [1D4h 0468   2]                     Reserved :
>> [1D6h 0470   1]               Enumeration ID : 00
>> [1D7h 0471   1]               PCI Bus Number : 00
>> [1D8h 0472   2]                     PCI Path : 1C,04
>> [1DAh 0474   2]                     PCI Path : 00,02
>>
>> And the patch sent out by me should fix this bug. Could you please
>> help to have a try?
>
> Sorry, I am unable to find any patches from you regarding this
> issue... I must be missing something. Could you please point me to the
> lkml link?

Never mind, I got it internally. I'll let you know as soon as I can
test it later today.
Re: hpsa driver bug crack kernel down!
On Mon, 2014-04-14 at 09:47 -0700, Davidlohr Bueso wrote:
> On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
>> On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
>>> Hi Davidlohr,
>>> 	Thanks for providing the DMAR table. According to the DMAR
>>> table, one bug in the iommu driver fails to handle this entry:
>>>
>>> [1D2h 0466   1]      Device Scope Entry Type : 01
>>> [1D3h 0467   1]                 Entry Length : 0A
>>> [1D4h 0468   2]                     Reserved :
>>> [1D6h 0470   1]               Enumeration ID : 00
>>> [1D7h 0471   1]               PCI Bus Number : 00
>>> [1D8h 0472   2]                     PCI Path : 1C,04
>>> [1DAh 0474   2]                     PCI Path : 00,02
>>>
>>> And the patch sent out by me should fix this bug. Could you please
>>> help to have a try?
>>
>> Sorry, I am unable to find any patches from you regarding this
>> issue... I must be missing something. Could you please point me to
>> the lkml link?
>
> Never mind, I got it internally. I'll let you know as soon as I can
> test it later today.

Thanks.

Jiang, if you can then let me have a copy with a signed-off-by I'll
shepherd it upstream along with your other patch which is already in my
iommu-2.6.git tree.

-- 
Sent with Evolution's ActiveSync support.

David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
Re: hpsa driver bug crack kernel down!
On Mon, 2014-04-14 at 16:57 +0800, Jiang Liu wrote:
> Hi all,
> 	I guess I found the root cause. It's a bug in matching device
> scope, variable 'level' should be decreased when walking up PCI
> topology. Could you please help to test following patch?
>
> Thanks!
> Gerry

Worked like a charm -- I no longer see all those DMAR messages and the
hpsa hard lockup is gone, thanks. Feel free to add my:

Reported-and-tested-by: Davidlohr Bueso davidl...@hp.com
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 17:17 -0600, Shuah Khan wrote:
> This smells very much like the problem that was solved couple of years
> ago for SI domain. It is likely that path is broken with the DMAR
> device scope array change. Please take a look to see if the following
> no longer occurs. Looks like BIOS could be expecting this RMRR to be
> still mapped.
>
> 	/*
> 	 * We want to prevent any device associated with an RMRR from
> 	 * getting placed into the SI Domain. This is done because
> 	 * problems exist when devices are moved in and out of domains
> 	 * and their respective RMRR info is lost. We exempt USB devices
> 	 * from this process due to their usage of RMRRs that are known
> 	 * to not be needed after BIOS hand-off to OS.
> 	 */
> 	if (device_has_rmrr(dev) &&
> 	    (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
> 		return 0;

Yeah, I'd be inclined to agree although I've tested with graphics
*since* these patches. That's another case where we need to preserve
the RMRR mapping after the driver takes over, and it *was* working.

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
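For anyone unfamiliar with the class test in the quoted snippet, here is
a minimal userspace sketch of just that one check. It is not the
kernel's identity-map policy, which has further conditions; the toy_*
names are hypothetical, and the only value taken from the kernel is that
PCI_CLASS_SERIAL_USB is 0x0c03 (base class 0x0c, subclass 0x03). The
24-bit class code also carries a programming-interface byte in its low
8 bits, which is why the snippet shifts right by 8 before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric value as the kernel's PCI_CLASS_SERIAL_USB. */
#define TOY_PCI_CLASS_SERIAL_USB 0x0c03

/*
 * Mirror of the quoted check only: a device with an RMRR is kept out
 * of the static identity (SI) domain unless it is a USB controller.
 * class24 is the 24-bit class code (base class, subclass, prog-if).
 */
static bool toy_may_use_si_domain(bool has_rmrr, uint32_t class24)
{
	if (has_rmrr && (class24 >> 8) != TOY_PCI_CLASS_SERIAL_USB)
		return false;
	return true;
}
```

So a RAID controller (class 0x0104xx) with an RMRR is excluded from the
SI domain, while a USB host controller (class 0x0c03xx) with an RMRR is
not excluded by this particular test.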
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... is there any debug config
> option I can enable that might give you more data?

I'd like the contents of /sys/firmware/acpi/tables/DMAR please.

And please could you also apply this patch to both the last-working and
first-failing kernels and show me the output in both cases?

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index dd576c0..d52ac03 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -683,6 +683,12 @@ static struct intel_iommu *device_to_iommu(int segment, u8 bus, u8 devfn)
 out:
 	rcu_read_unlock();
+	if (iommu)
+		printk("Device %x:%02x:%02x.%d on IOMMU at %llx\n", segment, bus,
+		       PCI_SLOT(devfn), PCI_FUNC(devfn), drhd->reg_base_addr);
+	else
+		printk("Device %x:%02x:%02x.%d on no IOMMU\n", segment, bus,
+		       PCI_SLOT(devfn), PCI_FUNC(devfn));
 	return iommu;
 }

-- 
Sent with Evolution's ActiveSync support.

David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
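To help read the "Device s:bb:dd.f" lines that this debug patch prints,
here is a small userspace sketch of how a devfn byte splits into slot
and function. The TOY_* macros are hypothetical local copies that follow
the usual encoding (slot in the upper five bits, function in the lower
three), and toy_format_bdf() is an illustrative helper, not a kernel
function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the usual devfn encoding: slot in bits 7..3, function in bits 2..0. */
#define TOY_PCI_SLOT(devfn)        (((devfn) >> 3) & 0x1f)
#define TOY_PCI_FUNC(devfn)        ((devfn) & 0x07)
#define TOY_PCI_DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/*
 * Format segment/bus/devfn the same way as the debug printk above,
 * without the "Device " prefix. Returns what snprintf() returns.
 */
static int toy_format_bdf(char *buf, size_t len, int segment,
			  unsigned char bus, unsigned char devfn)
{
	return snprintf(buf, len, "%x:%02x:%02x.%d", segment, bus,
			TOY_PCI_SLOT(devfn), TOY_PCI_FUNC(devfn));
}
```

For example, slot 0x1c function 4 encodes to devfn 0xe4, and formatting
it on segment 0, bus 0 gives "0:00:1c.4".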
Re: hpsa driver bug crack kernel down!
[+ David, VT-d maintainer ]

Jiang, David, can you please have a look into this issue?

Thanks,

	Joerg

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
>> [+cc Joerg, iommu list]
>>
>> On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
>>> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
>>>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>>>>> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
>>>>>> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
>>>>>>> [+linux-scsi]
>>>>>>>
>>>>>>> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
>>>>>>>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The kernel is 3.14.0+ which is pulled just now.
>>>>>>>>
>>>>>>>> Cc'ing more people.
>>>>>>>>
>>>>>>>> While the hpsa driver appears to be involved in some way, I'm
>>>>>>>> sure if this is a related issue, but as of today's pull I'm
>>>>>>>> getting another problem that causes my DL980 not to come up.
>>>>>>>>
>>>>>>>> *Massive* amounts of:
>>>>>>>>
>>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>>>>>>>
>>>>>>>> Then:
>>>>>>>>
>>>>>>>> hpsa :03:00.0: Controller lockup detected: 0x
>>>>>>>> ...
>>>>>>>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Screenshot of the actual LOCKUP:
>>>>>>>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>>>>>>>>
>>>>>>>> While I haven't bisected, things worked fine until at least
>>>>>>>> until commit 39de65aa2c3e (April 2nd).
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Well, it's either a DMA remapping issue or a hpsa one. Your
>>>>>>> assertion that everything worked fine until 39de65aa2c3e would
>>>>>>> tend to vindicate hpsa,
>>>>>
>>>>> Hmm here you mean DMA, right?
>>>>
>>>> No, it vindicates the hpsa changes ... they don't seem to be causing
>>>> problems until something goes wrong with dma remapping.
>>>>
>>>>>>> because all the hpsa changes went in before that under
>>>>>>
>>>>>> Missing crucial info:
>>>>>>
>>>>>>> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>>>>>>> Merge: 3e75c6d b2bff6c
>>>>>>> Author: Linus Torvalds torva...@linux-foundation.org
>>>>>>> Date:   Tue Apr 1 18:49:04 2014 -0700
>>>>>>>
>>>>>>>     Merge tag 'scsi-misc' of
>>>>>>>     git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>>>>>
>>>>>>> can you revalidate that this commit works OK just to make sure?
>>>>>
>>>>> Ok so I don't see those DMA messages and system starts just fine.
>>>>> I'm thinking perhaps something broke after the IO mmu stuff in
>>>>> commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be
>>>>> indirectly causing the CPU stalls and just blame hpsa in the path
>>>>> as a side effect? /me goes out to try the commit.
>>>>
>>>> That's my guess. The DMAR messages are DMA remapping issues caused
>>>> in the IOMMU. If I had to guess, I'd say the DMAR fault message is
>>>> indicating the IOMMU is calling for a mapping address before it can
>>>> satisfy the driver read request, which is causing the hang
>>>> apparently in the hpsa driver. I've added linux-pci to the cc; I
>>>> think they deal with iommu issues on x86.
>>>
>>> So that merge commit appears to be the culprit, I see both the DMA
>>> messages and the lockup blaming hpsa...
>>
>> My understanding so far (please correct me if I'm wrong):
>>
>> 39de65aa2c3e OK  (Merge branch 'i2c/for-next')
>> 1a0b6abaea78 OK  (Merge tag 'scsi-misc')
>> 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')
>
> Yes, specifically (finally done bisecting):
>
> commit 2e45528930388658603ea24d49cf52867b928d3e
> Author: Jiang Liu jiang@linux.intel.com
> Date:   Wed Feb 19 14:07:36 2014 +0800
>
>     iommu/vt-d: Unify the way to process DMAR device scope array
>
>     Now we have a PCI bus notification based mechanism to update DMAR
>     device scope array, we could extend the mechanism to support boot
>     time initialization too, which will help to unify and simplify
>     the implementation.
>
>     Signed-off-by: Jiang Liu jiang@linux.intel.com
>     Signed-off-by: Joerg Roedel j...@8bytes.org
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
> [+ David, VT-d maintainer ]
>
> Jiang, David, can you please have a look into this issue?

>> DMAR:[fault reason 02] Present bit in context entry is clear
>> dmar: DRHD: handling fault status reg 602
>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

That "Present bit in context entry is clear" fault means that we have
not set up *any* mappings for this PCI device… on this IOMMU.

>> Yes, specifically (finally done bisecting):
>>
>> commit 2e45528930388658603ea24d49cf52867b928d3e
>> Author: Jiang Liu jiang@linux.intel.com
>> Date:   Wed Feb 19 14:07:36 2014 +0800
>>
>>     iommu/vt-d: Unify the way to process DMAR device scope array

This commit is about how we decide which IOMMU a given PCI device is
attached to.

Thus, my first guess would be that we are quite happily setting up the
requested DMA maps on the *wrong* IOMMU, and then taking faults when
the device actually tries to do DMA.

However, I'm not 100% convinced of that. The fault address looks
suspiciously like a true physical address, not a virtual bus address of
the type that we'd normally allocate for a dma_map_* operation. Those
would start at 0xf000 and work downwards, typically.

Do you have 'iommu=pt' on the kernel command line?

Can I see the full dmesg as this system boots, and also a copy of the
DMAR table?

We should also rate-limit DMA faults, which would avoid the lockup
failure mode.

Bjorn, what should an IOMMU driver *do* when it detects that a device
is creating an endless stream of DMA faults and isn't aborting the
transaction?

I can set it to silent so that it just stops *reporting* the DMA faults
for that device... and I suppose I can re-enable them when I next see a
DMA mapping for it (although actually it'd be better to have a hook to
do that on FLR or something like that). But there must be a better
answer than that, surely? And I don't want to hack it up locally in
*one* specific IOMMU driver, any more than I have to.

On a POWER system with EEH, the kernel would end up isolating the
offending device completely, and subsequently resetting it...

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
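As an illustration of the "rate-limit DMA faults" idea raised in this
message, here is a minimal userspace sketch of an interval/burst
limiter. It is not the kernel's ratelimit implementation and not a
proposed patch: the toy_* names are hypothetical, time is passed in by
the caller as an abstract tick count, and the sketch only throttles
*reporting*; it does nothing to stop the device from faulting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: allow 'burst' reports per 'interval' ticks. */
struct toy_ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* reports allowed per window */
	uint64_t window_start;	/* tick at which the current window began */
	unsigned int printed;	/* reports emitted in the current window */
	unsigned int missed;	/* reports suppressed in the current window */
};

/*
 * Return true if a fault seen at tick 'now' should be reported.
 * A new window starts once 'interval' ticks have passed since the
 * current one began; the counters are reset at that point.
 */
static bool toy_ratelimit_ok(struct toy_ratelimit *rl, uint64_t now)
{
	assert(rl->interval > 0);
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

A caller would consult toy_ratelimit_ok() before logging each fault and
could print the 'missed' count when a new window opens, so the log still
shows that faults were being dropped.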
Re: hpsa driver bug crack kernel down!
On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> [+cc Joerg, iommu list]
>
> On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
>> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
>>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>>>> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
>>>>> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
>>>>>> [+linux-scsi]
>>>>>>
>>>>>> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
>>>>>>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The kernel is 3.14.0+ which is pulled just now.
>>>>>>>
>>>>>>> Cc'ing more people.
>>>>>>>
>>>>>>> While the hpsa driver appears to be involved in some way, I'm
>>>>>>> sure if this is a related issue, but as of today's pull I'm
>>>>>>> getting another problem that causes my DL980 not to come up.
>>>>>>>
>>>>>>> *Massive* amounts of:
>>>>>>>
>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>> hpsa :03:00.0: Controller lockup detected: 0x
>>>>>>> ...
>>>>>>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
>>>>>>> ...
>>>>>>>
>>>>>>> Screenshot of the actual LOCKUP:
>>>>>>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>>>>>>>
>>>>>>> While I haven't bisected, things worked fine until at least
>>>>>>> until commit 39de65aa2c3e (April 2nd).
>>>>>>>
>>>>>>> Any ideas?
>>>>>>
>>>>>> Well, it's either a DMA remapping issue or a hpsa one. Your
>>>>>> assertion that everything worked fine until 39de65aa2c3e would
>>>>>> tend to vindicate hpsa,
>>>>
>>>> Hmm here you mean DMA, right?
>>>
>>> No, it vindicates the hpsa changes ... they don't seem to be causing
>>> problems until something goes wrong with dma remapping.
>>>
>>>>>> because all the hpsa changes went in before that under
>>>>>
>>>>> Missing crucial info:
>>>>>
>>>>>> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>>>>>> Merge: 3e75c6d b2bff6c
>>>>>> Author: Linus Torvalds torva...@linux-foundation.org
>>>>>> Date:   Tue Apr 1 18:49:04 2014 -0700
>>>>>>
>>>>>>     Merge tag 'scsi-misc' of
>>>>>>     git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>>>>
>>>>>> can you revalidate that this commit works OK just to make sure?
>>>>
>>>> Ok so I don't see those DMA messages and system starts just fine.
>>>> I'm thinking perhaps something broke after the IO mmu stuff in
>>>> commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be
>>>> indirectly causing the CPU stalls and just blame hpsa in the path
>>>> as a side effect? /me goes out to try the commit.
>>>
>>> That's my guess. The DMAR messages are DMA remapping issues caused
>>> in the IOMMU. If I had to guess, I'd say the DMAR fault message is
>>> indicating the IOMMU is calling for a mapping address before it can
>>> satisfy the driver read request, which is causing the hang
>>> apparently in the hpsa driver. I've added linux-pci to the cc; I
>>> think they deal with iommu issues on x86.
>>
>> So that merge commit appears to be the culprit, I see both the DMA
>> messages and the lockup blaming hpsa...
>
> My understanding so far (please correct me if I'm wrong):
>
> 39de65aa2c3e OK  (Merge branch 'i2c/for-next')
> 1a0b6abaea78 OK  (Merge tag 'scsi-misc')
> 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')

Yes, specifically (finally done bisecting):

commit 2e45528930388658603ea24d49cf52867b928d3e
Author: Jiang Liu jiang@linux.intel.com
Date:   Wed Feb 19 14:07:36 2014 +0800

    iommu/vt-d: Unify the way to process DMAR device scope array

    Now we have a PCI bus notification based mechanism to update DMAR
    device scope array, we could extend the mechanism to support boot
    time initialization too, which will help to unify and simplify
    the implementation.

    Signed-off-by: Jiang Liu jiang@linux.intel.com
    Signed-off-by: Joerg Roedel j...@8bytes.org
Re: hpsa driver bug crack kernel down!
On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David david.woodho...@intel.com wrote: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 That Present bit in context entry is clear fault means that we have not set up *any* mappings for this PCI device… on this IOMMU. Yes, specifically (finally done bisecting): commit 2e45528930388658603ea24d49cf52867b928d3e Author: Jiang Liu jiang@linux.intel.com Date: Wed Feb 19 14:07:36 2014 +0800 iommu/vt-d: Unify the way to process DMAR device scope array This commit is about how we decide which IOMMU a given PCI device is attached to. Thus, my first guess would be that we are quite happily setting up the requested DMA maps on the *wrong* IOMMU, and then taking faults when the device actually tries to do DMA. However, I'm not 100% convinced of that. The fault address looks suspiciously like a true physical address, not a virtual bus address of the type that we'd normally allocate for a dma_map_* operation. Those would start at 0xf000 and work downwards, typically. I like the wrong IOMMU (or no IOMMU at all) theory. If we didn't connect the device with an IOMMU at all, that would explain the device DMAing directly to a physical address, wouldn't it? Do you have 'iommu=pt' on the kernel command line? Can I see the full dmesg as this system boots, and also a copy of the DMAR table? We should also rate-limit DMA faults, which would avoid the lockup failure mode. Bjorn, what should an IOMMU driver *do* when it detects that a device is creating an endless stream of DMA faults and isn't aborting the transaction? You mentioned that POWER with EEH does something intelligent in this case, but I'm not familiar with that code. We have AER support, which can result in resetting a device, but I think DMA faults are reported differently, and I don't think there's any nice existing way for PCI to deal with them. 
Maybe there should be, though. Bjorn
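Editorial sketch of the "rate-limit DMA faults" idea raised above. This is not the kernel's implementation; the struct and function names are hypothetical and time is in whatever unit the caller picks. It allows a fixed burst of reports per window and counts the rest as suppressed, so a device that faults endlessly cannot flood the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limiter state: at most `burst` reports per `interval` ticks. */
struct fault_limiter {
    uint64_t interval;     /* window length, in caller-defined time units */
    unsigned burst;        /* reports allowed per window */
    uint64_t window_start; /* start time of the current window */
    unsigned printed;      /* reports emitted in the current window */
    unsigned suppressed;   /* reports dropped in the current window */
};

/* Returns true if a fault observed at time `now` should be reported. */
static bool fault_limiter_allow(struct fault_limiter *fl, uint64_t now)
{
    assert(fl != NULL);

    if (now - fl->window_start >= fl->interval) {
        /* The window has expired: start a new one and reset the counters. */
        fl->window_start = now;
        fl->printed = 0;
        fl->suppressed = 0;
    }
    if (fl->printed < fl->burst) {
        fl->printed++;
        return true;
    }
    /* Over budget for this window: count the fault but do not report it. */
    fl->suppressed++;
    return false;
}
```

A fault handler would call fault_limiter_allow() before printing each record and could log the suppressed count when a new window opens. Limiting the reports only keeps the log usable; it does not stop the device from faulting, which is the separate question put to Bjorn above.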
Re: hpsa driver bug crack kernel down!
On 4/10/2014 11:14 AM, Bjorn Helgaas wrote: On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David david.woodho...@intel.com wrote: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 That Present bit in context entry is clear fault means that we have not set up *any* mappings for this PCI device… on this IOMMU. Yes, specifically (finally done bisecting): commit 2e45528930388658603ea24d49cf52867b928d3e Author: Jiang Liu jiang@linux.intel.com Date: Wed Feb 19 14:07:36 2014 +0800 iommu/vt-d: Unify the way to process DMAR device scope array This commit is about how we decide which IOMMU a given PCI device is attached to. Thus, my first guess would be that we are quite happily setting up the requested DMA maps on the *wrong* IOMMU, and then taking faults when the device actually tries to do DMA. However, I'm not 100% convinced of that. The fault address looks suspiciously like a true physical address, not a virtual bus address of the type that we'd normally allocate for a dma_map_* operation. Those would start at 0xf000 and work downwards, typically. I like the wrong IOMMU (or no IOMMU at all) theory. If we didn't connect the device with an IOMMU at all, that would explain the device DMAing directly to a physical address, wouldn't it? Do you have 'iommu=pt' on the kernel command line? Can I see the full dmesg as this system boots, and also a copy of the DMAR table? This will be really helpful information. This box has devices with RMRR records and if they're not set up correctly, DMAR faults can occur. We should also rate-limit DMA faults, which would avoid the lockup failure mode. Bjorn, what should an IOMMU driver *do* when it detects that a device is creating an endless stream of DMA faults and isn't aborting the transaction? You mentioned that POWER with EEH does something intelligent in this case, but I'm not familiar with that code. 
We have AER support, which can result in resetting a device, but I think DMA faults are reported differently, and I don't think there's any nice existing way for PCI to deal with them. Maybe there should be, though. Bjorn
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote: Thus, my first guess would be that we are quite happily setting up the requested DMA maps on the *wrong* IOMMU, and then taking faults when the device actually tries to do DMA. I like the wrong IOMMU (or no IOMMU at all) theory. If we didn't connect the device with an IOMMU at all, that would explain the device DMAing directly to a physical address, wouldn't it? An unlikely failure mode. We're much more likely to see *wrong* IOMMU than no IOMMU. And thus we'd still see the distinctive virtual addresses just below 4GiB. However, Rob's answer may solve that puzzle. If this is one of those abominations where the device continues to do DMA to system memory even after the OS is up and running and *thinks* it has control of the hardware, then the offending address will be listed in an RMRR entry (which tells the OS to set up a 1:1 mapping for access to certain memory ranges for a given device). And will be inside an E820 reserved region. A little odd that such an error would trigger only when we're actually trying to initialise the device from the Linux driver, not as soon as we enable the IOMMU. But all things are possible. But the DMAR table and dmesg that I asked for would give us a bit more information and hopefully let us stop speculating... We should also rate-limit DMA faults, which would avoid the lockup failure mode. Bjorn, what should an IOMMU driver *do* when it detects that a device is creating an endless stream of DMA faults and isn't aborting the transaction? You mentioned that POWER with EEH does something intelligent in this case, but I'm not familiar with that code. We have AER support, which can result in resetting a device, but I think DMA faults are reported differently, and I don't think there's any nice existing way for PCI to deal with them. Maybe there should be, though. Quite frankly, I don't care how *you* deal with them, or even if you can. 
All I want to know is how I tell you about the problem, because *I* sure as hell don't want to be trying to deal with it in the IOMMU code. That's a generic PCI layer thing. :) -- David Woodhouse Open Source Technology Centre david.woodho...@intel.com Intel Corporation
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 16:34 +0800, Jiang Liu wrote: Hi Baoquan, Could you please help to give output of lspci -? Attached. Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)? It may have relationship with IOMMU driver. I honestly don't know. PCI is way out of my area of knowledge. 00:00.0 Host bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port (rev 22) Subsystem: Hewlett-Packard Company Device 330b Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Capabilities: access denied 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=0e, subordinate=10, sec-latency=0 I/O behind bridge: f000-0fff Memory behind bridge: fff0-000f Prefetchable memory behind bridge: fff0-000f Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: access denied Kernel driver in use: pcieport Kernel modules: shpchp 00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 (rev 22) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=14, subordinate=14, sec-latency=0 I/O behind bridge: f000-0fff Memory behind bridge: fff0-000f Prefetchable memory behind bridge: fff0-000f Secondary status: 66MHz- 
FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: access denied Kernel driver in use: pcieport Kernel modules: shpchp 00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=04, subordinate=04, sec-latency=0 I/O behind bridge: f000-0fff Memory behind bridge: 9000-99ff Prefetchable memory behind bridge: fff0-000f Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR- BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: access denied Kernel driver in use: pcieport Kernel modules: shpchp 00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 (rev 22) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=15, subordinate=15, sec-latency=0 I/O behind bridge: f000-0fff Memory behind bridge: fff0-000f Prefetchable memory behind bridge: fff0-000f Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: access denied Kernel driver in use: pcieport Kernel modules: shpchp 00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22) (prog-if 00 [Normal decode]) 
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=11, subordinate=13,
Re: hpsa driver bug crack kernel down!
On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:

dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

That "Present bit in context entry is clear" fault means that we have not set up *any* mappings for this PCI device… on this IOMMU.

Yes, specifically (finally done bisecting): commit 2e45528930388658603ea24d49cf52867b928d3e Author: Jiang Liu jiang@linux.intel.com Date: Wed Feb 19 14:07:36 2014 +0800 iommu/vt-d: Unify the way to process DMAR device scope array

This commit is about how we decide which IOMMU a given PCI device is attached to. Thus, my first guess would be that we are quite happily setting up the requested DMA maps on the *wrong* IOMMU, and then taking faults when the device actually tries to do DMA. However, I'm not 100% convinced of that. The fault address looks suspiciously like a true physical address, not a virtual bus address of the type that we'd normally allocate for a dma_map_* operation. Those would start at 0xf000 and work downwards, typically.

Do you have 'iommu=pt' on the kernel command line?

No.

Can I see the full dmesg as this system boots, and also a copy of the DMAR table?

Attaching a dmesg from one of the kernels that boots. It doesn't appear to have much of the related information...

It shows us that the address 0x7f61e000 is in an E820-reserved region, and that there's an RMRR covering that region for an unspecified PCI device, but that's going to be the hpsa. So if it isn't just a simple case of us assigning this device to the wrong IOMMU, *perhaps* it's that we lose the RMRR when the driver takes control of the device. RMRRs are generally expected to be a boot-time thing, for things like legacy keyboard/mouse emulation via USB. Using them while the system is *active* is... horrid. We've often not quite handled that right.
-- David Woodhouse Open Source Technology Centre david.woodho...@intel.com Intel Corporation
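Editorial sketch of the check described above: does a faulting address fall inside a reserved region listed by an RMRR? The struct is a simplified, hypothetical stand-in for an RMRR entry (an inclusive base/end pair). It is not the ACPI table layout, and any ranges passed to it here would be illustrative rather than taken from the attached dmesg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one RMRR entry: an inclusive range. */
struct rmrr_range {
    uint64_t base; /* first byte of the reserved region */
    uint64_t end;  /* last byte of the reserved region (inclusive) */
};

/* Returns the index of the first range containing addr, or -1 if none does. */
static int rmrr_find(const struct rmrr_range *ranges, size_t n, uint64_t addr)
{
    assert(ranges != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (addr >= ranges[i].base && addr <= ranges[i].end)
            return (int)i;
    }
    return -1;
}
```

If the fault address 0x7f61e000 from the log matches an entry, the device is doing DMA to memory the firmware asked to keep identity-mapped, which fits the "we lose the RMRR" theory better than the "wrong IOMMU" one.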
Re: hpsa driver bug crack kernel down!
On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote: [+cc Joerg, iommu list] On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote: On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote: [+linux-scsi] On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote: Hi, The kernel is 3.14.0+ which is pulled just now. Cc'ing more people. While the hpsa driver appears to be involved in some way, I'm sure if this is a related issue, but as of today's pull I'm getting another problem that causes my DL980 not to come up. *Massive* amounts of: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 Then: hpsa :03:00.0: Controller lockup detected: 0x ... Workqueue: events hpsa_monitor_ctlr_worker [hpsa] ... Screenshot of the actual LOCKUP: http://stgolabs.net/hpsa-hard-lockup-3.14+.png While I haven't bisected, things worked fine until at least until commit 39de65aa2c3e (April 2nd). Any ideas? Well, it's either a DMA remapping issue or a hpsa one. Your assertion that everything worked fine until 39de65aa2c3e would tend to vindicate hpsa, Hmm here you mean DMA, right? No, it vindicates the hpsa changes ... they don't seem to be causing problems until something goes wrong with dma remapping. 
because all the hpsa changes went in before that under Missing crucial info: commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1 Merge: 3e75c6d b2bff6c Author: Linus Torvalds torva...@linux-foundation.org Date: Tue Apr 1 18:49:04 2014 -0700 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi can you revalidate that this commit works OK just to make sure? Ok so I don't see those DMA messages and system starts just fine. I'm thinking perhaps something broke after the IO mmu stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly causing the CPU stalls and just blame hpsa in the path as a side effect? /me goes out to try the commit. That's my guess. The DMAR messages are DMA remapping issues caused in the IOMMU. If I had to guess, I'd say the DMAR fault message is indicating the IOMMU is calling for a mapping address before it can satisfy the driver read request, which is causing the hang apparently in the hpsa driver. I've added linux-pci to the cc; I think they deal with iommu issues on x86. So that merge commit appears to be the culprit, I see both the DMA messages and the lockup blaming hpsa... My understanding so far (please correct me if I'm wrong): 39de65aa2c3e OK (Merge branch 'i2c/for-next') 1a0b6abaea78 OK (Merge tag 'scsi-misc') ^^^ this one, 1a0b6abaea78, did not work for me, crashing in hpsa_enter_performant mode() which was surprsing to me as I am pretty sure I tried on this very same machine I'm using now (DL360p with P420, P430 and P420i) with 3.14-rc-something plus all the hpsa patches that I thought were merged in. 
But now I am seeing: [a0002bd0] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa] RSP: 0018:88042c515a78 EFLAGS: 00010297 RAX: RBX: 88042c65 RCX: 0004 RDX: RSI: 0001 RDI: RBP: 88042c515b48 R08: R09: 8af03cc0 R10: R11: 0001 R12: 88042c515a98 R13: 6104 R14: 88042c515ad8 R15: a0001630 FS: 7f86f7a38700() GS:88043f56() knlGS: CS: 0010 DS: ES: CR0: 80050033 usb 1-1.6: new low-speed USB device number 3 using ehci-pci CR2: CR3: 00042c4c3000 CR4: 000407e0 Stack: 8024 a0c0 abe0 00060005 00080007 000a0009 000c000b 000e000d 001f 00120011 00040013 Call Trace: [a0c0] ? SA5_fifo_full+0x20/0x20 [hpsa] [abe0] ? SA5_ioaccel_mode1_completed+0xd0/0xd0 [hpsa] [a000aab6] hpsa_put_ctlr_into_performant_mode+0x186/0x320 [hpsa] [a0005132] ? hpsa_allocate_sg_chain_blocks+0xa2/0xd0 [hpsa] [a000b08b]
Re: hpsa driver bug crack kernel down!
On Thu, Apr 10, 2014 at 2:45 PM, scame...@beardog.cce.hp.com wrote:

3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')

Yes, specifically (finally done bisecting): commit 2e45528930388658603ea24d49cf52867b928d3e Author: Jiang Liu jiang@linux.intel.com Date: Wed Feb 19 14:07:36 2014 +0800 iommu/vt-d: Unify the way to process DMAR device scope array Now we have a PCI bus notification based mechanism to update DMAR device scope array, we could extend the mechanism to support boot time initialization too, which will help to unify and simplify the implementation. Signed-off-by: Jiang Liu jiang@linux.intel.com Signed-off-by: Joerg Roedel j...@8bytes.org

My git bisect appears to be converging on something else, something within the hpsa patches that I sent up recently, unfortunately for me. Will let you all know when it converges.

This smells very much like the problem that was solved a couple of years ago for the SI domain. It is likely that path is broken with the DMAR device scope array change. Please take a look to see if the following no longer occurs. Looks like BIOS could be expecting this RMRR to be still mapped.

    /*
     * We want to prevent any device associated with an RMRR from
     * getting placed into the SI Domain. This is done because
     * problems exist when devices are moved in and out of domains
     * and their respective RMRR info is lost. We exempt USB devices
     * from this process due to their usage of RMRRs that are known
     * to not be needed after BIOS hand-off to OS.
     */
    if (device_has_rmrr(dev) &&
        (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
            return 0;

-- Shuah
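Editorial sketch of the condition in the snippet quoted above, as a standalone program fragment. The struct and function names are hypothetical stand-ins for the kernel's device structures; only the class comparison is meant to mirror the quoted check. A PCI class value packs base class, subclass and programming interface into 24 bits, so shifting right by 8 leaves the 16-bit base class and subclass, and 0x0c03 is the serial-bus/USB value the kernel names PCI_CLASS_SERIAL_USB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CLASS_SERIAL_USB 0x0c03u /* same value as PCI_CLASS_SERIAL_USB */

/* Hypothetical stand-in for the two facts the quoted check consults. */
struct fake_dev {
    uint32_t class_code; /* base class << 16 | subclass << 8 | prog-if */
    bool has_rmrr;       /* true if an RMRR names this device */
};

/* True means: keep this device out of the static identity (SI) domain. */
static bool rmrr_blocks_identity_map(const struct fake_dev *d)
{
    assert(d != NULL);
    return d->has_rmrr && (d->class_code >> 8) != CLASS_SERIAL_USB;
}
```

Under this reading, a storage controller with an RMRR (for example class 0x0104xx, RAID) is kept out of the SI domain, while a USB host controller with an RMRR is exempt. If the device scope rework stopped device_has_rmrr() from matching the controller, the check would silently stop applying to it, which is the regression Shuah asks people to look for.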
Re: hpsa driver bug crack kernel down!
On 04/10/14 at 04:34pm, Jiang Liu wrote: Hi Baoquan, Could you please help to give output of lspci -? Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)? It may have relationship with IOMMU driver. Thanks! Gerry Hi, I just saw your mail now. Do you still need the output of lspci - on my test machine? In fact, I didn't see the DMAR error related to intel vt-d issues. If the output is helpful, I can make a latest build to do this. Thanks Baoquan On 2014/4/10 12:03, Bjorn Helgaas wrote: [+cc Joerg, iommu list] On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote: On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote: [+linux-scsi] On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote: Hi, The kernel is 3.14.0+ which is pulled just now. Cc'ing more people. While the hpsa driver appears to be involved in some way, I'm sure if this is a related issue, but as of today's pull I'm getting another problem that causes my DL980 not to come up. *Massive* amounts of: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 Then: hpsa :03:00.0: Controller lockup detected: 0x ... Workqueue: events hpsa_monitor_ctlr_worker [hpsa] ... Screenshot of the actual LOCKUP: http://stgolabs.net/hpsa-hard-lockup-3.14+.png While I haven't bisected, things worked fine until at least until commit 39de65aa2c3e (April 2nd). Any ideas? Well, it's either a DMA remapping issue or a hpsa one. Your assertion that everything worked fine until 39de65aa2c3e would tend to vindicate hpsa, Hmm here you mean DMA, right? No, it vindicates the hpsa changes ... they don't seem to be causing problems until something goes wrong with dma remapping. 
because all the hpsa changes went in before that under Missing crucial info: commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1 Merge: 3e75c6d b2bff6c Author: Linus Torvalds torva...@linux-foundation.org Date: Tue Apr 1 18:49:04 2014 -0700 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi can you revalidate that this commit works OK just to make sure? Ok so I don't see those DMA messages and system starts just fine. I'm thinking perhaps something broke after the IO mmu stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly causing the CPU stalls and just blame hpsa in the path as a side effect? /me goes out to try the commit. That's my guess. The DMAR messages are DMA remapping issues caused in the IOMMU. If I had to guess, I'd say the DMAR fault message is indicating the IOMMU is calling for a mapping address before it can satisfy the driver read request, which is causing the hang apparently in the hpsa driver. I've added linux-pci to the cc; I think they deal with iommu issues on x86. So that merge commit appears to be the culprit, I see both the DMA messages and the lockup blaming hpsa... My understanding so far (please correct me if I'm wrong): 39de65aa2c3e OK (Merge branch 'i2c/for-next') 1a0b6abaea78 OK (Merge tag 'scsi-misc') 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15') -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: hpsa driver bug crack kernel down!
On 04/10/14 at 04:34pm, Jiang Liu wrote: Hi Baoquan, Could you please help to give output of lspci -? Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)? It may have relationship with IOMMU driver. Thanks! Gerry

Well, the machine the bug was reported on is an AMD machine, and it doesn't have the IOMMU problem. David saw some DMAR errors, so his should be an Intel machine, which uses VT-d.

On 2014/4/10 12:03, Bjorn Helgaas wrote: [+cc Joerg, iommu list] On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote: On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote: [+linux-scsi] On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote: Hi, The kernel is 3.14.0+ which is pulled just now. Cc'ing more people. While the hpsa driver appears to be involved in some way, I'm sure if this is a related issue, but as of today's pull I'm getting another problem that causes my DL980 not to come up. *Massive* amounts of: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 Then: hpsa :03:00.0: Controller lockup detected: 0x ... Workqueue: events hpsa_monitor_ctlr_worker [hpsa] ... Screenshot of the actual LOCKUP: http://stgolabs.net/hpsa-hard-lockup-3.14+.png While I haven't bisected, things worked fine until at least until commit 39de65aa2c3e (April 2nd). Any ideas? Well, it's either a DMA remapping issue or a hpsa one. Your assertion that everything worked fine until 39de65aa2c3e would tend to vindicate hpsa, Hmm here you mean DMA, right? No, it vindicates the hpsa changes ... they don't seem to be causing problems until something goes wrong with dma remapping.
because all the hpsa changes went in before that under Missing crucial info: commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1 Merge: 3e75c6d b2bff6c Author: Linus Torvalds torva...@linux-foundation.org Date: Tue Apr 1 18:49:04 2014 -0700 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi can you revalidate that this commit works OK just to make sure? Ok so I don't see those DMA messages and system starts just fine. I'm thinking perhaps something broke after the IO mmu stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly causing the CPU stalls and just blame hpsa in the path as a side effect? /me goes out to try the commit. That's my guess. The DMAR messages are DMA remapping issues caused in the IOMMU. If I had to guess, I'd say the DMAR fault message is indicating the IOMMU is calling for a mapping address before it can satisfy the driver read request, which is causing the hang apparently in the hpsa driver. I've added linux-pci to the cc; I think they deal with iommu issues on x86. So that merge commit appears to be the culprit, I see both the DMA messages and the lockup blaming hpsa... My understanding so far (please correct me if I'm wrong): 39de65aa2c3e OK (Merge branch 'i2c/for-next') 1a0b6abaea78 OK (Merge tag 'scsi-misc') 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15') -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: hpsa driver bug crack kernel down!
[+cc Joerg, iommu list] On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote: On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote: On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote: [+linux-scsi] On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote: On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote: Hi, The kernel is 3.14.0+ which is pulled just now. Cc'ing more people. While the hpsa driver appears to be involved in some way, I'm sure if this is a related issue, but as of today's pull I'm getting another problem that causes my DL980 not to come up. *Massive* amounts of: DMAR:[fault reason 02] Present bit in context entry is clear dmar: DRHD: handling fault status reg 602 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000 Then: hpsa :03:00.0: Controller lockup detected: 0x ... Workqueue: events hpsa_monitor_ctlr_worker [hpsa] ... Screenshot of the actual LOCKUP: http://stgolabs.net/hpsa-hard-lockup-3.14+.png While I haven't bisected, things worked fine until at least until commit 39de65aa2c3e (April 2nd). Any ideas? Well, it's either a DMA remapping issue or a hpsa one. Your assertion that everything worked fine until 39de65aa2c3e would tend to vindicate hpsa, Hmm here you mean DMA, right? No, it vindicates the hpsa changes ... they don't seem to be causing problems until something goes wrong with dma remapping. because all the hpsa changes went in before that under Missing crucial info: commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1 Merge: 3e75c6d b2bff6c Author: Linus Torvalds torva...@linux-foundation.org Date: Tue Apr 1 18:49:04 2014 -0700 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi can you revalidate that this commit works OK just to make sure? Ok so I don't see those DMA messages and system starts just fine. 
I'm thinking perhaps something broke after the IO mmu stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly causing the CPU stalls and just blame hpsa in the path as a side effect? /me goes out to try the commit. That's my guess. The DMAR messages are DMA remapping issues caused in the IOMMU. If I had to guess, I'd say the DMAR fault message is indicating the IOMMU is calling for a mapping address before it can satisfy the driver read request, which is causing the hang apparently in the hpsa driver. I've added linux-pci to the cc; I think they deal with iommu issues on x86. So that merge commit appears to be the culprit, I see both the DMA messages and the lockup blaming hpsa... My understanding so far (please correct me if I'm wrong): 39de65aa2c3e OK (Merge branch 'i2c/for-next') 1a0b6abaea78 OK (Merge tag 'scsi-misc') 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15') ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu