Re: hpsa driver bug crack kernel down!

2014-04-16 Thread j...@8bytes.org
Hey David,

On Mon, Apr 14, 2014 at 05:03:51PM +, Woodhouse, David wrote:
 Jiang, if you can then let me have a copy with a signed-off-by I'll
 shepherd it upstream along with your other patch which is already in my
 iommu-2.6.git tree.

What is the state of these fixes? I plan to send out a pull-request
before Easter and hoped to include these fixes as well.

Thanks,

Joerg


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: hpsa driver bug crack kernel down!

2014-04-16 Thread Woodhouse, David
On Wed, 2014-04-16 at 15:37 +0200, j...@8bytes.org wrote:
 Hey David,
 
 On Mon, Apr 14, 2014 at 05:03:51PM +, Woodhouse, David wrote:
  Jiang, if you can then let me have a copy with a signed-off-by I'll
  shepherd it upstream along with your other patch which is already in my
  iommu-2.6.git tree.
 
 What is the state of these fixes? I plan to send out a pull-request
 before Easter and hoped to include these fixes as well.

I'm travelling and was going to do some final testing and send out a
pull request after I got home tomorrow. But since you ask...

Please pull from
git://git.infradead.org/iommu-2.6.git

David Woodhouse (1):
  iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges

Jiang Liu (2):
  iommu/vt-d: fix memory leakage caused by commit ea8ea46
  iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors

 drivers/iommu/dmar.c        |  3 ++-
 drivers/iommu/intel-iommu.c | 10 +++---
 2 files changed, 9 insertions(+), 4 deletions(-)



-- 
David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-16 Thread j...@8bytes.org
On Wed, Apr 16, 2014 at 01:58:44PM +, Woodhouse, David wrote:
 On Wed, 2014-04-16 at 15:37 +0200, j...@8bytes.org wrote:
  What is the state of these fixes? I plan to send out a pull-request
  before Easter and hoped to include these fixes as well.
 
 I'm travelling and was going to do some final testing and send out a
 pull request after I got home tomorrow. But since you ask...
 
 Please pull from
   git://git.infradead.org/iommu-2.6.git
 
 David Woodhouse (1):
   iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges
 
 Jiang Liu (2):
   iommu/vt-d: fix memory leakage caused by commit ea8ea46
   iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors
 
  drivers/iommu/dmar.c        |  3 ++-
  drivers/iommu/intel-iommu.c | 10 +++---
  2 files changed, 9 insertions(+), 4 deletions(-)

Pulled, thanks David. I will also do some additional testing before
sending it upstream.


Joerg




Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Jiang Liu
Hi Davidlohr,
Thanks for the information!
According to the lspci output, device :02:00.2 is the HP iLO
controller and device :03:00.0 is the RAID controller. Both the iLO and
RAID controllers need to access the reserved memory range
[0x7f61e000 - 0x7f61] in physical mode.

According to the dmesg output, the BIOS has reserved the memory and the
IOMMU driver has set up a 1:1 mapping so that the iLO and RAID controllers
can access this range. The related log messages are:
BIOS-e820: [mem 0x7f61d000-0x8fff] reserved
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device :03:00.0 [0x7f61e000 - 0x7f61]
IOMMU: Setting identity map for device :02:00.0 [0x7f61e000 - 0x7f61]
IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]

From the screenshot, device :02:00.2 fails to access memory
address 0x7f61e000. That indicates the IOMMU driver failed to set up
the 1:1 mapping of the reserved memory range for the iLO controller.
Could you please check whether boot messages like
"IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]"
still appear with the failing kernel image?

It would be great if the boot messages could be saved when the boot
fails, so that we can get more information from the log.

BTW, I have double-checked the related code and still can't
find a reliable explanation for the regression :(

Thanks!
Gerry

On 2014/4/11 0:19, Davidlohr Bueso wrote:
 On Thu, 2014-04-10 at 08:46 +, Woodhouse, David wrote:
 On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
 [+ David, VT-d maintainer ]

 Jiang, David, can you please have a look into this issue?


 DMAR:[fault reason 02] Present bit in context entry is clear
 dmar: DRHD: handling fault status reg 602
 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

 That Present bit in context entry is clear fault means that we have
 not set up *any* mappings for this PCI device… on this IOMMU.

 Yes, specifically (finally done bisecting):

 commit 2e45528930388658603ea24d49cf52867b928d3e
 Author: Jiang Liu jiang@linux.intel.com
 Date:   Wed Feb 19 14:07:36 2014 +0800

 iommu/vt-d: Unify the way to process DMAR device scope array

 This commit is about how we decide which IOMMU a given PCI device is
 attached to.

 Thus, my first guess would be that we are quite happily setting up the
 requested DMA maps on the *wrong* IOMMU, and then taking faults when the
 device actually tries to do DMA.

 However, I'm not 100% convinced of that. The fault address looks
 suspiciously like a true physical address, not a virtual bus address of
 the type that we'd normally allocate for a dma_map_* operation. Those
 would start at 0xf000 and work downwards, typically.

 Do you have 'iommu=pt' on the kernel command line? 
 
 No.
 
 Can I see the full
 dmesg as this system boots, and also a copy of the DMAR table?
 
 Attaching a dmesg from one of the kernels that boots. It doesn't appear
 to have much of the related information... is there any debug config
 option I can enable that might give you more data?
 

Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Jiang Liu
Hi all,
I think I have found the root cause. It's a bug in matching the
device scope: the variable 'level' should be decremented on every step
when walking up the PCI topology.
Could you please test the following patch?
Thanks!
Gerry

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f445c10..1f8308c 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -152,7 +152,7 @@ dmar_alloc_pci_notify_info(struct pci_dev *dev, unsigned long event)
         info->seg = pci_domain_nr(dev->bus);
         info->level = level;
         if (event == BUS_NOTIFY_ADD_DEVICE) {
-                for (tmp = dev, level--; tmp; tmp = tmp->bus->self) {
+                for (tmp = dev, level--; tmp; level--, tmp = tmp->bus->self) {
                         info->path[level].device = PCI_SLOT(tmp->devfn);
                         info->path[level].function = PCI_FUNC(tmp->devfn);
                         if (pci_is_root_bus(tmp->bus))
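
A minimal user-space sketch of the indexing problem this patch addresses,
not the kernel code itself. The names fake_dev, path_entry, path_depth and
fill_path are hypothetical stand-ins, and the parent pointer replaces the
kernel's tmp->bus->self walk. The path array is meant to be filled
root-first while walking from the device up to the root bus, so the index
has to drop by one on every hop:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct fake_dev {
    unsigned char slot;
    unsigned char func;
    struct fake_dev *parent;   /* upstream bridge, NULL once on the root bus */
};

/* Hypothetical stand-in for one (device, function) hop of a DMAR path. */
struct path_entry {
    unsigned char device;
    unsigned char function;
};

/* Number of hops from dev up to and including the device on the root bus. */
static int path_depth(const struct fake_dev *dev)
{
    int depth = 0;

    for (; dev; dev = dev->parent)
        depth++;
    return depth;
}

/*
 * Fill path[0..depth-1] root-first. With fixed == 0 the index is only
 * decremented once before the loop (the pre-patch behaviour), so every
 * hop overwrites the last slot. With fixed != 0 it is decremented per hop.
 * Unwritten slots keep the 0xff marker so they are easy to spot.
 */
static void fill_path(const struct fake_dev *dev, struct path_entry *path,
                      int depth, int fixed)
{
    const struct fake_dev *tmp;
    int level = depth;

    memset(path, 0xff, sizeof(*path) * (size_t)depth);
    for (tmp = dev, level--; tmp; tmp = tmp->parent) {
        path[level].device = tmp->slot;
        path[level].function = tmp->func;
        if (fixed)
            level--;
    }
}
```

With a two-hop chain (a root port at 1c.4 and a function 00.2 behind it),
the per-hop decrement yields the path 1c.4, 00.2. Without it, both hops
land in the last slot and the first slot is never written, so a comparison
against a two-entry device scope path cannot match.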



Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Davidlohr Bueso
Sorry for the delay, I've had to take turns on this box.

On Fri, 2014-04-11 at 09:18 +, Woodhouse, David wrote:
 On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
  Attaching a dmesg from one of the kernels that boots. It doesn't appear
  to have much of the related information... is there any debug config
  option I can enable that might give you more data?
 
 I'd like the contents of /sys/firmware/acpi/tables/DMAR please.

Attached is the disassembly of the raw output.

  And
 please could you also apply this patch to both the last-working and
 first-failing kernels and show me the output in both cases?

I still haven't been able to collect the output for the first failing
kernel, but below is the output for the last working one. Thanks.

Device 0:03:00.0 on IOMMU at a800
Device 0:03:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.0 [0x7f61e000 - 0x7f61]
Device 0:02:00.0 on IOMMU at a800
Device 0:02:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.2 [0x7f61e000 - 0x7f61]
Device 0:02:00.2 on IOMMU at a800
Device 0:02:00.2 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.0 on IOMMU at a800
Device 0:00:1d.0 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.1 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.1 on IOMMU at a800
Device 0:00:1d.1 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.2 on IOMMU at a800
Device 0:00:1d.2 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.3 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.3 on IOMMU at a800
Device 0:00:1d.3 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.0 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.2 on IOMMU at a800
IOMMU: Setting identity map for device :02:00.4 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.4 on IOMMU at a800
Device 0:02:00.4 on IOMMU at a800
IOMMU: Setting identity map for device :00:1d.7 [0x7f7ee000 - 0x7f7e]
Device 0:00:1d.7 on IOMMU at a800
Device 0:00:1d.7 on IOMMU at a800
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device :00:1f.0 [0x0 - 0xff]
Device 0:00:1f.0 on IOMMU at a800
Device 0:00:1f.0 on IOMMU at a800
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
Device 0:00:00.0 on IOMMU at a800
Device 0:00:01.0 on IOMMU at a800
Device 0:00:02.0 on IOMMU at a800
Device 0:00:03.0 on IOMMU at a800
Device 0:00:04.0 on IOMMU at a800
Device 0:00:05.0 on IOMMU at a800
Device 0:00:06.0 on IOMMU at a800
Device 0:00:07.0 on IOMMU at a800
Device 0:00:08.0 on IOMMU at a800
Device 0:00:09.0 on IOMMU at a800
Device 0:00:0a.0 on IOMMU at a800
Device 0:00:14.0 on IOMMU at a800
Device 0:00:1c.0 on IOMMU at a800
Device 0:00:1c.4 on IOMMU at a800
Device 0:00:1d.0 on IOMMU at a800
Device 0:00:1d.1 on IOMMU at a800
Device 0:00:1d.2 on IOMMU at a800
Device 0:00:1d.3 on IOMMU at a800
Device 0:00:1d.7 on IOMMU at a800
Device 0:00:1e.0 on IOMMU at a800
Device 0:00:1f.0 on IOMMU at a800
Device 0:04:00.0 on IOMMU at a800
Device 0:04:00.1 on IOMMU at a800
Device 0:04:00.2 on IOMMU at a800
Device 0:04:00.3 on IOMMU at a800
Device 0:03:00.0 on IOMMU at a800
Device 0:02:00.0 on IOMMU at a800
Device 0:02:00.2 on IOMMU at a800
Device 0:02:00.4 on IOMMU at a800
Device 0:01:03.0 on IOMMU at a800
Device 0:50:00.0 on IOMMU at ac00
Device 0:50:01.0 on IOMMU at ac00
Device 0:50:02.0 on IOMMU at ac00
Device 0:50:03.0 on IOMMU at ac00
Device 0:50:04.0 on IOMMU at ac00
Device 0:50:05.0 on IOMMU at ac00
Device 0:50:06.0 on IOMMU at ac00
Device 0:50:07.0 on IOMMU at ac00
Device 0:50:08.0 on IOMMU at ac00
Device 0:50:09.0 on IOMMU at ac00
Device 0:50:0a.0 on IOMMU at ac00
Device 0:50:14.0 on IOMMU at a800
Device 0:a0:00.0 on IOMMU at b000
Device 0:a0:01.0 on IOMMU at b000
Device 0:a0:02.0 on IOMMU at b000
Device 0:a0:03.0 on IOMMU at b000
Device 0:a0:04.0 on IOMMU at b000
Device 0:a0:05.0 on IOMMU at b000
Device 0:a0:06.0 on IOMMU at b000
Device 0:a0:07.0 on IOMMU at b000
Device 0:a0:08.0 on IOMMU at b000
Device 0:a0:09.0 on IOMMU at b000
Device 0:a0:0a.0 on IOMMU at b000
Device 0:a0:14.0 on IOMMU at a800
Device 0:7c:00.0 on IOMMU at a800
Device 0:7c:08.0 on IOMMU at a800
Device 0:82:00.0 on IOMMU at a800
Device 0:82:08.0 on IOMMU at a800

/*
 * Intel ACPI Component Architecture
 * AML Disassembler version 20140325-64 [Apr 11 2014]
 * Copyright (c) 2000 - 2014 Intel Corporation
 * 
 * Disassembly of DMAR.raw, Fri Apr 11 09:10:10 2014
 *
 * ACPI Data Table [DMAR]
 *
 * Format: 

Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Jiang Liu
Hi Davidlohr,
Thanks for providing the DMAR table. According to the DMAR
table, a bug in the IOMMU driver causes it to mishandle this entry:
[1D2h 0466   1]  Device Scope Entry Type : 01
[1D3h 0467   1]             Entry Length : 0A
[1D4h 0468   2]                 Reserved : 
[1D6h 0470   1]           Enumeration ID : 00
[1D7h 0471   1]           PCI Bus Number : 00
[1D8h 0472   2]                 PCI Path : 1C,04
[1DAh 0474   2]                 PCI Path : 00,02
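
As a rough illustration of how the entry above decodes: a device scope
entry has a 6-byte fixed part (type, length, reserved, enumeration ID,
start bus number) followed by two bytes per path hop, so a length of 0x0A
leaves room for two hops. The sketch below assumes that layout; the helper
names scope_path_count and scope_path_at are hypothetical and are not the
kernel's own parsing code.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed part of a DMAR device scope entry, before the PCI path pairs. */
#define SCOPE_HEADER_LEN 6u

struct scope_path {
    uint8_t device;
    uint8_t function;
};

/*
 * Return the number of (device, function) hops in a raw device scope
 * entry, or -1 if the length byte is inconsistent with buf_len.
 */
static int scope_path_count(const uint8_t *entry, size_t buf_len)
{
    uint8_t length;

    if (buf_len < SCOPE_HEADER_LEN)
        return -1;
    length = entry[1];              /* byte 1 holds the entry length */
    if (length < SCOPE_HEADER_LEN || length > buf_len ||
        ((length - SCOPE_HEADER_LEN) % 2u) != 0)
        return -1;
    return (int)((length - SCOPE_HEADER_LEN) / 2u);
}

/* Copy hop number idx (0 = the hop on the start bus) out of the entry. */
static struct scope_path scope_path_at(const uint8_t *entry, int idx)
{
    struct scope_path p;

    p.device = entry[SCOPE_HEADER_LEN + 2u * (unsigned int)idx];
    p.function = entry[SCOPE_HEADER_LEN + 2u * (unsigned int)idx + 1u];
    return p;
}
```

Fed the ten bytes of the entry above (with the reserved field taken to be
zero, which is an assumption since its value is missing from the dump),
this gives two hops: 1c.4 on bus 00, then 00.2 on the bus behind it. A
path with more than one hop is the case in which the walk up the PCI
topology has to index each hop separately.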

The patch I sent out should fix this bug. Could you please give it
a try?
Thanks!
Gerry


Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Davidlohr Bueso
On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
 Hi Davidlohr,
   Thanks for providing the DMAR table. According to the DMAR
 table, one bug in the iommu driver fails to handle this entry:
 [1D2h 0466   1]  Device Scope Entry Type : 01
 [1D3h 0467   1] Entry Length : 0A
 [1D4h 0468   2] Reserved : 
 [1D6h 0470   1]   Enumeration ID : 00
 [1D7h 0471   1]   PCI Bus Number : 00
 [1D8h 0472   2] PCI Path : 1C,04
 [1DAh 0474   2] PCI Path : 00,02
 
   And the patch sent out by me should fix this bug. Could you please help
 to have a try?

Sorry, I am unable to find any patches from you regarding this issue...
I must be missing something. Could you please point me to the lkml link?

Thanks.



Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Davidlohr Bueso
On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
 On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
  Hi Davidlohr,
  Thanks for providing the DMAR table. According to the DMAR
  table, one bug in the iommu driver fails to handle this entry:
  [1D2h 0466   1]  Device Scope Entry Type : 01
  [1D3h 0467   1] Entry Length : 0A
  [1D4h 0468   2] Reserved : 
  [1D6h 0470   1]   Enumeration ID : 00
  [1D7h 0471   1]   PCI Bus Number : 00
  [1D8h 0472   2] PCI Path : 1C,04
  [1DAh 0474   2] PCI Path : 00,02
  
  And the patch sent out by me should fix this bug. Could you please help
  to have a try?
 
 Sorry, I am unable to find any patches from you regarding this issue...
 I must be missing something. Could you please point me to the lkml link?

Never mind, I got it internally. I'll let you know as soon as I can
test it later today.



Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Woodhouse, David
On Mon, 2014-04-14 at 09:47 -0700, Davidlohr Bueso wrote:
 On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
  On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
   Hi Davidlohr,
 Thanks for providing the DMAR table. According to the DMAR
   table, one bug in the iommu driver fails to handle this entry:
   [1D2h 0466   1]  Device Scope Entry Type : 01
   [1D3h 0467   1] Entry Length : 0A
   [1D4h 0468   2] Reserved : 
   [1D6h 0470   1]   Enumeration ID : 00
   [1D7h 0471   1]   PCI Bus Number : 00
   [1D8h 0472   2] PCI Path : 1C,04
   [1DAh 0474   2] PCI Path : 00,02
   
 And the patch sent out by me should fix this bug. Could you please help
   to have a try?
  
  Sorry, I am unable to find any patches from you regarding this issue...
  I must be missing something. Could you please point me to the lkml link?
 
 Never mind, I got it internally. I'll let you know  as soon as I can
 test it later today.

Thanks.

Jiang, if you can let me have a copy with a Signed-off-by, I'll
shepherd it upstream along with your other patch, which is already in my
iommu-2.6.git tree.

-- 
  Sent with Evolution's ActiveSync support.

David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation







Re: hpsa driver bug crack kernel down!

2014-04-14 Thread Davidlohr Bueso
On Mon, 2014-04-14 at 16:57 +0800, Jiang Liu wrote:
 Hi all,
   I guess I found the root cause. It's a bug in matching
 device scope, variable 'level' should be decreased when walking up PCI
 topology.
   Could you please help to test following patch?
 Thanks!
 Gerry

Worked like a charm -- I no longer see all those DMAR messages and the
hpsa hard lockup is gone, thanks. Feel free to add my:

Reported-and-tested-by: Davidlohr Bueso <davidl...@hp.com>



Re: hpsa driver bug crack kernel down!

2014-04-11 Thread David Woodhouse
On Thu, 2014-04-10 at 17:17 -0600, Shuah Khan wrote:
 This smells very much like the problem that was solved couple of years
 ago for SI domain. It is likely that path is broken with the DMAR
 device scope array change. Please take a look to see if the following
 no longer occurs. Looks like BIOS could be expecting this RMRR to be
 still mapped.
 
         /*
          * We want to prevent any device associated with an RMRR from
          * getting placed into the SI Domain. This is done because
          * problems exist when devices are moved in and out of domains
          * and their respective RMRR info is lost. We exempt USB devices
          * from this process due to their usage of RMRRs that are known
          * to not be needed after BIOS hand-off to OS.
          */
         if (device_has_rmrr(dev) &&
             (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
                 return 0;

Yeah, I'd be inclined to agree although I've tested with graphics
*since* these patches. That's another case where we need to preserve the
RMRR mapping after the driver takes over — and it *was* working.

-- 
David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-11 Thread Woodhouse, David
On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
 Attaching a dmesg from one of the kernels that boots. It doesn't appear
 to have much of the related information... is there any debug config
 option I can enable that might give you more data?

I'd like the contents of /sys/firmware/acpi/tables/DMAR please. And
please could you also apply this patch to both the last-working and
first-failing kernels and show me the output in both cases?

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index dd576c0..d52ac03 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -683,6 +683,12 @@ static struct intel_iommu *device_to_iommu(int segment, u8 bus, u8 devfn)
  out:
         rcu_read_unlock();
 
+        if (iommu)
+                printk("Device %x:%02x:%02x.%d on IOMMU at %llx\n", segment, bus,
+                       PCI_SLOT(devfn), PCI_FUNC(devfn), drhd->reg_base_addr);
+        else
+                printk("Device %x:%02x:%02x.%d on no IOMMU\n", segment, bus,
+                       PCI_SLOT(devfn), PCI_FUNC(devfn));
         return iommu;
 }
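
For readers matching the debug output against lspci, here is a small
user-space sketch of how a devfn byte splits into slot and function and
how the "segment:bus:slot.function" string is put together. SLOT_OF and
FUNC_OF mirror the bit layout of the kernel's PCI_SLOT() and PCI_FUNC()
macros; MAKE_DEVFN and format_bdf are hypothetical helpers for
illustration only.

```c
#include <stdio.h>

/* Same bit layout as the kernel's PCI_SLOT() and PCI_FUNC() macros. */
#define SLOT_OF(devfn)  (((unsigned int)(devfn) >> 3) & 0x1fu)
#define FUNC_OF(devfn)  ((unsigned int)(devfn) & 0x07u)

/* Hypothetical helper: pack a slot (5 bits) and function (3 bits). */
#define MAKE_DEVFN(slot, func) \
    ((unsigned char)((((slot) & 0x1f) << 3) | ((func) & 0x07)))

/* Format a device address the same way the debug printk does. */
static int format_bdf(char *buf, size_t len, unsigned int segment,
                      unsigned char bus, unsigned char devfn)
{
    return snprintf(buf, len, "%x:%02x:%02x.%u", segment,
                    (unsigned int)bus, SLOT_OF(devfn), FUNC_OF(devfn));
}
```

For example, bus 0x02 with slot 0 and function 2 formats as "0:02:00.2".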
 


-- 
  Sent with Evolution's ActiveSync support.

David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation







Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Joerg Roedel
[+ David, VT-d maintainer ]

Jiang, David, can you please have a look into this issue?

Thanks,

Joerg

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
 On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
  [+cc Joerg, iommu list]
  
  On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
   On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
   On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
 On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
  [+linux-scsi]
  On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
   On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
Hi,
   
The kernel is 3.14.0+ which is pulled just now.
  
   Cc'ing more people.
  
   While the hpsa driver appears to be involved in some way, I'm not
   sure if this is a related issue, but as of today's pull I'm getting
   another problem that causes my DL980 not to come up.
  
   *Massive* amounts of:
  
   DMAR:[fault reason 02] Present bit in context entry is clear
   dmar: DRHD: handling fault status reg 602
   dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
  
   Then:
  
   hpsa :03:00.0: Controller lockup detected: 0x
   ...
   Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
   ...
  
   Screenshot of the actual LOCKUP:
   http://stgolabs.net/hpsa-hard-lockup-3.14+.png
  
   While I haven't bisected, things worked fine at least until commit
   39de65aa2c3e (April 2nd).
  
   Any ideas?
 
  Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
  that everything worked fine until 39de65aa2c3e would tend to vindicate
  hpsa,
   
Hmm here you mean DMA, right?
  
   No, it vindicates the hpsa changes ... they don't seem to be causing
   problems until something goes wrong with dma remapping.
  
 because all the hpsa changes went in before that under
 Missing crucial info:

 commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1

  Merge: 3e75c6d b2bff6c
  Author: Linus Torvalds torva...@linux-foundation.org
  Date:   Tue Apr 1 18:49:04 2014 -0700
 
  Merge tag 'scsi-misc' of
  git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
 
  can you revalidate that this commit works OK just to make sure?
   
Ok so I don't see those DMA messages and system starts just fine. I'm
thinking perhaps something broke after the IO mmu stuff in commit
3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
causing the CPU stalls and just blame hpsa in the path as a side effect?
   
/me goes out to try the commit.
  
   That's my guess.  The DMAR messages are DMA remapping issues caused in
   the IOMMU.  If I had to guess, I'd say the DMAR fault message is
   indicating the IOMMU is calling for a mapping address before it can
   satisfy the driver read request, which is causing the hang apparently in
   the hpsa driver.
  
   I've added linux-pci to the cc; I think they deal with iommu issues on
   x86.
  
   So that merge commit appears to be the culprit, I see both the DMA
   messages and the lockup blaming hpsa...
  
  My understanding so far (please correct me if I'm wrong):
  
  39de65aa2c3e OK (Merge branch 'i2c/for-next')
  1a0b6abaea78 OK (Merge tag 'scsi-misc')
  3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')
 
 Yes, specifically (finally done bisecting):
 
 commit 2e45528930388658603ea24d49cf52867b928d3e
 Author: Jiang Liu jiang@linux.intel.com
 Date:   Wed Feb 19 14:07:36 2014 +0800
 
 iommu/vt-d: Unify the way to process DMAR device scope array
 
 Now we have a PCI bus notification based mechanism to update DMAR
 device scope array, we could extend the mechanism to support boot
 time initialization too, which will help to unify and simplify
 the implementation.
 
 Signed-off-by: Jiang Liu jiang@linux.intel.com
 Signed-off-by: Joerg Roedel j...@8bytes.org
 



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David
On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
 [+ David, VT-d maintainer ]
 
 Jiang, David, can you please have a look into this issue?
 

DMAR:[fault reason 02] Present bit in context entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

That "Present bit in context entry is clear" fault means that we have
not set up *any* mappings for this PCI device… on this IOMMU.
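
To make that fault reason concrete, here is a deliberately simplified
model of the per-IOMMU lookup: a root table indexed by bus number points
to a context table indexed by devfn, and bit 0 of the context entry is the
Present bit. The struct and function names are hypothetical, and a NULL
pointer stands in for a non-present root entry, which in hardware is a
physical address plus its own Present bit. This is a sketch of the idea,
not the driver's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* One 128-bit context entry; bit 0 of the low word is the Present bit. */
struct ctx_entry {
    uint64_t lo;
    uint64_t hi;
};

/* One context table per bus, indexed by devfn. */
struct ctx_table {
    struct ctx_entry entries[256];
};

/* One root table per IOMMU, indexed by bus number. */
struct root_table {
    struct ctx_table *buses[256];
};

/* The non-zero values follow the VT-d fault reason numbering. */
enum lookup_result {
    LOOKUP_OK = 0,
    FAULT_ROOT_NOT_PRESENT = 1,
    FAULT_CONTEXT_NOT_PRESENT = 2
};

/* Model of what one IOMMU checks before it can translate a request. */
static enum lookup_result lookup_context(const struct root_table *root,
                                         uint8_t bus, uint8_t devfn)
{
    const struct ctx_table *ct = root->buses[bus];

    if (ct == NULL)
        return FAULT_ROOT_NOT_PRESENT;
    if ((ct->entries[devfn].lo & 1u) == 0)
        return FAULT_CONTEXT_NOT_PRESENT;
    return LOOKUP_OK;
}
```

Because every IOMMU has its own root table, a context entry that was set
up on a different unit does not help: the unit that actually sees the DMA
still finds the Present bit clear and reports fault reason 02.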

  Yes, specifically (finally done bisecting):
  
  commit 2e45528930388658603ea24d49cf52867b928d3e
  Author: Jiang Liu jiang@linux.intel.com
  Date:   Wed Feb 19 14:07:36 2014 +0800
  
  iommu/vt-d: Unify the way to process DMAR device scope array

This commit is about how we decide which IOMMU a given PCI device is
attached to.

Thus, my first guess would be that we are quite happily setting up the
requested DMA maps on the *wrong* IOMMU, and then taking faults when the
device actually tries to do DMA.

However, I'm not 100% convinced of that. The fault address looks
suspiciously like a true physical address, not a virtual bus address of
the type that we'd normally allocate for a dma_map_* operation. Those
would start at 0xf000 and work downwards, typically.

Do you have 'iommu=pt' on the kernel command line? Can I see the full
dmesg as this system boots, and also a copy of the DMAR table?


We should also rate-limit DMA faults, which would avoid the lockup
failure mode. Bjorn, what should an IOMMU driver *do* when it detects
that a device is creating an endless stream of DMA faults and isn't
aborting the transaction?

I can set it to silent so that it just stops *reporting* the DMA faults
for that device... and I suppose I can re-enable them when I next see a
DMA mapping for it (although actually it'd be better to have a hook to
do that on FLR or something like that). But there must be a better
answer than that, surely? And I don't want to hack it up locally in
*one* specific IOMMU driver, any more than I have to.
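
As a rough user-space sketch of what limiting only the *reporting* could
look like: allow a small burst of reports per time window and count the
rest as suppressed. The struct fault_limiter and fault_should_report names
are hypothetical, time is passed in as an abstract tick count, and this is
not how any existing driver does it; in the kernel one would more likely
reach for the existing printk rate-limiting helpers.

```c
#include <stdint.h>

/* Hypothetical per-device state for limiting fault reports. */
struct fault_limiter {
    uint64_t window_start;   /* start of the current window, in ticks */
    uint64_t window_len;     /* window length, in ticks */
    unsigned int burst;      /* reports allowed per window */
    unsigned int printed;    /* reports emitted in the current window */
    unsigned int suppressed; /* reports dropped so far */
};

/* Return 1 if a fault seen at time 'now' should be reported, else 0. */
static int fault_should_report(struct fault_limiter *fl, uint64_t now)
{
    if (now - fl->window_start >= fl->window_len) {
        /* A new window begins: reset the per-window budget. */
        fl->window_start = now;
        fl->printed = 0;
    }
    if (fl->printed < fl->burst) {
        fl->printed++;
        return 1;
    }
    fl->suppressed++;
    return 0;
}
```

This only keeps the log and the CPU from being swamped. It does nothing
about the device that keeps issuing the faulting DMA, which is the open
question raised above.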

On a POWER system with EEH, the kernel would end up isolating the
offending device completely, and subsequently resetting it...

-- 
David Woodhouse            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Davidlohr Bueso
On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
 [+cc Joerg, iommu list]
 
 On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
  On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
  On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
   On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
 [+linux-scsi]
 On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
  On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
   Hi,
  
   The kernel is 3.14.0+ which is pulled just now.
 
  Cc'ing more people.
 
  While the hpsa driver appears to be involved in some way, I'm not
  sure if this is a related issue, but as of today's pull I'm getting
  another problem that causes my DL980 not to come up.
 
  *Massive* amounts of:
 
  DMAR:[fault reason 02] Present bit in context entry is clear
  dmar: DRHD: handling fault status reg 602
  dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
 
  Then:
 
  hpsa :03:00.0: Controller lockup detected: 0x
  ...
  Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
  ...
 
  Screenshot of the actual LOCKUP:
  http://stgolabs.net/hpsa-hard-lockup-3.14+.png
 
  While I haven't bisected, things worked fine until at least commit
  39de65aa2c3e (April 2nd).
 
  Any ideas?

 Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
 that everything worked fine until 39de65aa2c3e would tend to vindicate
 hpsa,
  
   Hmm here you mean DMA, right?
 
  No, it vindicates the hpsa changes ... they don't seem to be causing
  problems until something goes wrong with dma remapping.
 
because all the hpsa changes went in before that under
Missing crucial info:
   
commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
   
 Merge: 3e75c6d b2bff6c
 Author: Linus Torvalds torva...@linux-foundation.org
 Date:   Tue Apr 1 18:49:04 2014 -0700

 Merge tag 'scsi-misc' of
 git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

 can you revalidate that this commit works OK just to make sure?
  
   Ok so I don't see those DMA messages and system starts just fine. I'm
   thinking perhaps something broke after the IO mmu stuff in commit
   3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
   causing the CPU stalls and just blame hpsa in the path as a side effect?
  
   /me goes out to try the commit.
 
  That's my guess.  The DMAR messages are DMA remapping issues caused in
  the IOMMU.  If I had to guess, I'd say the DMAR fault message is
  indicating the IOMMU is calling for a mapping address before it can
  satisfy the driver read request, which is causing the hang apparently in
  the hpsa driver.
 
  I've added linux-pci to the cc; I think they deal with iommu issues on
  x86.
 
  So that merge commit appears to be the culprit, I see both the DMA
  messages and the lockup blaming hpsa...
 
 My understanding so far (please correct me if I'm wrong):
 
 39de65aa2c3e OK (Merge branch 'i2c/for-next')
 1a0b6abaea78 OK (Merge tag 'scsi-misc')
 3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')

Yes, specifically (finally done bisecting):

commit 2e45528930388658603ea24d49cf52867b928d3e
Author: Jiang Liu jiang@linux.intel.com
Date:   Wed Feb 19 14:07:36 2014 +0800

iommu/vt-d: Unify the way to process DMAR device scope array

Now we have a PCI bus notification based mechanism to update DMAR
device scope array, we could extend the mechanism to support boot
time initialization too, which will help to unify and simplify
the implementation.

Signed-off-by: Jiang Liu jiang@linux.intel.com
Signed-off-by: Joerg Roedel j...@8bytes.org



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Bjorn Helgaas
On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
david.woodho...@intel.com wrote:

DMAR:[fault reason 02] Present bit in context entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

 That Present bit in context entry is clear fault means that we have
 not set up *any* mappings for this PCI device… on this IOMMU.

  Yes, specifically (finally done bisecting):
 
  commit 2e45528930388658603ea24d49cf52867b928d3e
  Author: Jiang Liu jiang@linux.intel.com
  Date:   Wed Feb 19 14:07:36 2014 +0800
 
  iommu/vt-d: Unify the way to process DMAR device scope array

 This commit is about how we decide which IOMMU a given PCI device is
 attached to.

 Thus, my first guess would be that we are quite happily setting up the
 requested DMA maps on the *wrong* IOMMU, and then taking faults when the
 device actually tries to do DMA.

 However, I'm not 100% convinced of that. The fault address looks
 suspiciously like a true physical address, not a virtual bus address of
 the type that we'd normally allocate for a dma_map_* operation. Those
 would start at 0xf000 and work downwards, typically.

I like the wrong IOMMU (or no IOMMU at all) theory.  If we didn't
connect the device with an IOMMU at all, that would explain the device
DMAing directly to a physical address, wouldn't it?

 Do you have 'iommu=pt' on the kernel command line? Can I see the full
 dmesg as this system boots, and also a copy of the DMAR table?

 We should also rate-limit DMA faults, which would avoid the lockup
 failure mode. Bjorn, what should an IOMMU driver *do* when it detects
 that a device is creating an endless stream of DMA faults and isn't
 aborting the transaction?

You mentioned that POWER with EEH does something intelligent in this
case, but I'm not familiar with that code.  We have AER support, which
can result in resetting a device, but I think DMA faults are reported
differently, and I don't think there's any nice existing way for PCI
to deal with them.  Maybe there should be, though.

Bjorn

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Linda Knippers
On 4/10/2014 11:14 AM, Bjorn Helgaas wrote:
 On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
 david.woodho...@intel.com wrote:
 
 DMAR:[fault reason 02] Present bit in context entry is clear
 dmar: DRHD: handling fault status reg 602
 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

 That Present bit in context entry is clear fault means that we have
 not set up *any* mappings for this PCI device… on this IOMMU.

 Yes, specifically (finally done bisecting):

 commit 2e45528930388658603ea24d49cf52867b928d3e
 Author: Jiang Liu jiang@linux.intel.com
 Date:   Wed Feb 19 14:07:36 2014 +0800

 iommu/vt-d: Unify the way to process DMAR device scope array

 This commit is about how we decide which IOMMU a given PCI device is
 attached to.

 Thus, my first guess would be that we are quite happily setting up the
 requested DMA maps on the *wrong* IOMMU, and then taking faults when the
 device actually tries to do DMA.

 However, I'm not 100% convinced of that. The fault address looks
 suspiciously like a true physical address, not a virtual bus address of
 the type that we'd normally allocate for a dma_map_* operation. Those
 would start at 0xf000 and work downwards, typically.
 
 I like the wrong IOMMU (or no IOMMU at all) theory.  If we didn't
 connect the device with an IOMMU at all, that would explain the device
 DMAing directly to a physical address, wouldn't it?
 
 Do you have 'iommu=pt' on the kernel command line? Can I see the full
 dmesg as this system boots, and also a copy of the DMAR table?

This will be really helpful information.  This box has devices with
RMRR records and if they're not set up correctly, DMAR faults can occur.


 We should also rate-limit DMA faults, which would avoid the lockup
 failure mode. Bjorn, what should an IOMMU driver *do* when it detects
 that a device is creating an endless stream of DMA faults and isn't
 aborting the transaction?
 
 You mentioned that POWER with EEH does something intelligent in this
 case, but I'm not familiar with that code.  We have AER support, which
 can result in resetting a device, but I think DMA faults are reported
 differently, and I don't think there's any nice existing way for PCI
 to deal with them.  Maybe there should be, though.
 
 Bjorn

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David
On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote:
  Thus, my first guess would be that we are quite happily setting up the
  requested DMA maps on the *wrong* IOMMU, and then taking faults when the
  device actually tries to do DMA.
 
 I like the wrong IOMMU (or no IOMMU at all) theory.  If we didn't
 connect the device with an IOMMU at all, that would explain the device
 DMAing directly to a physical address, wouldn't it?

An unlikely failure mode. We're much more likely to see *wrong* IOMMU
than no IOMMU. And thus we'd still see the distinctive virtual addresses
just below 4GiB.
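
For readers unfamiliar with the lookup being discussed, the following is a much-simplified sketch in C of matching a PCI device to a remapping unit. All names (struct dev_scope, struct drhd_unit, find_drhd) are hypothetical and this is not the kernel's implementation: real device scope entries describe a path through bridges, and the kernel also considers upstream bridges. The sketch assumes each unit lists its devices explicitly and that one unit may be flagged as a catch-all for everything not listed elsewhere (the DMAR table has such an include-all flag). Under that assumption, a mistake in the explicit matching tends to land a device on the catch-all unit, i.e. the wrong IOMMU, rather than on no unit at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened device scope entry: one PCI bus/devfn pair. */
struct dev_scope {
    uint8_t bus;
    uint8_t devfn;   /* (device << 3) | function */
};

/* Hypothetical remapping unit with an explicit scope list. */
struct drhd_unit {
    int id;
    bool include_all;               /* catch-all for devices not listed elsewhere */
    const struct dev_scope *scope;  /* may be NULL when nscope == 0 */
    size_t nscope;
};

/*
 * Returns the unit whose scope list names the device; otherwise the first
 * catch-all unit; otherwise NULL (no unit claims the device).
 */
static const struct drhd_unit *find_drhd(const struct drhd_unit *units, size_t n,
                                         uint8_t bus, uint8_t devfn)
{
    const struct drhd_unit *fallback = NULL;

    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < units[i].nscope; j++)
            if (units[i].scope[j].bus == bus &&
                units[i].scope[j].devfn == devfn)
                return &units[i];
        if (units[i].include_all && fallback == NULL)
            fallback = &units[i];
    }
    return fallback;
}
```

With one unit listing 02:00.0 and a second catch-all unit, a lookup for 02:00.0 returns the first unit, while any device missing from the scope lists silently falls through to the catch-all.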

However, Rob's answer may solve that puzzle. If this is one of those
abominations where the device continues to do DMA to system memory even
after the OS is up and running and *thinks* it has control of the
hardware, then the offending address will be listed in an RMRR entry
(which tells the OS to set up a 1:1 mapping for access to certain memory
ranges for a given device). And will be inside an E820 reserved region.

A little odd that such an error would trigger only when we're actually
trying to initialise the device from the Linux driver, not as soon as we
enable the IOMMU. But all things are possible.

But the DMAR table and dmesg that I asked for would give us a bit more
information and hopefully let us stop speculating...

  We should also rate-limit DMA faults, which would avoid the lockup
  failure mode. Bjorn, what should an IOMMU driver *do* when it detects
  that a device is creating an endless stream of DMA faults and isn't
  aborting the transaction?
 
 You mentioned that POWER with EEH does something intelligent in this
 case, but I'm not familiar with that code.  We have AER support, which
 can result in resetting a device, but I think DMA faults are reported
 differently, and I don't think there's any nice existing way for PCI
 to deal with them.  Maybe there should be, though.

Quite frankly, I don't care how *you* deal with them, or even if you
can. All I want to know is how I tell you about the problem, because *I*
sure as hell don't want to be trying to deal with it in the IOMMU code.
That's a generic PCI layer thing. :)

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Davidlohr Bueso
On Thu, 2014-04-10 at 16:34 +0800, Jiang Liu wrote:
 Hi Baoquan,
   Could you please help to give output of lspci -?

Attached.

 Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)?
 It may have relationship with IOMMU driver.

I honestly don't know. PCI is way out of my area of knowledge.
00:00.0 Host bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port (rev 22)
    Subsystem: Hewlett-Packard Company Device 330b
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Capabilities: access denied

00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=0e, subordinate=10, sec-latency=0
    I/O behind bridge: f000-0fff
    Memory behind bridge: fff0-000f
    Prefetchable memory behind bridge: fff0-000f
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
    BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: access denied
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 (rev 22) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=14, subordinate=14, sec-latency=0
    I/O behind bridge: f000-0fff
    Memory behind bridge: fff0-000f
    Prefetchable memory behind bridge: fff0-000f
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
    BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: access denied
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
    I/O behind bridge: f000-0fff
    Memory behind bridge: 9000-99ff
    Prefetchable memory behind bridge: fff0-000f
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR-
    BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: access denied
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 (rev 22) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=15, subordinate=15, sec-latency=0
    I/O behind bridge: f000-0fff
    Memory behind bridge: fff0-000f
    Prefetchable memory behind bridge: fff0-000f
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
    BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: access denied
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=11, subordinate=13, 

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David
On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
  dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
  
  That Present bit in context entry is clear fault means that we have
  not set up *any* mappings for this PCI device… on this IOMMU.
  
Yes, specifically (finally done bisecting):

commit 2e45528930388658603ea24d49cf52867b928d3e
Author: Jiang Liu jiang@linux.intel.com
Date:   Wed Feb 19 14:07:36 2014 +0800

iommu/vt-d: Unify the way to process DMAR device scope array
  
  This commit is about how we decide which IOMMU a given PCI device is
  attached to.
  
  Thus, my first guess would be that we are quite happily setting up the
  requested DMA maps on the *wrong* IOMMU, and then taking faults when the
  device actually tries to do DMA.
  
  However, I'm not 100% convinced of that. The fault address looks
  suspiciously like a true physical address, not a virtual bus address of
  the type that we'd normally allocate for a dma_map_* operation. Those
  would start at 0xf000 and work downwards, typically.
  
  Do you have 'iommu=pt' on the kernel command line? 
 
 No.
 
  Can I see the full
  dmesg as this system boots, and also a copy of the DMAR table?
 
 Attaching a dmesg from one of the kernels that boots. It doesn't appear
 to have much of the related information... 

It shows us that the address 0x7f61e000 is in an E820-reserved region,
and that there's an RMRR covering that region for an unspecified PCI
device, but that's going to be the hpsa.

So if it isn't just a simple case of us assigning this device to the wrong
IOMMU, *perhaps* it's that we lose the RMRR when the driver takes
control of the device. RMRRs are generally expected to be a boot-time
thing, for things like legacy keyboard/mouse emulation via USB. Using
them while the system is *active* is... horrid. We've often not quite
handled that right.
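
The check described above (does the faulting address fall inside a range that the firmware listed in an RMRR?) can be sketched in a few lines of C. This is an illustration only: struct rmrr_range and rmrr_lookup() are hypothetical names, and the thread does not give the actual RMRR bounds on this machine, so any range passed in is made up by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one RMRR entry: an inclusive address range. */
struct rmrr_range {
    uint64_t base;  /* first byte of the reserved region */
    uint64_t end;   /* last byte of the reserved region (inclusive) */
};

/* Returns the index of the first range containing addr, or -1 if none does. */
static int rmrr_lookup(const struct rmrr_range *r, size_t n, uint64_t addr)
{
    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (addr >= r[i].base && addr <= r[i].end)
            return (int)i;
    return -1;
}
```

If the fault address from the log, 0x7f61e000, lands in one of the ranges, the device was most likely touching memory it is supposed to keep a 1:1 mapping for, which points at a lost RMRR mapping rather than a stray driver DMA.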

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread scameron
On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
 On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
  [+cc Joerg, iommu list]
  
  On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
   On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
   On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
 On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
  [+linux-scsi]
  On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
   On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
Hi,
   
The kernel is 3.14.0+ which is pulled just now.
  
   Cc'ing more people.
  
   While the hpsa driver appears to be involved in some way, I'm not
   sure if this is a related issue, but as of today's pull I'm getting
   another problem that causes my DL980 not to come up.
  
   *Massive* amounts of:
  
   DMAR:[fault reason 02] Present bit in context entry is clear
   dmar: DRHD: handling fault status reg 602
   dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
  
   Then:
  
   hpsa :03:00.0: Controller lockup detected: 0x
   ...
   Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
   ...
  
   Screenshot of the actual LOCKUP:
   http://stgolabs.net/hpsa-hard-lockup-3.14+.png
  
   While I haven't bisected, things worked fine until at least commit
   39de65aa2c3e (April 2nd).
  
   Any ideas?
 
  Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
  that everything worked fine until 39de65aa2c3e would tend to vindicate
  hpsa,
   
Hmm here you mean DMA, right?
  
   No, it vindicates the hpsa changes ... they don't seem to be causing
   problems until something goes wrong with dma remapping.
  
 because all the hpsa changes went in before that under
 Missing crucial info:

 commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1

  Merge: 3e75c6d b2bff6c
  Author: Linus Torvalds torva...@linux-foundation.org
  Date:   Tue Apr 1 18:49:04 2014 -0700
 
  Merge tag 'scsi-misc' of
  git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
 
  can you revalidate that this commit works OK just to make sure?
   
Ok so I don't see those DMA messages and system starts just fine. I'm
thinking perhaps something broke after the IO mmu stuff in commit
3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
causing the CPU stalls and just blame hpsa in the path as a side effect?
   
/me goes out to try the commit.
  
   That's my guess.  The DMAR messages are DMA remapping issues caused in
   the IOMMU.  If I had to guess, I'd say the DMAR fault message is
   indicating the IOMMU is calling for a mapping address before it can
   satisfy the driver read request, which is causing the hang apparently in
   the hpsa driver.
  
   I've added linux-pci to the cc; I think they deal with iommu issues on
   x86.
  
   So that merge commit appears to be the culprit, I see both the DMA
   messages and the lockup blaming hpsa...
  
  My understanding so far (please correct me if I'm wrong):
  
  39de65aa2c3e OK (Merge branch 'i2c/for-next')
  1a0b6abaea78 OK (Merge tag 'scsi-misc')

^^^ this one, 1a0b6abaea78, did not work for me, crashing in
hpsa_enter_performant_mode(), which was surprising to me as I am
pretty sure I tried on this very same machine I'm using now
(DL360p with P420, P430 and P420i) with 3.14-rc-something plus
all the hpsa patches that I thought were merged in.

But now I am seeing:

 [a0002bd0] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
RSP: 0018:88042c515a78  EFLAGS: 00010297
RAX:  RBX: 88042c65 RCX: 0004
RDX:  RSI: 0001 RDI: 
RBP: 88042c515b48 R08:  R09: 8af03cc0
R10:  R11: 0001 R12: 88042c515a98
R13: 6104 R14: 88042c515ad8 R15: a0001630
FS:  7f86f7a38700() GS:88043f56() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
usb 1-1.6: new low-speed USB device number 3 using ehci-pci
CR2:  CR3: 00042c4c3000 CR4: 000407e0
Stack:
 8024 a0c0 abe0 
 00060005 00080007 000a0009 000c000b
 000e000d 001f 00120011 00040013
Call Trace:
 [a0c0] ? SA5_fifo_full+0x20/0x20 [hpsa]
 [abe0] ? SA5_ioaccel_mode1_completed+0xd0/0xd0 [hpsa]
 [a000aab6] hpsa_put_ctlr_into_performant_mode+0x186/0x320 [hpsa]
 [a0005132] ? hpsa_allocate_sg_chain_blocks+0xa2/0xd0 [hpsa]
 [a000b08b] 

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Shuah Khan
On Thu, Apr 10, 2014 at 2:45 PM,  scame...@beardog.cce.hp.com wrote:
  3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')

 Yes, specifically (finally done bisecting):

 commit 2e45528930388658603ea24d49cf52867b928d3e
 Author: Jiang Liu jiang@linux.intel.com
 Date:   Wed Feb 19 14:07:36 2014 +0800

 iommu/vt-d: Unify the way to process DMAR device scope array

 Now we have a PCI bus notification based mechanism to update DMAR
 device scope array, we could extend the mechanism to support boot
 time initialization too, which will help to unify and simplify
 the implementation.

 Signed-off-by: Jiang Liu jiang@linux.intel.com
 Signed-off-by: Joerg Roedel j...@8bytes.org

 My git bisect appears to be converging on something else, something
 within the hpsa patches that I sent up recently, unfortunately for
 me.  Will let you all know when it converges.


This smells very much like the problem that was solved a couple of years
ago for the SI domain. It is likely that path was broken by the DMAR
device scope array change. Please take a look to see whether the
following check no longer takes effect. It looks like the BIOS could be
expecting this RMRR to still be mapped.

    /*
     * We want to prevent any device associated with an RMRR from
     * getting placed into the SI Domain. This is done because
     * problems exist when devices are moved in and out of domains
     * and their respective RMRR info is lost. We exempt USB devices
     * from this process due to their usage of RMRRs that are known
     * to not be needed after BIOS hand-off to OS.
     */
    if (device_has_rmrr(dev) &&
        (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
        return 0;
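
To make the condition in the quoted snippet concrete, here is a small standalone sketch in C. It is not kernel code: may_use_si_domain() is a hypothetical helper and the has_rmrr flag stands in for whatever device_has_rmrr() reports. It relies on the PCI class code being a 24-bit value (base class, subclass, programming interface), so shifting right by 8 drops the programming interface before comparing with PCI_CLASS_SERIAL_USB (0x0c03). The RAID class code used in the note below is an assumption about what an hpsa-like controller would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_CLASS_SERIAL_USB   0x0c03  /* base class 0x0c, subclass 0x03 */
#define PCI_CLASS_STORAGE_RAID 0x0104  /* base class 0x01, subclass 0x04 */

/*
 * class24 is the 24-bit class code: base class, subclass, prog-if.
 * Returns false when the device must be kept out of the static
 * identity (SI) domain: it has an RMRR and is not a USB controller.
 */
static bool may_use_si_domain(bool has_rmrr, uint32_t class24)
{
    if (has_rmrr && (class24 >> 8) != PCI_CLASS_SERIAL_USB)
        return false;
    return true;
}
```

For example, a RAID controller with an RMRR (class code 0x010400) would be kept out of the SI domain, while an EHCI USB controller with an RMRR (class code 0x0c0320) would still be allowed in.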

-- Shuah


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Baoquan He
On 04/10/14 at 04:34pm, Jiang Liu wrote:
 Hi Baoquan,
   Could you please help to give output of lspci -?
 Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)?
 It may have relationship with IOMMU driver.
 Thanks!
 Gerry

Hi,

I just saw your mail now. Do you still need the output of lspci -
on my test machine? 

In fact, I didn't see any DMAR errors related to Intel VT-d issues.

If the output is helpful, I can make a fresh build to do this.

Thanks
Baoquan

 


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Baoquan He
On 04/10/14 at 04:34pm, Jiang Liu wrote:
 Hi Baoquan,
   Could you please help to give output of lspci -?
 Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)?
 It may have relationship with IOMMU driver.
 Thanks!
 Gerry

Well, the machine the bug was reported on is an AMD machine, and it
doesn't have the IOMMU problem. David saw some DMAR errors, so that
should be an Intel machine, which uses VT-d.

 


Re: hpsa driver bug crack kernel down!

2014-04-09 Thread Bjorn Helgaas
[+cc Joerg, iommu list]

On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
 On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
 On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
  On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
   On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
[+linux-scsi]
On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
 On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
  Hi,
 
  The kernel is 3.14.0+ which is pulled just now.

 Cc'ing more people.

 While the hpsa driver appears to be involved in some way, I'm not
 sure if this is a related issue, but as of today's pull I'm getting
 another problem that causes my DL980 not to come up.

 *Massive* amounts of:

 DMAR:[fault reason 02] Present bit in context entry is clear
 dmar: DRHD: handling fault status reg 602
 dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

 Then:

 hpsa :03:00.0: Controller lockup detected: 0x
 ...
 Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
 ...

 Screenshot of the actual LOCKUP:
 http://stgolabs.net/hpsa-hard-lockup-3.14+.png

 While I haven't bisected, things worked fine until at least commit
 39de65aa2c3e (April 2nd).

 Any ideas?
   
Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
that everything worked fine until 39de65aa2c3e would tend to vindicate
hpsa,
 
  Hmm here you mean DMA, right?

 No, it vindicates the hpsa changes ... they don't seem to be causing
 problems until something goes wrong with dma remapping.

   because all the hpsa changes went in before that under
   Missing crucial info:
  
   commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
  
Merge: 3e75c6d b2bff6c
Author: Linus Torvalds torva...@linux-foundation.org
Date:   Tue Apr 1 18:49:04 2014 -0700
   
Merge tag 'scsi-misc' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
   
can you revalidate that this commit works OK just to make sure?
 
  Ok, so with that commit I don't see those DMAR messages and the system
  starts just fine. I'm thinking perhaps something broke after the IOMMU
  stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be
  indirectly causing the CPU stalls, with hpsa just blamed in the path as a
  side effect?
 
  /me goes out to try the commit.

 That's my guess.  The DMAR messages are DMA remapping issues caused in
 the IOMMU.  If I had to guess, I'd say the DMAR fault message is
 indicating the IOMMU is calling for a mapping address before it can
 satisfy the driver read request, which is causing the hang apparently in
 the hpsa driver.

 I've added linux-pci to the cc; I think they deal with iommu issues on
 x86.

 So that merge commit appears to be the culprit: I see both the DMAR
 messages and the lockup blaming hpsa...

My understanding so far (please correct me if I'm wrong):

39de65aa2c3e OK (Merge branch 'i2c/for-next')
1a0b6abaea78 OK (Merge tag 'scsi-misc')
3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')