BUG: bad page map under Xen

2013-10-21 Thread Lukas Hejtmanek
Hello,

I'm trying to get SR-IOV working under Xen (4.2). It almost works, except for
a memory bug. This is easily reproducible in Dom0 alone.

I have a ConnectX-3 card with the latest firmware and OFED 2.0-3 drivers. I
tried the 3.2 kernel from Debian, the 3.10 kernel from Debian, and a vanilla
3.11.5 kernel. All behave the same.

As soon as I issue the ibv_devinfo command, it produces the following messages
in dmesg. With the ib_rdma_bw command I get more of those messages and,
moreover, the OOM killer gets confused and kills almost all processes.

[23502.645455] mlx4_core :06:00.0: mlx4_ib: Port 1 logical link is up
[23550.181907] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.183822] swap_free: Unused swap offset entry 0001
[23550.183868] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.183939] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b83c0480 index:380fe0882
[23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.195382] Pid: 13813, comm: ibv_devinfo Tainted: G   O 
3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.195461] Call Trace:
[23550.195508]  [810d9009] ? print_bad_pte+0x1f5/0x20d
[23550.195553]  [810db083] ? unmap_vmas+0x5fe/0x814
[23550.195601]  [810c68dd] ? __add_page_to_lru_list+0x53/0x53
[23550.195647]  [810df2de] ? unmap_region+0x9f/0x102
[23550.195694]  [8100d722] ? __switch_to+0x23b/0x2b1
[23550.195741]  [8103d870] ? pick_next_task_fair+0xfc/0x10c
[23550.195788]  [810463a2] ? finish_task_switch+0x53/0xc7
[23550.195832]  [810e01f7] ? do_munmap+0x281/0x2eb
[23550.195875]  [810e02a0] ? sys_munmap+0x3f/0x55
[23550.195921]  [8136e51c] ? system_call_fastpath+0x16/0x1b
[23550.195965] Disabling lock debugging due to kernel taint
[23550.196412] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.198303] swap_free: Unused swap offset entry 0001
[23550.198348] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.198424] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b83c09a0 index:380fe0082
[23550.198508] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.198558] Pid: 13813, comm: ibv_devinfo Tainted: GB  O 
3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.198637] Call Trace:
[23550.198680]  [810d9009] ? print_bad_pte+0x1f5/0x20d
[23550.198730]  [810db083] ? unmap_vmas+0x5fe/0x814
[23550.198775]  [810c68dd] ? __add_page_to_lru_list+0x53/0x53
[23550.198820]  [810df2de] ? unmap_region+0x9f/0x102
[23550.198865]  [8100d6b0] ? __switch_to+0x1c9/0x2b1
[23550.198913]  [8103d870] ? pick_next_task_fair+0xfc/0x10c
[23550.198959]  [810463a2] ? finish_task_switch+0x53/0xc7
[23550.199005]  [810e01f7] ? do_munmap+0x281/0x2eb
[23550.199052]  [810e02a0] ? sys_munmap+0x3f/0x55
[23550.199096]  [8136e51c] ? system_call_fastpath+0x16/0x1b
[23550.199766] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.201661] swap_free: Unused swap offset entry 0001
[23550.201706] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.201776] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b83c0ec0 index:380fdf882
[23550.201861] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.201908] Pid: 13813, comm: ibv_devinfo Tainted: GB  O 
3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.201990] Call Trace:
[23550.202032]  [810d9009] ? print_bad_pte+0x1f5/0x20d
[23550.202081]  [810db083] ? unmap_vmas+0x5fe/0x814
[23550.202125]  [810df2de] ? unmap_region+0x9f/0x102
[23550.202169]  [8100d6b0] ? __switch_to+0x1c9/0x2b1
[23550.202217]  [8103d870] ? pick_next_task_fair+0xfc/0x10c
[23550.202267]  [810463a2] ? finish_task_switch+0x53/0xc7
[23550.202312]  [810e01f7] ? do_munmap+0x281/0x2eb
[23550.202355]  [810e02a0] ? sys_munmap+0x3f/0x55
[23550.202398]  [8136e51c] ? system_call_fastpath+0x16/0x1b
[23550.202925] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.213336] swap_free: Unused swap offset entry 0001
[23550.213377] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.213448] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b6bd8ec0 index:380fdf082
[23550.213527] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.213573] Pid: 13813, comm: ibv_devinfo Tainted: GB  O 
3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.213651] 

Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread konrad wilk


On 10/21/2013 7:57 AM, Lukas Hejtmanek wrote:

Hello,

I'm trying to get SR-IOV working under Xen (4.2). It almost works except
memory bug. This is easily reproducible just in Dom0.

I have a ConnectX-3 card with the latest firmware and OFED 2.0-3 drivers. I
tried the 3.2 kernel from Debian, the 3.10 kernel from Debian, and a vanilla
3.11.5 kernel. All behave the same.


Ha! Funny you mention that. I had been looking at this.

As soon as I issue ibv_devinfo command, it produces the following messages
into dmesg. Problem is that with ib_rdma_bw command, I get more of those
messages and moreover, oom killer gets confused and kills almost all
processes.

[23502.645455] mlx4_core :06:00.0: mlx4_ib: Port 1 logical link is up
[23550.181907] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.183822] swap_free: Unused swap offset entry 0001
[23550.183868] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.183939] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b83c0480 index:380fe0882
[23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.195382] Pid: 13813, comm: ibv_devinfo Tainted: G   O 
3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.195461] Call Trace:
[23550.195508]  [810d9009] ? print_bad_pte+0x1f5/0x20d
[23550.195553]  [810db083] ? unmap_vmas+0x5fe/0x814
[23550.195601]  [810c68dd] ? __add_page_to_lru_list+0x53/0x53
[23550.195647]  [810df2de] ? unmap_region+0x9f/0x102
[23550.195694]  [8100d722] ? __switch_to+0x23b/0x2b1
[23550.195741]  [8103d870] ? pick_next_task_fair+0xfc/0x10c
[23550.195788]  [810463a2] ? finish_task_switch+0x53/0xc7
[23550.195832]  [810e01f7] ? do_munmap+0x281/0x2eb
[23550.195875]  [810e02a0] ? sys_munmap+0x3f/0x55
[23550.195921]  [8136e51c] ? system_call_fastpath+0x16/0x1b
[23550.195965] Disabling lock debugging due to kernel taint
[23550.196412] mlx4_ib check_flow_steering_support: Device managed flow 
steering is unavailable for IB port in multifunction env.
[23550.198303] swap_free: Unused swap offset entry 0001
[23550.198348] BUG: Bad page map in process ibv_devinfo  pte:0200 
pmd:1b7df4067
[23550.198424] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
(null) mapping:8801b83c09a0 index:380fe0082

..

this happens only if running under Xen. Native kernel in the same version is OK.

Is it a known bug or is something wrong with BIOS/firmware?

I believe it is a bug in the drivers. The issue is that the mapping created
for the second mmap call is done without VM_IO and on a PFN that is RAM (and
not the BAR). But I am not entirely sure; hopefully this week I will have a
better idea and a fix. Stay tuned.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Jan Beulich
 On 21.10.13 at 13:57, Lukas Hejtmanek xhejt...@ics.muni.cz wrote:
 I'm trying to get SR-IOV working under Xen (4.2). It almost works except
 memory bug. This is easily reproducible just in Dom0. 

So without any SR-IOV then, I suppose?

 [23502.645455] mlx4_core :06:00.0: mlx4_ib: Port 1 logical link is up
 [23550.181907] mlx4_ib check_flow_steering_support: Device managed flow 
 steering is unavailable for IB port in multifunction env.
 [23550.183822] swap_free: Unused swap offset entry 0001
 [23550.183868] BUG: Bad page map in process ibv_devinfo  pte:0200 
 pmd:1b7df4067
 [23550.183939] addr:7f7ef5e18000 vm_flags:400844fa anon_vma:  
 (null) mapping:8801b83c0480 index:380fe0882
 [23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]

Looking at ib_uverbs_mmap() and its necessary (for mlx4)
descendant mlx4_ib_mmap() I see that the latter calls
io_remap_pfn_range(), but afaict there's nowhere _PAGE_IOMAP
would get set here (as opposed to
arch/x86/pci/i386.c:pci_mmap_page_range() for example). Could
you check whether adding that flag helps? (I'm copying the kernel
maintainers so that they could correct me if I'm wrong here - it
would seem to me that this could equally be the reason for why
there are other reports of certain things not working as expected
in domains with more than 4Gb.)

You could also consider trying an openSUSE kernel - there, other
than upstream, there's no need for each and every caller of
io_remap_pfn_range() to take care of setting _PAGE_IOMAP (and
I vaguely recall having discussed this a couple of years back with
Konrad et al).
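
A minimal user-space sketch of what "adding that flag" amounts to. It assumes
_PAGE_IOMAP is software bit 10 of the x86 PTE, as I believe it was in kernels
of this era; prot_t, PROT_IOMAP, prot_mark_iomap() and prot_is_iomap() are
hypothetical names for illustration, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t prot_t;               /* stand-in for pgprot_t (hypothetical) */

#define PROT_IOMAP ((prot_t)1 << 10)   /* assumed bit position of _PAGE_IOMAP */

/* Return prot with the I/O-mapping bit set; all other bits are untouched. */
static prot_t prot_mark_iomap(prot_t prot)
{
    return prot | PROT_IOMAP;
}

/* True if the protection value is marked as an I/O mapping. */
static bool prot_is_iomap(prot_t prot)
{
    return (prot & PROT_IOMAP) != 0;
}
```

In the driver, the equivalent change would presumably be to OR the flag into
vma->vm_page_prot before the io_remap_pfn_range() call, as
pci_mmap_page_range() does for its own mappings.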

Jan



Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Jan Beulich
 On 21.10.13 at 14:59, konrad wilk konrad.w...@oracle.com wrote:
 It is a bug in the drivers I believe. The issue is that the mapping 
 created for the second mmap
 call is done without VM_IO and on an PFN that is RAM (and not the BAR). 

So while putting together the reply that I had sent to Lukas a
minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP
translation, and wasn't able to find it anywhere. As you say it
nevertheless exists - what am I overlooking (and why would then
pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP
by hand)?

Jan



Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread konrad wilk


On 10/21/2013 9:18 AM, Jan Beulich wrote:

On 21.10.13 at 14:59, konrad wilk konrad.w...@oracle.com wrote:

It is a bug in the drivers I believe. The issue is that the mapping
created for the second mmap
call is done without VM_IO and on an PFN that is RAM (and not the BAR).

So while putting together the reply that I had sent to Lukas a
minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP
translation, and wasn't able to find it anywhere. As you say it
nevertheless exists - what am I overlooking (and why would then
pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP
by hand)?


The P2M (arch/x86/xen/p2m.c) is consulted, which for the MMIO gaps and
E820_RESV has the MFNs set to the PFN. This is the 1-1 pfn/mfn stuff that
I implemented some time ago, as hpa was opposed to having _PAGE_IOMAP
stuck on every macro call to pgprot_writecombine|noncached|etc. Or perhaps
that was on the arch_something_prot.


Anyhow, the odd thing is that looking at the code:

        if (io_remap_pfn_range(vma, vma->vm_start,
                               to_mucontext(context)->uar.pfn +
                               dev->dev->caps.num_uars,
                               PAGE_SIZE, vma->vm_page_prot))

The PFN in question (uar.pfn) is set in mlx4_uar_alloc to:

        uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + offset;


So is the BAR not in the MMIO region? Or is it 64-bit MMIO that lies
outside the first 4GB, so that when the P2M is consulted it thinks it is
INVALID_P2M_ENTRY?
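
The arithmetic behind that question can be sketched in user space. This is
only an illustration: bar_to_pfn() and pfn_above_4gib() are hypothetical
helpers, 4 KiB pages are assumed, and any BAR address passed in is up to the
caller (none is taken from the report).

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 KiB pages, as on x86 */
#define FOUR_GIB   ((uint64_t)1 << 32)

/* Page frame number of the page `offset` pages into a BAR. */
static uint64_t bar_to_pfn(uint64_t bar_start, uint64_t offset)
{
    return (bar_start >> PAGE_SHIFT) + offset;
}

/* True if the frame lies at or above the 4 GiB boundary. */
static bool pfn_above_4gib(uint64_t pfn)
{
    return pfn >= (FOUR_GIB >> PAGE_SHIFT);
}
```

A 64-bit prefetchable BAR placed far above 4 GiB yields a PFN for which
pfn_above_4gib() is true, which is the case suspected here.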

Which comes back to the bug you (Jan) discovered when you pointed out that
PVH needs to set up MMIO entries for 64-bit MMIO regions, which can be
outside the 4GB region <sigh>. And that is something the pvops kernel
completely ignores, as it assumes that any region past the E820 can be
used for ballooning.


Anyhow, one easy thing to do is to get the lspci -v output from the
InfiniBand card to see where its BARs are, and also the start of the kernel
log. You should see an E820 map (please also boot with 'debug' on the Linux
command line).


Jan





Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread konrad wilk


On 10/21/2013 9:39 AM, konrad wilk wrote:


On 10/21/2013 9:18 AM, Jan Beulich wrote:

On 21.10.13 at 14:59, konrad wilk konrad.w...@oracle.com wrote:

It is a bug in the drivers I believe. The issue is that the mapping
created for the second mmap
call is done without VM_IO and on an PFN that is RAM (and not the BAR).

So while putting together the reply that I had sent to Lukas a
minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP
translation, and wasn't able to find it anywhere. As you say it
nevertheless exists - what am I overlooking (and why would then
pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP
by hand)?


The P2M (arch/x86/xen/p2m.c) is consulted, which for the MMIO gaps and
E820_RESV has the MFNs set to the PFN. This is the 1-1 pfn/mfn stuff that
I implemented some time ago, as hpa was opposed to having _PAGE_IOMAP
stuck on every macro call to pgprot_writecombine|noncached|etc. Or perhaps
that was on the arch_something_prot.


This is the one that Jeremy cooked up some time ago:
http://lkml.indiana.edu/hypermail/linux/kernel/1010.2/03012.html

And here was the thread:
http://www.spinics.net/lists/linux-rdma/msg07085.html

which I thought had been fixed by the P2M identity code.


Anyhow, the odd thing is that looking at the code:

        if (io_remap_pfn_range(vma, vma->vm_start,
                               to_mucontext(context)->uar.pfn +
                               dev->dev->caps.num_uars,
                               PAGE_SIZE, vma->vm_page_prot))

The PFN in question (uar.pfn) is set in mlx4_uar_alloc to:

        uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + offset;


So is the BAR not in the MMIO region? Or is it 64-bit MMIO that lies
outside the first 4GB, so that when the P2M is consulted it thinks it is
INVALID_P2M_ENTRY?

Which comes back to the bug you (Jan) discovered when you pointed out that
PVH needs to set up MMIO entries for 64-bit MMIO regions, which can be
outside the 4GB region <sigh>. And that is something the pvops kernel
completely ignores, as it assumes that any region past the E820 can be
used for ballooning.


Anyhow, one easy thing to do is to get the lspci -v output from the
InfiniBand card to see where its BARs are, and also the start of the kernel
log. You should see an E820 map (please also boot with 'debug' on the Linux
command line).


Jan







Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Lukas Hejtmanek
On Mon, Oct 21, 2013 at 09:39:33AM -0400, konrad wilk wrote:
 Anyhow, one easy thing to figure out is to get the lspci -v output
 from the InfiniBand card
 to see where its BARs are, and also the start of the kernel. You
 should see an E820 map (please also boot with
 debug on the Linux command line).

Note: adding _PAGE_IOMAP, as Jan suggested, fixed those memory errors.

Here is the lspci output for the card and its virtual functions.

06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies Device 0017
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
MAbort- SERR- PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at dfa0 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 380fff00 (64-bit, prefetchable) [size=8M]
Expansion ROM at df90 [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX353A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: MCX353A-QCBT 
[EC] Engineering changes: A4
[SN] Serial number: MT1327X00814
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A   
[YA] Asset tag: N/A 
[RW] Read-write area: 105 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 64ns, 
L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- 
TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 
unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- 
BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, 
Selectable De-emphasis: -6dB
 Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, 
EqualizationComplete+, EqualizationPhase1+
 EqualizationPhase2+, EqualizationPhase3-, 
LinkEqualizationRequest-
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number 00-02-c9-03-00-b6-fc-70
Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt 

Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Konrad Rzeszutek Wilk
On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
 On Mon, Oct 21, 2013 at 09:39:33AM -0400, konrad wilk wrote:
  Anyhow, one easy thing to figure out is to get the lspci -v output
  from the InfiniBand card
  to see where its BARs are, and also the start of the kernel. You
  should see an E820 map (please also boot with
  debug on the Linux command line).
 
 Note: adding _PAGE_IOMAP, as Jan suggested, fixed those memory errors.

nods Right.
 
 here is lspci from the card and its virtual functions.
 
 06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
 Subsystem: Mellanox Technologies Device 0017
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
 Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 42
 Region 0: Memory at dfa0 (64-bit, non-prefetchable) [size=1M]
 Region 2: Memory at 380fff00 (64-bit, prefetchable) [size=8M]

Wow.

 06:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 
 Virtual Function]
 Subsystem: Mellanox Technologies Device 61b0
 Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
 Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 0
 Region 2: [virtual] Memory at 380fdf00 (64-bit, prefetchable) 
 [size=8M]

Wow again.

.. snip..
 and this is from dmesg:
 
 [0.00] e820: BIOS-provided physical RAM map:
 [0.00] Xen: [mem 0x-0x00090fff] usable
 [0.00] Xen: [mem 0x00091800-0x000f] reserved
 [0.00] Xen: [mem 0x0010-0x7dd76fff] usable
 [0.00] Xen: [mem 0x7dd77000-0x7ddb5fff] reserved
 [0.00] Xen: [mem 0x7ddb6000-0x7debefff] ACPI data
 [0.00] Xen: [mem 0x7debf000-0x7e0dafff] ACPI NVS
 [0.00] Xen: [mem 0x7e0db000-0x7f357fff] reserved
 [0.00] Xen: [mem 0x7f358000-0x7f7f] ACPI NVS
 [0.00] Xen: [mem 0x8000-0x8fff] reserved
 [0.00] Xen: [mem 0xfec0-0xfec01fff] reserved
 [0.00] Xen: [mem 0xfec4-0xfec40fff] reserved
 [0.00] Xen: [mem 0xfed1c000-0xfed3] reserved
 [0.00] Xen: [mem 0xfee0-0xfee00fff] reserved
 [0.00] Xen: [mem 0xff00-0x] reserved
 [0.00] Xen: [mem 0x0001-0x00107fff] usable

Odd, there should be messages about 1-1 mapping when you use 'debug'.

But either way, the problem (bug) is what I suspected: we treat any region
past the E820 as INVALID_P2M_ENTRY, and hence any set_pte(..) operation
will fetch a 0 value, which in turn means that the PTE is zero (with the
0x200 _PAGE_SPECIAL b/c of VMA tracking).
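
A simplified user-space model of that failure, not the kernel's actual code.
struct toy_p2m and the toy_* helpers are hypothetical; the only values taken
from the report are the 0x200 special bit and the INVALID_P2M_ENTRY name.

```c
#include <stdint.h>

#define PAGE_SHIFT        12
#define PAGE_SPECIAL      ((uint64_t)1 << 9)   /* 0x200, as seen in the log */
#define INVALID_P2M_ENTRY (~(uint64_t)0)

/* Toy P2M: frames below ram_pages have a translation, the rest do not. */
struct toy_p2m {
    uint64_t ram_pages;   /* number of RAM frames known to the guest */
    uint64_t mfn_base;    /* machine frame backing pfn 0 (made up) */
};

static uint64_t toy_pfn_to_mfn(const struct toy_p2m *p2m, uint64_t pfn)
{
    return pfn < p2m->ram_pages ? p2m->mfn_base + pfn : INVALID_P2M_ENTRY;
}

/* Build a PTE value: an invalid lookup leaves no frame or flag bits, so
 * only the special bit added for VMA tracking remains. */
static uint64_t toy_make_special_pte(const struct toy_p2m *p2m,
                                     uint64_t pfn, uint64_t flags)
{
    uint64_t mfn = toy_pfn_to_mfn(p2m, pfn);

    if (mfn == INVALID_P2M_ENTRY)
        return PAGE_SPECIAL;
    return (mfn << PAGE_SHIFT) | flags | PAGE_SPECIAL;
}
```

With a PFN far beyond the toy RAM range the result is exactly 0x200, matching
the pte value printed in the bad page map report.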

Now the fix is to determine _where_ the end of real memory is so that we
can make sure that ballooning will work (in case of dom0_mem_max parameter).
And then anything past that PFN can be treated as IDENTITY_FRAME.

Naively, I think this patch would do it:

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 09f3059..3871554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
 
 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
 	}
+	/* Anything past the balloon area is marked as identity. */
+	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
+		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
 }
 
 static unsigned long __init xen_do_chunk(unsigned long start,

But this is not even compile tested :-(



Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Lukas Hejtmanek
On Mon, Oct 21, 2013 at 10:18:55AM -0400, Konrad Rzeszutek Wilk wrote:
 Odd, there should be messages about 1-1 mapping when you use 'debug'.

cat /proc/cmdline 
placeholder root=UUID=b5711e0a-3fc8-44ec-940f-112e60d8f143 ro debug

So I suppose I did it right. Maybe I didn't compile something important in?

-- 
Lukáš Hejtmánek


Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Jan Beulich
 On 21.10.13 at 16:18, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
 Region 2: Memory at 380fff00 (64-bit, prefetchable) [size=8M]
...
 --- a/arch/x86/xen/setup.c
 +++ b/arch/x86/xen/setup.c
 @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
  
 __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
 }
 +   /* Anything past the balloon area is marked as identity. */
 +   for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
 +   __set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));

Hardly - MAX_DOMAIN_PAGES derives from
CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
to where MMIO might be. Should you perhaps simply start from
an all 1:1 mapping, inserting the RAM translations as you find
them?
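
A toy sketch of that ordering, under loud assumptions: a flat array stands in
for the P2M (the real structure is a multi-level tree), IDENTITY_BIT is a
hypothetical marker in the spirit of IDENTITY_FRAME(), and both helper
functions are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define IDENTITY_BIT ((uint64_t)1 << 62)   /* hypothetical identity marker */

/* Step 1: mark every frame in the table as 1:1 (pfn == mfn). */
static void toy_p2m_init_identity(uint64_t *p2m, size_t nr)
{
    for (size_t pfn = 0; pfn < nr; pfn++)
        p2m[pfn] = (uint64_t)pfn | IDENTITY_BIT;
}

/* Step 2: overwrite a RAM range with its real machine frames. */
static void toy_p2m_add_ram(uint64_t *p2m, size_t nr,
                            size_t start, size_t count, uint64_t mfn_base)
{
    for (size_t i = 0; i < count && start + i < nr; i++)
        p2m[start + i] = mfn_base + i;
}
```

Anything not claimed as RAM in step 2 stays 1:1, so MMIO is covered no
matter where it sits, without needing an upper bound such as
MAX_DOMAIN_PAGES.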

Jan



RE: [PATCH RFC 2/9] IB/core: Introduce Signature Verbs API

2013-10-21 Thread Hefty, Sean
 The signature handover operation is binding all the necessary
 information for the HCA together: where is the data (data_mr), where is
 the protection information (prot_mr), what are the signature properties
 (sig_attrs).
 Once this step is taken (WR is posted), a single MR (sig_mr) describes
 the signature handover operation and can be used to perform RDMA under
 signature presence.
 Once the HCA will perform RDMA over this MR, it will take into account
 the signature context of the transaction and will follow the signature
 attributes configured.

It seems like this change loses the ability to use an SGL.  Why are the
signature properties separate from the protection information?


RE: [PATCH RFC 1/9] IB/core: Introduce indirect and protected memory regions

2013-10-21 Thread Hefty, Sean
 This MR can be perceived as a generalization of fast_reg MR. When using
 fast memory registration the verbs user will call ib_alloc_fast_reg_mr() in
 order to allocate an MR that may be used for fast registration method by
 posting
 a fast registration work-request on the send-queue (FRWR). The user does
 not pass any memory buffers to ib_alloc_fast_reg_mr() as the actual
 registration is done via posting WR. This follows the same notation, but
 allows new functionality (such as signature enable).

This makes sense to me.



Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Konrad Rzeszutek Wilk
On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
  On 21.10.13 at 16:18, Konrad Rzeszutek Wilk konrad.w...@oracle.com 
  wrote:
  On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
  Region 2: Memory at 380fff00 (64-bit, prefetchable) [size=8M]
 ...
  --- a/arch/x86/xen/setup.c
  +++ b/arch/x86/xen/setup.c
  @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
   
  __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
  }
  +   /* Anything past the balloon area is marked as identity. */
  +   for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
  +   __set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
 
 Hardly - MAX_DOMAIN_PAGES derives from
 CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
 to where MMIO might be. Should you perhaps simply start from

Looks like your mailer ate some words.

 an all 1:1 mapping, inserting the RAM translations as you find
 them?


Yeah, as this code can be called for the regions under 4GB. Definitely
needs more analysis.

Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?

 
 Jan
 


Re: [Xen-devel] BUG: bad page map under Xen

2013-10-21 Thread Jan Beulich
 On 21.10.13 at 16:44, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
  On 21.10.13 at 16:18, Konrad Rzeszutek Wilk konrad.w...@oracle.com 
  wrote:
  On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
  Region 2: Memory at 380fff00 (64-bit, prefetchable) [size=8M]
 ...
  --- a/arch/x86/xen/setup.c
  +++ b/arch/x86/xen/setup.c
  @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
   
  __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
  }
  +   /* Anything past the balloon area is marked as identity. */
  +   for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
  +   __set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
 
 Hardly - MAX_DOMAIN_PAGES derives from
 CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
 to where MMIO might be. Should you perhaps simply start from
 
 Looks like your mailer ate some words.

I don't think so - they're all there in the text you quoted.

 an all 1:1 mapping, inserting the RAM translations as you find
 them?
 
 
 Yeah, as this code can be called for the regions under 4GB. Definitely
 needs more analysis.
 
 Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?

That was for PVH, and is obviously fragile, as there can be MMIO
regions not matched by any PCI device's BAR. We could hope for
all of them to be below 4Gb, but I think (based on logs I got to see
recently from a certain vendor's upcoming systems) this isn't going
to work out.

Jan



Re: [PATCH RFC 2/9] IB/core: Introduce Signature Verbs API

2013-10-21 Thread Sagi Grimberg

On 10/21/2013 5:34 PM, Hefty, Sean wrote:

The signature handover operation is binding all the necessary
information for the HCA together: where is the data (data_mr), where is
the protection information (prot_mr), what are the signature properties
(sig_attrs).
Once this step is taken (WR is posted), a single MR (sig_mr) describes
the signature handover operation and can be used to perform RDMA under
signature presence.
Once the HCA will perform RDMA over this MR, it will take into account
the signature context of the transaction and will follow the signature
attributes configured.

It seems like this change loses the ability to use an SGL.


I don't think so.
A signature MR simply describes a signature-associated memory region, i.e.
a memory region that also defines some signature operation offload aside
from normal RDMA (for example validate & strip).
SGLs are used to publish several rkeys for the server/target/peer to
perform RDMA on each. In this case the user previously registered each MR
that he wishes his peer to RDMA over.
Same story here: if the user has several signature-associated MRs that he
wishes his peer to RDMA over (in a protected manner), he can use these
rkeys to construct an SGL.


   Why are the signature properties separate from the protection information?


Well,
protection information is the actual protection block guards of the data
(i.e. CRCs, XORs, DIFs etc.), while the signature properties structure is
the descriptor telling the HCA how to treat/validate/generate the
protection information.


Note that signature support requires the HCA to be able to support INSERT
operations. In that case there is no protection information, and the HCA
is asked to generate it and add it to the data stream (which may be
incoming or outgoing).

Hope this helps.

Sagi.


Re: [PATCH 3/9] IB/core: Don't include command header size in uverbs create_flow

2013-10-21 Thread Roland Dreier
On Thu, Oct 17, 2013 at 8:29 AM, Yann Droneaud ydrone...@opteya.com wrote:
 Just to double check, by no means we are talking on reverting the
 whole series, flow-steering will be in
 3.12 in the IB core and mlx4_ib

Correct, just hiding the unstable userspace ABI for now.

 I believe it will be disabled by #if 0 / #endif.
 It could be protected by #ifdef
 CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
 instead, but I don't know if it's acceptable practice for such a piece of code
 that should be exposed in its current state.

I like the Kconfig option -- I just made a patch that does that (and
made the option depend on STAGING), I'll post it for comments.

 - R.
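
For readers following along, the general shape of the Kconfig approach
can be sketched in standalone C as below. Everything with a demo_ or
DEMO_ prefix is a hypothetical stand-in and not the uverbs code; the
error code returned for a disabled command is this sketch's own choice.
The idea is that the handlers and their table entries are compiled only
when the config symbol is defined, so a disabled command slot stays NULL
and the dispatcher rejects it.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a Kconfig symbol. Uncomment (or pass
 * -DCONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING) to build the gated commands.
 */
/* #define CONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING */

enum demo_cmd {
	DEMO_CMD_OPEN_QP,
	DEMO_CMD_CREATE_FLOW,
	DEMO_CMD_DESTROY_FLOW,
	DEMO_CMD_MAX
};

static int demo_open_qp(int arg)
{
	return arg;
}

#ifdef CONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING
static int demo_create_flow(int arg)
{
	return arg + 1;
}

static int demo_destroy_flow(int arg)
{
	return arg - 1;
}
#endif /* CONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING */

/* Slots without an initializer are NULL, i.e. "command not available". */
static int (*const demo_cmd_table[DEMO_CMD_MAX])(int) = {
	[DEMO_CMD_OPEN_QP]      = demo_open_qp,
#ifdef CONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING
	[DEMO_CMD_CREATE_FLOW]  = demo_create_flow,
	[DEMO_CMD_DESTROY_FLOW] = demo_destroy_flow,
#endif /* CONFIG_DEMO_EXPERIMENTAL_FLOW_STEERING */
};

/* Reject out-of-range commands and commands compiled out of the table. */
static int demo_dispatch(unsigned int cmd, int arg)
{
	if (cmd >= DEMO_CMD_MAX || !demo_cmd_table[cmd])
		return -EINVAL;
	return demo_cmd_table[cmd](arg);
}
```

With the symbol left undefined, demo_dispatch(DEMO_CMD_CREATE_FLOW, ...)
fails with -EINVAL, while the command numbers themselves stay reserved
in the enum.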


[PATCH/RFC] IB/core: Temporarily disable create_flow/destroy_flow uverbs

2013-10-21 Thread Roland Dreier
From: Yann Droneaud ydrone...@opteya.com

The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.

So userspace is not exposed to experimental features and an unstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).

The feature will be enabled after proper cleanup for v3.13.

Signed-off-by: Yann Droneaud ydrone...@opteya.com
Link: http://marc.info/?i=cover.1381351016.git.ydrone...@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydrone...@opteya.com

[ Add a Kconfig option to reenable these verbs.  - Roland ]

Signed-off-by: Roland Dreier rol...@purestorage.com
---
 drivers/infiniband/Kconfig            | 11 +++++++++++
 drivers/infiniband/core/uverbs.h      |  2 ++
 drivers/infiniband/core/uverbs_cmd.c  |  4 ++++
 drivers/infiniband/core/uverbs_main.c |  6 ++++++
 drivers/infiniband/hw/mlx4/main.c     |  2 ++
 include/uapi/rdma/ib_user_verbs.h     |  6 ++++++
 6 files changed, 31 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 5ceda710f516..b84791f03a27 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -31,6 +31,17 @@ config INFINIBAND_USER_ACCESS
  libibverbs, libibcm and a hardware driver library from
  http://www.openfabrics.org/git/.
 
+config INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
+   bool "Experimental and unstable ABI for userspace access to flow steering verbs"
+   depends on INFINIBAND_USER_ACCESS
+   depends on STAGING
+   ---help---
+ The final ABI for userspace access to flow steering verbs
+ has not been defined.  To use the current ABI, *WHICH WILL
+ CHANGE IN THE FUTURE*, say Y here.
+
+ If unsure, say N.
+
 config INFINIBAND_USER_MEM
bool
depends on INFINIBAND_USER_ACCESS != n
diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index d040b877475f..d8f9c6c272d7 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -217,7 +217,9 @@ IB_UVERBS_DECLARE_CMD(destroy_srq);
 IB_UVERBS_DECLARE_CMD(create_xsrq);
 IB_UVERBS_DECLARE_CMD(open_xrcd);
 IB_UVERBS_DECLARE_CMD(close_xrcd);
+#ifdef CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
 IB_UVERBS_DECLARE_CMD(create_flow);
 IB_UVERBS_DECLARE_CMD(destroy_flow);
+#endif /* CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING */
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index f2b81b9ee0d6..2f0f01b70e3b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -54,7 +54,9 @@ static struct uverbs_lock_class qp_lock_class = { .name = "QP-uobj" };
 static struct uverbs_lock_class ah_lock_class  = { .name = "AH-uobj" };
 static struct uverbs_lock_class srq_lock_class = { .name = "SRQ-uobj" };
 static struct uverbs_lock_class xrcd_lock_class = { .name = "XRCD-uobj" };
+#ifdef CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
 static struct uverbs_lock_class rule_lock_class = { .name = "RULE-uobj" };
+#endif /* CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING */
 
 #define INIT_UDATA(udata, ibuf, obuf, ilen, olen)  \
do {\
@@ -2599,6 +2601,7 @@ out_put:
return ret ? ret : in_len;
 }
 
+#ifdef CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
 static int kern_spec_to_ib_spec(struct ib_kern_spec *kern_spec,
union ib_flow_spec *ib_spec)
 {
@@ -2824,6 +2827,7 @@ ssize_t ib_uverbs_destroy_flow(struct ib_uverbs_file *file,
 
return ret ? ret : in_len;
 }
+#endif /* CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING */
 
 static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
struct ib_uverbs_create_xsrq *cmd,
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 75ad86c4abf8..2df31f68ea09 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -115,8 +115,10 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
[IB_USER_VERBS_CMD_CLOSE_XRCD]  = ib_uverbs_close_xrcd,
[IB_USER_VERBS_CMD_CREATE_XSRQ] = ib_uverbs_create_xsrq,
[IB_USER_VERBS_CMD_OPEN_QP] = ib_uverbs_open_qp,
+#ifdef CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING
[IB_USER_VERBS_CMD_CREATE_FLOW] = ib_uverbs_create_flow,
[IB_USER_VERBS_CMD_DESTROY_FLOW]= ib_uverbs_destroy_flow
+#endif /* CONFIG_INFINIBAND_EXPERIMENTAL_UVERBS_FLOW_STEERING */
 };
 
 static void ib_uverbs_add_one(struct ib_device *device);
@@ -605,6 +607,7 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull