[PATCH] Added another USB product ID for ELAN touchscreen quirks.

2015-05-09 Thread Logan Gunthorpe
I've had the same issue as described in commit

c68929f75dfcb6354918862b91b5778585de1fa5

Except my touchscreen's ID is

ID 04f3:0125 Elan Microelectronics Corp.

Signed-off-by: Logan Gunthorpe log...@deltatee.com
---
 drivers/usb/core/quirks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
index 41e510a..d85abfe 100644
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -106,6 +106,9 @@ static const struct usb_device_id usb_quirk_list[] = {
{ USB_DEVICE(0x04f3, 0x010c), .driver_info =
USB_QUIRK_DEVICE_QUALIFIER },
 
+   { USB_DEVICE(0x04f3, 0x0125), .driver_info =
+   USB_QUIRK_DEVICE_QUALIFIER },
+
{ USB_DEVICE(0x04f3, 0x016f), .driver_info =
USB_QUIRK_DEVICE_QUALIFIER },
 
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings

2015-10-07 Thread Logan Gunthorpe

Hi Dan,

We've uncovered another issue during testing with these patches. We get 
a kernel panic sometimes just while using a DAX filesystem. I've traced 
the issue back to this patch. (There's a stack trace at the end of this 
email.)


On 22/09/15 10:42 PM, Dan Williams wrote:

+static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
+{
+   struct block_device *bdev = bh->b_bdev;
+   struct request_queue *q = bdev->bd_queue;
+
+   if (IS_ERR(addr))
+   return;
+   blk_dax_put(q);
  }

@@ -127,9 +159,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
if (pos == bh_max) {
bh->b_size = PAGE_ALIGN(end - pos);
bh->b_state = 0;
-   retval = get_block(inode, block, bh,
-  iov_iter_rw(iter) == WRITE);
-   if (retval)
+   rc = get_block(inode, block, bh, rw == WRITE);
+   if (rc)
break;
if (!buffer_size_valid(bh))
bh->b_size = 1 << blkbits;
@@ -178,8 +213,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,

if (need_wmb)
wmb_pmem();
+   dax_unmap_bh(bh, kmap);

-   return (pos == start) ? retval : pos - start;
+   return (pos == start) ? rc : pos - start;
  }


The problem is if get_block fails and returns an error code, it will 
still call dax_unmap_bh which tries to dereference bh->b_bdev. However, 
seeing get_block failed, that pointer is NULL. Maybe a null check in 
dax_unmap_bh would be sufficient?



Thanks,

Logan

--


[   35.391790] BUG: unable to handle kernel NULL pointer dereference at 
00a0
[   35.393306] IP: [] dax_unmap_bh+0x41/0x70
[   35.394253] PGD 7c7ed067 PUD 7c01f067 PMD 0
[   35.395020] Oops:  [#1] SMP
[   35.395597] Modules linked in: mtramonb(O)
[   35.396320] CPU: 0 PID: 1501 Comm: dd Tainted: G   O4.3.0-rc2+ 
#51
[   35.397500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   35.399728] task: 88007bc6d600 ti: 88007925 task.ti: 
88007925
[   35.401006] RIP: 0010:[]  [] 
dax_unmap_bh+0x41/0x70
[   35.402402] RSP: 0018:880079253ab8  EFLAGS: 00010246
[   35.403321] RAX:  RBX:  RCX: 0012
[   35.404550] RDX: 0007 RSI: 0246 RDI: 8181f630
[   35.405783] RBP: 880079253ac8 R08: 000a R09: fffe
[   35.407011] R10: 88007fc1ad90 R11: 6abc R12: fffb
[   35.408237] R13: 0038e000 R14: 0038e000 R15: 0038e000
[   35.409466] FS:  7f0e81476700() GS:88007fc0() 
knlGS:
[   35.410830] CS:  0010 DS:  ES:  CR0: 80050033
[   35.411796] CR2: 00a0 CR3: 7926f000 CR4: 06f0
[   35.412990] Stack:
[   35.413340]  ffe4 880079253cf0 880079253be0 
811d2259
[   35.414670]  0246 81203120 880079253e68 

[   35.415981]  00017c394800 0038e000 0001 
fffb
[   35.417298] Call Trace:
[   35.417727]  [] dax_do_io+0x199/0x700
[   35.418615]  [] ? _ext4_get_block+0x200/0x200
[   35.419819]  [] ? jbd2_journal_stop+0x60/0x390
[   35.420886]  [] ext4_ind_direct_IO+0x8d/0x410
[   35.421908]  [] ext4_direct_IO+0x2da/0x540
[   35.422859]  [] ? ext4_dirty_inode+0x5c/0x70
[   35.423841]  [] generic_file_direct_write+0xaa/0x170
[   35.424931]  [] __generic_file_write_iter+0xc2/0x1f0
[   35.426027]  [] ext4_file_write_iter+0x13b/0x420
[   35.427066]  [] ? pick_next_entity+0xb2/0x190
[   35.428061]  [] __vfs_write+0xa7/0xf0
[   35.428940]  [] vfs_write+0xa9/0x190
[   35.429810]  [] ? vfs_read+0x114/0x130
[   35.430706]  [] SyS_write+0x46/0xa0
[   35.431561]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[   35.432637] Code: fe 48 c7 c7 4c 63 7e 81 e8 11 ec f5 ff 48 8b 5b 30 48 c7 c7 52 
63 7e 81 31 c0 48 89 de e8 fc eb f5 ff 31 c0 48 c7 c7 30 f6 81 81 <48> 8b 9b a0 
00 00 00 e8 e7 eb f5 ff 49 81 fc 00 f0 ff ff 77 08
[   35.437029] RIP  [] dax_unmap_bh+0x41/0x70
[   35.438005]  RSP 
[   35.438608] CR2: 00a0
[   35.439194] ---[ end trace 323695b29b46dd96 ]---


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup

2015-10-02 Thread Logan Gunthorpe
Hi Dan,

We've been doing some experimenting and testing with this patchset.
Specifically, we are trying to use you're ZONE_DEVICE work to enable
peer to peer PCIe transfers. This is actually working pretty well
(though we're still testing and working through some things).

However, we've found a couple of issues:

On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..20097e7b679a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -49,12 +49,16 @@ struct page {
>* updated asynchronously */
>   union {
>   struct address_space *mapping;  /* If low bit clear, points to
> -  * inode address_space, or NULL.
> +  * inode address_space, unless
> +  * the page is in ZONE_DEVICE
> +  * then it points to its parent
> +  * dev_pagemap, otherwise NULL.
>* If page mapped as anonymous
>* memory, low bit is set, and
>* it points to anon_vma object:
>* see PAGE_MAPPING_ANON below.
>*/
> + struct dev_pagemap *pgmap;
>   void *s_mem;/* slab first object */
>   };


When you add to this union and overide the mapping value, we see bugs
in calls to set_page_dirty when it tries to dereference mapping. I believe
a change to page_mapping is required such as the patch that's at the end of
this email.


> diff --git a/mm/gup.c b/mm/gup.c
> index a798293fc648..1064e9a489a4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -98,7 +98,16 @@ retry:
>   }
>
>   page = vm_normal_page(vma, address, pte);
> - if (unlikely(!page)) {
> + if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> + /*
> +  * Only return device mapping pages in the FOLL_GET case since
> +  * they are only valid while holding the pgmap reference.
> +  */
> + if (get_dev_pagemap(pte_pfn(pte), NULL))
> + page = pte_page(pte);
> + else
> + goto no_page;
> + } else if (unlikely(!page)) {

I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
create the pagemap (using devm_register_pagemap) then the get_user_pages code
will go into an infinite loop. I'm not really sure if this as an issue or
not but it seems a bit undesirable for a buggy driver to be able to cause this.

My thoughts are that either devm_register_pagemap needs to be done by
devm_memremap_pages so a driver cannot use one without the other,
or the GUP code needs to return EFAULT if no pagemap was registered so
it doesn't loop forever.

Thanks!

Logan



diff --git a/mm/util.c b/mm/util.c
index 68ff8a5..19af683 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -368,6 +368,9 @@ struct address_space *page_mapping(struct page *page)
return swap_address_space(entry);
}

+   if (unlikely(is_zone_device_page(page)))
+   return NULL;
+
mapping = (unsigned long)page->mapping;
if (mapping & PAGE_MAPPING_FLAGS)
return NULL;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup

2015-10-02 Thread Logan Gunthorpe

Hi Dan,

Good to know you've already addressed the struct page issue. We'll watch 
out for an updated patchset to try.



On 02/10/15 03:53 PM, Dan Williams wrote:

Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
the test report is welcome nonetheless.  The definition of dma_addr_t
is the device view of host memory, not necessarily the device view of
a peer device's memory range, so I expect you'll run into issues with
IOMMUs and other parts of the kernel that assume this definition.


Yeah, we've actually been doing this with a number of more "hacky" 
techniques for some time. ZONE_DEVICE just provides us with a much 
cleaner way to set this up that doesn't require patching around 
get_user_pages in various places in the kernel.


We've never had any issues with the IOMMU getting in the way (at least 
on Intel x86). My understanding always was that the IOMMU sits between a 
PCI card and main memory; it doesn't get in the way of peer-to-peer 
transfers. Though admittedly, I don't have a complete understanding of 
how the IOMMU works in the kernel. I'm just speaking from experimental 
experience. We've never actually tried this on other architectures.


Thanks,

Logan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup

2015-10-02 Thread Logan Gunthorpe



On 02/10/15 03:53 PM, Dan Williams wrote:

Yes, this location for dev_pagemap will not work.  I've since moved it
to a union with the lru list_head since ZONE_DEVICE pages memory
should always have an elevated page count and never land on a slab
allocator lru.


Oh, also, I was actually hoping to make use of the lru list_head in the 
future with ZONE_DEVICE memory. One thought I had was once we have a 
PCIe device with a BAR space, we'd then need to have a way of allocating 
these buffers when user space needs them. The simple way I was thinking 
was to just use the lru list head to store lists of used and unused 
pages -- though there are probably other solutions to this that don't 
require using struct pages.


Thanks,

Logan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] ntb_transport: Check the number of spads the hardware supports

2016-06-07 Thread Logan Gunthorpe
Hi Jon,

Thanks for the feedback. I'll send an updated patch in a moment.

On 04/06/16 09:40 AM, Jon Mason wrote:
> Nit, please add spaces around '*' (per checkpatch)

I'll change this, but I did run it through checkpatch and it did not
warn about this.

> Please explicitly point out that this is being modified in the commit
> message.  I don't see them being used, so probably not a big deal
> (unless Dave Jiang has something queued that will use it).

Done. I feel like he can always add them back in when he adds the
functionality. This way, when he does, MAX_SPAD will be updated and the
check will still be correct.

> Move this check above the dev_to_node assignment above.

Done.


Logan


Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-12 Thread Logan Gunthorpe

Hey Rafael,

Awesome, this patch fixes the problem! Nice work.

Thanks,

Logan

On 12/06/16 08:31 AM, Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 10:48:08 PM Logan Gunthorpe wrote:

Hey,


Hi,


On 11/06/16 07:05 PM, Rafael J. Wysocki wrote:

1) Commit ab76f7b4ab only extends the NX bit between __ex_table and
rodata; which, by my understanding, shouldn't be used by anything. And
__ex_table and rodata are fixed by the kernel's binary so both symbols
should be the same in both the image kernel and the boot kernel given
that both are running from the same binary.


Well, what if the kernel is relocated?


Ah, I'm sure I don't fully grasp the implications of that but I would
assume that if the image kernel were located somewhere else it would
still be far away from the boot kernel's ex_table/rodata boundary.


Probably.


2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though,
when it's in place, it only works some of the time. Given that commit is
only extending the NX region a bit, if there is some random mismatch,
why does it never reach rodata? In other words, why is rodata a magic
line that seems to work all the time -- why doesn't this random mismatch
ever extend into the rodata region? rodata isn't _that_ far away from
the end of ex_table.


That's a very good question. :-)


Yeah, I guess if we knew the answer we'd understand what was going on
and have a fix.


Right.

Appended is one more patch to try.

It actually fixes a theoretical problem, so I'll need to add comments to it
(as it is far from obvious IMO) and a changelog etc and post it as a proper
submission.

So the concern is that the page copying done in the loop in core_restore_code()
may corrupt the kernel text part of the temporary memory mapping used at that
time, because that's the original kernel text mapping from the boot kernel.

That doesn't matter for the loop itself (as that code runs from a "safe" page
guaranteed not to be overwritten), but it (quite theoretically) may matter for
the final jump to the image kernel's restore_registers().  [I realized that
it might be a problem only after I had started to think about the problem
you reported.]

As a bonus, the patch also eliminates the possible concern about the
executability of the memory mapped via the kernel text mapping in the boot
kernel, so IMO it's worth to give it a shot.  I've tested it lightly on
one machine, but I guess it would just crash right away if there were
any problems in it.

Thanks,
Rafael


---
  arch/x86/power/hibernate_64.c |   46 
++
  arch/x86/power/hibernate_asm_64.S |   28 +--
  2 files changed, 58 insertions(+), 16 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
   * Address to jump to in the last phase of restore in order to get to the 
image
   * kernel's text (this value is passed in the image header).
   */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;

  /*
   * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,6 +38,9 @@ unsigned long restore_cr3 __visible;

  pgd_t *temp_level4_pgt __visible;

+void *restore_pgd_addr __visible;
+pgd_t restore_pgd __visible;
+
  void *relocated_restore_code __visible;

  static void *alloc_pgt_page(void *context)
@@ -44,6 +48,33 @@ static void *alloc_pgt_page(void *contex
return (void *)get_safe_page(GFP_ATOMIC);
  }

+static int prepare_temporary_text_mapping(void)
+{
+   unsigned long vaddr = (unsigned long)restore_jump_address;
+   unsigned long paddr = jump_address_phys & PMD_MASK;
+   pmd_t *pmd;
+   pud_t *pud;
+
+   pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+   if (!pud)
+   return -ENOMEM;
+
+   restore_pgd = __pgd(__pa(pud) | _KERNPG_TABLE);
+
+   pud += pud_index(vaddr);
+   pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+   if (!pmd)
+   return -ENOMEM;
+
+   set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+
+   pmd += pmd_index(vaddr);
+   set_pmd(pmd, __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
+
+   restore_pgd_addr = temp_level4_pgt + pgd_index(vaddr);
+   return 0;
+}
+
  static int set_up_temporary_mappings(void)
  {
struct x86_mapping_info info = {
@@ -63,6 +94,10 @@ static int set_up_temporary_mappings(voi
set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
init_level4_pgt[pgd_index(__START_KERNEL_map)]);

+   result = prepare_temporary_text_mapping();
+   if (result)
+   return result;
+
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
mstart = pf

Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-10 Thread Logan Gunthorpe
Hey,

On 10/06/16 12:09 PM, Kees Cook wrote:
>> restore_code: 880157c3b000
>> jump_addr: 81446be0
>>
>>
>> diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
>> index 009947d..6efedb7 100644
>> --- a/arch/x86/power/hibernate_64.c
>> +++ b/arch/x86/power/hibernate_64.c
>> @@ -92,6 +92,9 @@ int swsusp_arch_resume(void)
>> memcpy(relocated_restore_code, _restore_code,
>>_registers - _restore_code);
>>
>> +   pr_info("restore_code: %p\n", relocated_restore_code);
>> +   pr_info("jump_addr: %lx\n", restore_jump_address);
>> +
> 
> Also interesting would be the "relocated_restore_code" address, as
> well as a dump of /sys/kernel/debug/kernel_page_tables (from
> CONFIG_X86_PTDUMP).

Is that not what I printed? If not, can you give me a better hint as to
what you're looking for so I can spin another kernel? I'll also provide
the kernel_page_tables once I do that.

> I'm baffled by the problem, but the best I can understand is the the
> relocated_restore_code range isn't executable (which should be visible
> from finding it in /sys/kernel/debug/kernel_page_tables), but I don't
> see how to solve that since my original patch didn't work.

Yeah this is definitely a baffling problem.

Thanks,

Logan


Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-10 Thread Logan Gunthorpe


On 10/06/16 04:29 PM, Rafael J. Wysocki wrote:
> OK, I have a theory, but I need a bit of help.
> 
> This may be a dumb question, but I don't quite remember the answer readily.
> 
> Given a physical address, how do I get the corresponding virtual one under
> the kernel identity mapping?

Is that not just 'phys_to_virt(x)'??

Logan


Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-11 Thread Logan Gunthorpe

Hey Rafael,

Thank for looking into this. I tried the patch below applied to v4.6 and 
I still got a lockup on resume. Additionally there was a kernel warning 
at arch/x86/mm/pageattr.c:1414 change_page_attr_set_clr+0x2bb/0x440 and 
another one right afterwards at kernel/smp.c:416 
smp_call_function_many+0xb5/0x250.


Regardless, Ill try the other patch you sent shortly.

Logan

On 11/06/16 05:48 AM, Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 03:47:59 AM Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 02:13:45 AM Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 12:33:31 AM Rafael J. Wysocki wrote:

On Friday, June 10, 2016 04:28:01 PM Logan Gunthorpe wrote:


On 10/06/16 04:29 PM, Rafael J. Wysocki wrote:

OK, I have a theory, but I need a bit of help.

This may be a dumb question, but I don't quite remember the answer readily.

Given a physical address, how do I get the corresponding virtual one under
the kernel identity mapping?


Is that not just 'phys_to_virt(x)'??


Ah that.  Thanks!


So first, appended is a cleanup of the 64-bit hibernate/restore code that 
hopefully
will make it a bit clearer what happens there.

In particular, it follows from it quite clearly that the restore will only work
if the virtual address of the trampoline (_registers) in the image
kernel happens to match the virtual address of the same chunk of memory in the
boot kernel (and after it has switched to the temporary page tables).

That appears to be the case most of the time, or hibernation would randomly
break for people all over, but I'm not actually sure if that's guaranteed to
happen.  If not, that very well may explain unexpected failures in there.

If it is not guaranteed to happen, the solution would be to pass the physical
address of that memory from the image kernel to the boot kernel instead of its
virtual address,


OK, the appended patch does the trick for me.

Logan, can you please try it?  It shouldn't break things, but I'm wondering if
the problem you're having after commit ab76f7b4ab is still reproducible with it.


No, it won't help, and I think I know what's going on.

The temporary page tables set up by set_up_temporary_mappings() re-use the
original boot kernel's text mapping, so if the trampoline code address
falls into a non-executable page in that mapping, the boot kernel can't
pass control to the image kernel.

If that theory is correct, the patch below should help, but IMO it should
be applied regardless to eliminate this particular failure mode.

I'll prepare another patch to pass the physical address on top of this one.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH] x86 / hibernate: Ensure that 64-bit trampoline code is 
executable

During resume from hibernation, the boot kernel has to jump to
the image kernel's trampoline code to pass control to it.  On
x86_64 the address of that code is passed from the image kernel
to the boot kernel in the image header.

That address is interpreted with respect to the original boot
kernel's kernel text memory mapping, so if the page containing it
is not executable in that mapping, the jump to it will crash the
kernel, so prevent that from happening.

While at it, clean up the way in which the trampoline code address
is saved by the image kernel (assembly code is involved in that
unnecessarily etc).

Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
---
  arch/x86/power/hibernate_64.c |   15 ---
  arch/x86/power/hibernate_asm_64.S |3 ---
  2 files changed, 12 insertions(+), 6 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -12,6 +12,7 @@
  #include 
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -27,7 +28,7 @@ extern asmlinkage __visible int restore_
   * Address to jump to in the last phase of restore in order to get to the 
image
   * kernel's text (this value is passed in the image header).
   */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;

  /*
   * Value of the cr3 register from before the hibernation (this value is passed
@@ -91,6 +92,14 @@ int swsusp_arch_resume(void)
return -ENOMEM;
memcpy(relocated_restore_code, _restore_code,
   _registers - _restore_code);
+   /*
+* The temporary memory mapping set up by set_up_temporary_mappings()
+* above re-uses the original kernel text mapping.  If the page with
+* restore_jump_address is not executable in that mapping, it won't be
+* possible to pass control to the image kernel, so prevent that from
+* happening.
+*/
+   set_memory_x((unsigned long)restore_jump_address, 1);

restore_image();
return 0;
@@ -108,7 +117,7 @@ int pfn_is_nosave(unsigned long pfn)
 

Re: [PATCH 5/8] ntb_tool: BUG: Ensure the buffer size is large enough to return all spads

2016-06-11 Thread Logan Gunthorpe

Ok, I'll add a comment.

Logan

On 10/06/16 08:35 PM, Allen Hubbe wrote:

On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote:

On hardware with 32 scratchpad registers the spad field in ntb tool
could chop off the end. The maximum buffer size is increased from
256 to 15 times the number or scratchpads.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>


It could be marginally better if there was an explanation to accompany
the magic number 15, but it's not a big deal.  One might guess it has
something to do with the expected length of the formatted string.

Acked-by: Allen Hubbe <allen.hu...@emc.com>


---
  drivers/ntb/test/ntb_tool.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
index 4c01057..954e1d5 100644
--- a/drivers/ntb/test/ntb_tool.c
+++ b/drivers/ntb/test/ntb_tool.c
@@ -368,7 +368,9 @@ static ssize_t tool_spadfn_read(struct tool_ctx *tc, char 
__user *ubuf,
 if (!spad_read_fn)
 return -EINVAL;

-   buf_size = min_t(size_t, size, 0x100);
+   spad_count = ntb_spad_count(tc->ntb);
+
+   buf_size = min_t(size_t, size, spad_count * 15);

 buf = kmalloc(buf_size, GFP_KERNEL);
 if (!buf)
@@ -376,7 +378,6 @@ static ssize_t tool_spadfn_read(struct tool_ctx *tc, char 
__user *ubuf,

 pos = 0;

-   spad_count = ntb_spad_count(tc->ntb);
 for (i = 0; i < spad_count; ++i) {
 pos += scnprintf(buf + pos, buf_size - pos, "%d\t%#x\n",
  i, spad_read_fn(tc->ntb, i));
--
2.1.4

--
You received this message because you are subscribed to the Google Groups 
"linux-ntb" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to linux-ntb+unsubscr...@googlegroups.com.
To post to this group, send email to linux-...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/linux-ntb/d9488f2c946644c2b1258a78929d3543747283ec.1465598632.git.logang%40deltatee.com.
For more options, visit https://groups.google.com/d/optout.


Re: [PATCH 6/8] ntb_tool: Add link status file to debugfs

2016-06-11 Thread Logan Gunthorpe

Hey Allen,

Thanks for the feedback it's a bit more complicated but I don't object 
to that. I'll work something up on Monday.


I was trying to avoid adding link controls, but if we do, would you say 
the module should still enable the link when it's installed? Or would we 
have the user explicitly have to enable the link before using it?


Thanks,

Logan

On 10/06/16 08:27 PM, Allen Hubbe wrote:

On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote:

In order to more successfully script with ntb_tool it's useful to
have a link file to check the link status so that the script
doesn't use the other files until the link is up.

This commit adds a 'link' file to the debugfs directory which reads
0 or 1 depending on the link status. For scripting convenience, writing
will block until the link is up (discarding anything that was written).

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
  drivers/ntb/test/ntb_tool.c | 45 +
  1 file changed, 45 insertions(+)

diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
index 954e1d5..116352e 100644
--- a/drivers/ntb/test/ntb_tool.c
+++ b/drivers/ntb/test/ntb_tool.c
@@ -59,6 +59,12 @@
   *
   * Eg: check if clearing the doorbell mask generates an interrupt.
   *
+ * # Check the link status
+ * root@self# cat $DBG_DIR/link
+ *
+ * # Block until the link is up
+ * root@self# echo > $DBG_DIR/link


I think a file to get and set the link status is a good idea, but the
way it is done as proposed here is not in a similar style to other
ntb_tool operations.  Other operations simply read a register and
format the value, or scan a value and write a register.  Similarly, I
think the link status could be done in the same way: use the read file
operation to get the current status with ntb_link_is_up(), and use the
file write operation to enable or disable the link with
ntb_link_enable() and ntb_link_disable().

Waiting for link status is an interesting concept, too.  Really, one
might be interested in a change in link status, whether up or down.
What about a link event file that supports write to arm the event, and
read to block for the event.  Consider an implementation based on
.  It would be used in combination with the link
status file, above, as follows.

1: Write 1 to the event file.  This arms the event.
   - The event will be disarmed by the next tool_link_event().

2: The application may read the link status file if it is interested
in waiting for a particular event.

3. The application may wait for an event by reading the event file
   - The application will wait as long as the event is still armed.
   - If the event was disarmed before waiting, the application will not block.

4. The application should read the link status again.

In any case, I think it would be more expected and natural to block
while reading a file versus writing it.


+ *
   * # Set the doorbell mask
   * root@self# echo 's 1' > $DBG_DIR/mask
   *
@@ -127,6 +133,7 @@ struct tool_ctx {
 struct work_struct link_cleanup;
 bool link_is_up;
 struct delayed_work link_work;
+   wait_queue_head_t link_wq;
 int mw_count;
 struct tool_mw mws[MAX_MWS];
  };
@@ -237,6 +244,7 @@ static void tool_link_work(struct work_struct *work)
 "Error setting up memory windows: %d\n", rc);

 tc->link_is_up = true;
+   wake_up(>link_wq);
  }

  static void tool_link_cleanup(struct work_struct *work)
@@ -573,6 +581,39 @@ static TOOL_FOPS_RDWR(tool_peer_spad_fops,
   tool_peer_spad_read,
   tool_peer_spad_write);

+static ssize_t tool_link_read(struct file *filep, char __user *ubuf,
+ size_t size, loff_t *offp)
+{
+   struct tool_ctx *tc = filep->private_data;
+   char *buf;
+   ssize_t pos, rc;
+
+   buf = kmalloc(64, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   pos = scnprintf(buf, 64, "%d\n", tc->link_is_up);
+   rc = simple_read_from_buffer(ubuf, size, offp, buf, pos);
+
+   kfree(buf);
+
+   return rc;
+}
+
+static ssize_t tool_link_write(struct file *filep, const char __user *ubuf,
+  size_t size, loff_t *offp)
+{
+   struct tool_ctx *tc = filep->private_data;
+
+   if (wait_event_interruptible(tc->link_wq, tc->link_is_up))
+   return -ERESTART;
+
+   return size;
+}
+
+static TOOL_FOPS_RDWR(tool_link_fops,
+ tool_link_read,
+ tool_link_write);

  static ssize_t tool_mw_read(struct file *filep, char __user *ubuf,
 size_t size, loff_t *offp)
@@ -708,6 +749,9 @@ static void tool_setup_dbgfs(struct tool_ctx *tc)
 debugfs_create_file("peer_spad", S_IRUSR | S_IWUSR, tc->dbgfs,
 tc, _peer_spad_

Re: [PATCH 7/8] ntb_pingpong: Add a debugfs file to get the ping count

2016-06-11 Thread Logan Gunthorpe

The pp_debugfs_dir is already initialized by the module init function.
If it doesn't exist here, I think we should just return instead of
trying again.  It's also worth noting, though it is probably no harm,
the code here does not check debugfs_initialized().



+static int __init tool_init(void)


This should be pp_init() not tool_init().


Yup, this is just sloppy copying on my part. I copied from two different 
places. I'll fix these oversights.


Thanks,

Logan



Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-11 Thread Logan Gunthorpe

Hey Rafael,

I tried this patch as well and there was no change.

I have a couple tentative observations to make though. None of this is 
100% clear to me so please correct me if I'm wrong anywhere:


1) Commit ab76f7b4ab only extends the NX bit between __ex_table and 
rodata; which, by my understanding, shouldn't be used by anything. And 
__ex_table and rodata are fixed by the kernel's binary so both symbols 
should be the same in both the image kernel and the boot kernel given 
that both are running from the same binary.


2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though, 
when it's in place, it only works some of the time. Given that commit is 
only extending the NX region a bit, if there is some random mismatch, 
why does it never reach rodata? In other words, why is rodata a magic 
line that seems to work all the time -- why doesn't this random mismatch 
ever extend into the rodata region? rodata isn't _that_ far away from 
the end of ex_table.


Anyway, thanks again for looking into this.

Logan


On 10/06/16 07:47 PM, Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 02:13:45 AM Rafael J. Wysocki wrote:

On Saturday, June 11, 2016 12:33:31 AM Rafael J. Wysocki wrote:

On Friday, June 10, 2016 04:28:01 PM Logan Gunthorpe wrote:


On 10/06/16 04:29 PM, Rafael J. Wysocki wrote:

OK, I have a theory, but I need a bit of help.

This may be a dumb question, but I don't quite remember the answer readily.

Given a physical address, how do I get the corresponding virtual one under
the kernel identity mapping?


Is that not just 'phys_to_virt(x)'??


Ah that.  Thanks!


So first, appended is a cleanup of the 64-bit hibernate/restore code that 
hopefully
will make it a bit clearer what happens there.

In particular, it follows from it quite clearly that the restore will only work
if the virtual address of the trampoline (_registers) in the image
kernel happens to match the virtual address of the same chunk of memory in the
boot kernel (and after it has switched to the temporary page tables).

That appears to be the case most of the time, or hibernation would randomly
break for people all over, but I'm not actually sure if that's guaranteed to
happen.  If not, that very well may explain unexpected failures in there.

If it is not guaranteed to happen, the solution would be to pass the physical
address of that memory from the image kernel to the boot kernel instead of its
virtual address,


OK, the appended patch does the trick for me.

Logan, can you please try it?  It shouldn't break things, but I'm wondering if
the problem you're having after commit ab76f7b4ab is still reproducible with it.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH] x86 / hibernate: Pass physical address of the trampoline

During resume from hibernation, the boot kernel has to jump to
the image kernel's trampoline code to pass control to it.  On
x86_64 the address of that code is passed from the image kernel
to the boot kernel in the image header.

Currently, the way in which that address is saved by the image kernel
is not entirely straightforward (assembly code is involved in that
unnecessarily etc).  Moreover, the current code passes the virtual
address of the trampoline with the assumption that the image and boot
kernels will always use the same virtual addresses for the chunk of
physical address space occupied by the trampoline code.

In case that assumption is not valid, pass the physical address of
the trampoline code from the image kernel to the boot kernel and
update RESTORE_MAGIC to avoid situations in which the boot kernel
may use a different trampoline address passing convention from the
one used by the image kernel.

Also drop the assembly code chunk updating restore_jump_address
in swsusp_arch_suspend() as it is not necessary any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
---
  arch/x86/power/hibernate_64.c |8 
  arch/x86/power/hibernate_asm_64.S |3 ---
  2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,7 @@ extern asmlinkage __visible int restore_
   * Address to jump to in the last phase of restore in order to get to the 
image
   * kernel's text (this value is passed in the image header).
   */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;

  /*
   * Value of the cr3 register from before the hibernation (this value is passed
@@ -113,7 +113,7 @@ struct restore_data_record {
unsigned long magic;
  };

-#define RESTORE_MAGIC  0x0123456789ABCDEFUL
+#define RESTORE_MAGIC  0x0123456789ABCDF0UL

  /**
   *arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +126,7 @@ int arch_hibernation_head

Re: PROBLEM: Resume form hibernate broken by setting NX on gap

2016-06-11 Thread Logan Gunthorpe

Hey,

On 11/06/16 07:05 PM, Rafael J. Wysocki wrote:

1) Commit ab76f7b4ab only extends the NX bit between __ex_table and
rodata; which, by my understanding, shouldn't be used by anything. And
__ex_table and rodata are fixed by the kernel's binary so both symbols
should be the same in both the image kernel and the boot kernel given
that both are running from the same binary.


Well, what if the kernel is relocated?


Ah, I'm sure I don't fully grasp the implications of that but I would 
assume that if the image kernel were located somewhere else it would 
still be far away from the boot kernel's ex_table/rodata boundary.



2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though,
when it's in place, it only works some of the time. Given that commit is
only extending the NX region a bit, if there is some random mismatch,
why does it never reach rodata? In other words, why is rodata a magic
line that seems to work all the time -- why doesn't this random mismatch
ever extend into the rodata region? rodata isn't _that_ far away from
the end of ex_table.


That's a very good question. :-)


Yeah, I guess if we knew the answer we'd understand what was going on 
and have a fix.




Can you please check if the patch below makes any difference?


I'm afraid it's no different. The kernel still freezes on resume. 
Though, no warnings with this one.


Thanks,

Logan


Re: [PATCH 6/8] ntb_tool: Add link status file to debugfs

2016-06-14 Thread Logan Gunthorpe


On 14/06/16 09:45 AM, Allen Hubbe wrote:
> 
> Feel free to disregard my suggestion above.  I hope my comment has not cost 
> you too much time.
> 
> The way you have written it already, and used it in the self-test script is 
> much more concise.
> 
>>> + * root@self# echo > $DBG_DIR/link
> 
> Acked-by: allen.hu...@emc.com
> 
> 
> 
> Eventually, I think it would be useful to let ntb_tool enable and disable the 
> link.  In that case, it might also be useful in a test script to wait for 
> link down, not just link up.
> 
> What about this:
> 
> # Wait for the link to be up or down
> root@self# echo 1 > $DBG_DIR/link
> root@self# echo 0 > $DBG_DIR/link
> 
> It need not be a part of this patch, but eventually:
> 
> # Enable or disable the link
> root@self# echo 1 > $DBG_DIR/link_ctrl
> root@self# echo 0 > $DBG_DIR/link_ctrl
> 
> # Reading the link_ctrl file can also give the link status
> root@self# cat $DBG_DIR/link_ctrl
> 
> Finally, I wonder if the file called "link" in this patch should be called 
> "link_wait" or similar, so its purpose is obviously not for enabling and 
> disabling the link.
> 

Actually I've already implemented something similar to your original
suggestion. I'll be submitting a v2 of this set shortly.

Logan


Re: [PATCH 8/8] ntb_test: Add a selftest script for the NTB subsystem

2016-06-14 Thread Logan Gunthorpe


On 14/06/16 09:47 AM, Allen Hubbe wrote:
> From: Logan Gunthorpe
>> This script automates testing doorbells, scratchpads and memory windows
>> for an NTB device. It can be run locally, with the NTB looped
>> back to the same host or use SSH to remotely control the second host.
>>
>> In the single host case, the script just needs to be passed two
>> arguments: a PCI ID for each side of the link. In the two host case
>> the -r option must be used to specify the remote hostname (which must
>> be SSH accessible and should probably have ssh-keys exchanged).
>>
>> A sample run looks like this:
>>
>> $ sudo ./ntb_test.sh :03:00.1 :83:00.1 -p 29
>> Starting ntb_tool tests...
>> Running db tests on: :03:00.1 / :83:00.1
>>   Passed
>> Running db tests on: :83:00.1 / :03:00.1
>>   Passed
>> Running spad tests on: :03:00.1 / :83:00.1
>>   Passed
>> Running spad tests on: :83:00.1 / :03:00.1
>>   Passed
>> Running mw0 tests on: :03:00.1 / :83:00.1
>>   Passed
>> Running mw0 tests on: :83:00.1 / :03:00.1
>>   Passed
>> Running mw1 tests on: :03:00.1 / :83:00.1
>>   Passed
>> Running mw1 tests on: :83:00.1 / :03:00.1
>>   Passed
>>
>> Starting ntb_pingpong tests...
>> Running ping pong tests on: :03:00.1 / :83:00.1
>>   Passed
>>
>> Starting ntb_perf tests...
>> Running local perf test without DMA
>>   0: copied 536870912 bytes in 238205 usecs, 2253 MBytes/s
>>   Passed
>> Running remote perf test without DMA
>>   0: copied 536870912 bytes in 238205 usecs, 2253 MBytes/s
>>   Passed
>>
>> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
>> ---
>>  MAINTAINERS |   1 +
>>  tools/testing/selftests/ntb/ntb_test.sh | 386 
>> 
>>  2 files changed, 387 insertions(+)
>>  create mode 100755 tools/testing/selftests/ntb/ntb_test.sh
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 9c567a4..f178e7e 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -7846,6 +7846,7 @@ F: drivers/ntb/
>>  F:  drivers/net/ntb_netdev.c
>>  F:  include/linux/ntb.h
>>  F:  include/linux/ntb_transport.h
>> +F:  tools/testing/selftests/ntb/
>>
>>  NTB INTEL DRIVER
>>  M:  Jon Mason <jdma...@kudzu.us>
>> diff --git a/tools/testing/selftests/ntb/ntb_test.sh
>> b/tools/testing/selftests/ntb/ntb_test.sh
>> new file mode 100755
>> index 000..e4a89e9
>> --- /dev/null
>> +++ b/tools/testing/selftests/ntb/ntb_test.sh
>> @@ -0,0 +1,386 @@
>> +#!/bin/bash
>> +# Copyright (c) 2016 Microsemi. All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation; either version 2 of
>> +# the License, or (at your option) any later version.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# Author: Logan Gunthorpe <log...@deltatee.com>
>> +
>> +REMOTE_HOST=
>> +LIST_DEVS=FALSE
>> +
>> +DEBUGFS=${DEBUGFS-/sys/kernel/debug}
>> +
>> +PERF_RUN_ORDER=32
>> +MAX_MW_SIZE=0
>> +RUN_DMA_TESTS=
>> +DONT_CLEANUP=
>> +
>> +function show_help()
>> +{
>> +echo "Usage: $0 [OPTIONS] LOCAL_DEV REMOTE_DEV"
>> +echo "Run tests on a pair of NTB endpoints."
>> +echo
>> +echo "If the NTB device loops back to the same host then,"
>> +echo "just specifying the two PCI ids on the command line is"
>> +echo "sufficient. Otherwise, if the NTB link spans two hosts"
>> +echo "use the -r option to specify the hostname for the remote"
>> +echo "device. SSH will then be used to test the remote side."
>> +echo "An SSH key between the root users of the host would then"
>> +echo "be highly recommended."
>> +echo
>> +echo "Options:"
>> +echo "  -C  don't cleanup ntb modules on exit"
>> +echo "  -d  run dma tests"
>> +echo "  -h  show this help message"
>> +echo "  -l  

Re: [PATCH 8/8] ntb_test: Add a selftest script for the NTB subsystem

2016-06-14 Thread Logan Gunthorpe
On 14/06/16 08:16 AM, Shuah Khan wrote:
> On 06/14/2016 08:06 AM, Jon Mason wrote:
>> On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote:
>>> This script automates testing doorbells, scratchpads and memory windows
>>> for an NTB device. It can be run locally, with the NTB looped
>>> back to the same host or use SSH to remotely control the second host.
>>>
>>> In the single host case, the script just needs to be passed two
>>> arguments: a PCI ID for each side of the link. In the two host case
>>> the -r option must be used to specify the remote hostname (which must
>>> be SSH accessible and should probably have ssh-keys exchanged).
>>
>> I appreciate the work that you are putting in here, but test shell
>> scripts are not accepted into the kernel source.

Yeah, I wasn't aware of this rule. I previously did some work on a very
similar shell script for NVM fabrics. Though that hasn't made it to
upstream yet.

> I don't see any reason for this script to be not part of kernel selftests.
> I think it will be a good addition. We probably don't want to include it in
> the auto run of the selftest suite.
> Jon! I you would like to take this script through your ntb tree, here is
> my ack for the script for kselftest part.
> 
> Acked-by: Shuah Khan <shua...@osg.samsung.com>
> 

Thanks Shauh!

>> I think a better place for this to be shared would be on the github
>> account wiki, https://github.com/jonmason/ntb/wiki
>> In fact, I'd really like for someone to add some pages there on using
>> the ntb tools and testing.  If you are willing, I'd be most
>> appreciative.

I can probably make some time for this later in the week.


Logan



Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs

2016-06-14 Thread Logan Gunthorpe


On 14/06/16 01:33 PM, Allen Hubbe wrote:
>> diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
>> index cba31fd..9bebd0d 100644
>> --- a/drivers/ntb/test/ntb_tool.c
>> +++ b/drivers/ntb/test/ntb_tool.c
>> @@ -59,6 +59,13 @@
>>   *
>>   * Eg: check if clearing the doorbell mask generates an interrupt.
>>   *
>> + * # Check the link status
>> + * root@self# cat $DBG_DIR/link
>> + *
>> + * # Block until the link is up
>> + * root@self# echo Y > $DBG_DIR/link_event
>> + * root@self# cat $DBG_DIR/link_event
>> + *
>>   * # Set the doorbell mask
>>   * root@self# echo 's 1' > $DBG_DIR/mask
>>   *
>> @@ -126,7 +133,9 @@ struct tool_ctx {
>>  struct dentry *dbgfs;
>>  struct work_struct link_cleanup;
>>  bool link_is_up;
> 
> Really, link_is_up means "memory windows are configured."  This comes from 
> your earlier patch that introduced memory windows to ntb_tool.

Yes, this is technically true. However, I don't think the distinction is
necessary. The user only really cares whether everything is up and
usable -- not whether the link is just physically up or not.


>> +bool link_event;
>>  struct delayed_work link_work;
>> +wait_queue_head_t link_wq;
>>  int mw_count;
>>  struct tool_mw mws[MAX_MWS];
>>  };
>> @@ -237,6 +246,7 @@ static void tool_link_work(struct work_struct *work)
>>  "Error setting up memory windows: %d\n", rc);
>>
>>  tc->link_is_up = true;
> 
> In other words, "memory windows are configured" = true.

Technically, yes.

>> +wake_up(>link_wq);
>>  }
>>
>>  static void tool_link_cleanup(struct work_struct *work)
>> @@ -246,6 +256,9 @@ static void tool_link_cleanup(struct work_struct *work)
>>
>>  if (!tc->link_is_up)
>>  cancel_delayed_work_sync(>link_work);
>> +
>> +tc->link_is_up = false;
> 
> If this was never set false anywhere in the patch that added memory windows, 
> I wonder if there is a bug.

Yup, this looks like an oversight on my part. However, I don't think it
resulted in any noticeable bug seeing, at the time, the only way to
bring the link back down was to remove the module or the device. It is
only strictly necessary now that we have the 'link' file which can
control the link.

>> +wake_up(>link_wq);
>>  }
>>
>>  static void tool_link_event(void *ctx)
>> @@ -578,6 +591,95 @@ static TOOL_FOPS_RDWR(tool_peer_spad_fops,
>>tool_peer_spad_read,
>>tool_peer_spad_write);
>>
>> +static ssize_t tool_link_read(struct file *filep, char __user *ubuf,
>> +  size_t size, loff_t *offp)
>> +{
>> +struct tool_ctx *tc = filep->private_data;
>> +char buf[3];
>> +
>> +buf[0] = tc->link_is_up ? 'Y' : 'N';
> 
> I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb).

I disagree. Bad things will happen if the user waits on the event and
then immediately uses the memory windows. It will just be buggy and
racy. I can't see a situation where the user would want to wait for the
link to come up and not have everything in ntb_tool ready and usable.

>> +buf[1] = '\n';
>> +buf[2] = '\0';
>> +
>> +return simple_read_from_buffer(ubuf, size, offp, buf, 2);
>> +}
>> +
>> +static ssize_t tool_link_write(struct file *filep, const char __user *ubuf,
>> +   size_t size, loff_t *offp)
>> +{
>> +struct tool_ctx *tc = filep->private_data;
>> +char buf[32];
>> +size_t buf_size;
>> +bool val;
>> +int rc;
>> +
>> +buf_size = min(size, (sizeof(buf) - 1));
>> +if (copy_from_user(buf, ubuf, buf_size))
>> +return -EFAULT;
>> +
>> +buf[buf_size] = '\0';
>> +
>> +rc = strtobool(buf, );
>> +if (rc)
>> +return rc;
>> +
>> +if (val)
>> +ntb_link_enable(tc->ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
>> +else
>> +ntb_link_disable(tc->ntb);
>> +
>> +return size;
>> +}
>> +
>> +static TOOL_FOPS_RDWR(tool_link_fops,
>> +  tool_link_read,
>> +  tool_link_write);
>> +
>> +static ssize_t tool_link_event_read(struct file *filep, char __user *ubuf,
>> +size_t size, loff_t *offp)
>> +{
>> +struct tool_ctx *tc = filep->private_data;
>> +char buf[3];
>> +
>> +if (wait_event_interruptible(tc->link_wq,
>> + tc->link_is_up == tc->link_event))
> 
> I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb).

See above.

>> +return -ERESTART;
>> +
>> +buf[0] = tc->link_is_up ? 'Y' : 'N';
>> +buf[1] = '\n';
>> +buf[2] = '\0';
>> +
>> +return simple_read_from_buffer(ubuf, size, offp, buf, 2);
>> +}
>> +
>> +static ssize_t tool_link_event_write(struct file *filep,
>> + const char __user *ubuf,
>> + size_t size, loff_t *offp)
>> +{
>> +struct tool_ctx *tc = filep->private_data;
>> +char buf[32];
>> +size_t buf_size;
>> +

Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs

2016-06-15 Thread Logan Gunthorpe


On 15/06/16 10:02 AM, Allen Hubbe wrote:
>> My understanding is that ntb_tool is really just a test client to verify
>> the API and the hardware. I personally would not recommend it for any
>> real applications. As such, I don't think this philosophical argument
>> really matches that goal.
> 
> The purpose is to "verify the API and the hardware", not to support "real 
> applications."
> 
> The link status reported by the tool should be the link status reported by 
> "the API and the hardware," and not something else that might be convenient 
> for "my ntb_test script" or "anyone else trying to script with ntb_tool."  
> The primary purpose of ntb_tool is api-level debugging of hardware and 
> drivers, not scripting.
> 
> The problem with races in ntb_tool is due to auto-configuration of memory 
> windows in ntb_tool.  Instead of having ntb_tool setup the memory windows 
> automatically, maybe instead it should provide a file to control the memory 
> windows via debugfs.  Reading the file can format what is returned by 
> ntb_mw_get_range(), and writing the file can allocate a buffer and call 
> ntb_mw_set_trans(), or ntb_mw_clear_trans() and free the buffer.  Then, the 
> test script can wait for link up, then setup the memory windows, and then 
> finally proceed with the rest of the tests, and there would be no race.  
> There would be no confusion about what "link up" means, and ntb_tool would 
> more closely resemble the ntb.h api for memory windows.

Ok, well this is a good deal more complicated than it is now but I can
live with it. I'll work something up shortly.


> That's interesting about the hardware.  Maybe the driver for that particular 
> hardware should make sure that any translation register programming happens 
> before reporting link up to its clients.  Otherwise, ntb_transport will be 
> broken on that hardware.  The ntb_transport driver configures memory windows 
> the first time the link comes up, and only ever again if a different memory 
> window size is negotiated (unlikely).
> 
> There are two reasons for doing the configuration after link up in 
> ntb_transport.  First, it avoids consuming memory resources if the link never 
> comes up.  Second, ntb_transport negotiates with its peer how much of the 
> memory window will actually be used.  The ntb_perf tool is similar.

Hmm, yes I didn't notice that in ntb_transport. We'll have to make some
considerations for that in our driver.

Logan


Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs

2016-06-14 Thread Logan Gunthorpe


On 14/06/16 03:46 PM, Allen Hubbe wrote:
> The ntb_tool is intended to be a simple low level access to the ntb.h api.  
> As much as possible, I think ntb_tool should directly expose the ntb.h api 
> through debugfs, and not invent higher level concepts.

I really think practical concerns should override this. If we do it that
way then my ntb_test script wouldn't necessarily work reliably and we'd
just be asking for race conditions. (Especially if I moved the memory
window tests earlier.) Anyone else trying to script with ntb_tool would
run into the same problem.

Additionally, the link is up _and_ the hardware is configured/usable
isn't really that high level a concept or anything a user wouldn't
expect already.

My understanding is that ntb_tool is really just a test client to verify
the API and the hardware. I personally would not recommend it for any
real applications. As such, I don't think this philosophical argument
really matches that goal.


>>> If this was never set false anywhere in the patch that added memory 
>>> windows, I wonder if
>> there is a bug.
>>
>> Yup, this looks like an oversight on my part. However, I don't think it
>> resulted in any noticeable bug seeing, at the time, the only way to
>> bring the link back down was to remove the module or the device. It is
>> only strictly necessary now that we have the 'link' file which can
>> control the link.
> 
> Even without a file to control the link, any one side could be unloaded and 
> reloaded.  That also affects the link state on the side that stays loaded.  
> The side that stays loaded still needs to be sane when the link comes back up.

Yup, you're correct. If the other side of link goes down then
tc->link_is_up would be incorrect. So, yes, there may be a corner case
bug there. Though, seeing tc-link_is_up was only previously used to
cancel potentially queued delayed work it's probably pretty minor.

This was copied from ntb_perf which looks like it has the same issue.
I'll make a patch for that in v3.

>>> I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb).
>>
>> I disagree. Bad things will happen if the user waits on the event and
>> then immediately uses the memory windows. It will just be buggy and
>> racy. I can't see a situation where the user would want to wait for the
>> link to come up and not have everything in ntb_tool ready and usable.
> 
> The memory windows can be configured prior to link up.  They can be 
> configured when probing the device instead of waiting for link up.  Doing 
> memory window configuration in probe would simplify the driver, and there 
> would be no race.

I'm not sure this is true, especially considering all possible hardware.
It's certainly not true with the hardware I'm working with and I'd
assume that all the existing NTB clients configured their memory windows
on link up and not in probe for a good reason.


Logan


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Logan Gunthorpe

Hey Rafael,

This patch appears to be working on my laptop. Thanks.

Logan

On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:

On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:

On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:

On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:

On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <r...@rjwysocki.net> wrote:


Overall, we seem to be heading towards the "really weird" territory here.


So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?


The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.


I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).

So updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the im

Re: [PATCH v3 10/10] ntb_perf: clear link_is_up flag when the link goes down.

2016-06-15 Thread Logan Gunthorpe
Hey,

Actually, I have to retract this patch. After some more thorough testing
I'm finding an issue:

When you remove and re-install the ntb_perf module very quickly,
ntb_perf will occasionally miss the link up event. This is because the
link_cleanup work gets delayed long enough that it gets scheduled after
the link up event gets sent. It then cancels the link work that should
have occurred. Without this patch, it never happens because link_is_up
never returns to false.

I think the correct solution is to just remove the link_cleanup work and
do those actions immediately on receipt of the event. If there's
agreement on this I can re-spin it again.

Thanks,

Logan


On 15/06/16 03:33 PM, Jiang, Dave wrote:
> On Wed, 2016-06-15 at 15:26 -0600, Logan Gunthorpe wrote:
>> When the link goes down, the link_is_up flag did not return to
>> false. This could have caused some subtle corner case bugs
>> when the link goes up and down quickly.
>>
>> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
> 
> Acked-by: Dave Jiang <dave.ji...@intel.com>
> 
> And all the other ntb_perf patches since there were no additional
> changes. 
> 
>> ---
>>  drivers/ntb/test/ntb_perf.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/ntb/test/ntb_perf.c
>> b/drivers/ntb/test/ntb_perf.c
>> index f0784e5..ae9d1b2 100644
>> --- a/drivers/ntb/test/ntb_perf.c
>> +++ b/drivers/ntb/test/ntb_perf.c
>> @@ -557,6 +557,8 @@ static void perf_link_cleanup(struct work_struct
>> *work)
>>  
>>  if (!perf->link_is_up)
>>  cancel_delayed_work_sync(>link_work);
>> +
>> +perf->link_is_up = false;
>>  }
>>  
>>  static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
>> -- 
>> 2.1.4


Re: [PATCH v3 09/10] ntb_test: Add a selftest script for the NTB subsystem

2016-06-15 Thread Logan Gunthorpe


On 15/06/16 03:49 PM, Allen Hubbe wrote:
>> +function link_test()
>> +{
>> +LOC=$1
>> +REM=$2
>> +EXP=0
>> +
>> +echo "Running link tests on: $(basename $LOC) / $(basename $REM)"
>> +
>> +write_file "N" "$LOC/link"
>> +write_file "N" "$LOC/link_event"
> 
> If it fails to bring down the link, won't it just block waiting on link_event 
> and never make it to the next step of the test?
> 
>> +if [[ $(read_file "$REM/link") != "N" ]]; then
>> +echo "Expected remote link to be down in $REM/link" >&2
>> +exit -1
>> +fi
>> +
>> +write_file "Y" "$LOC/link"
>> +write_file "Y" "$LOC/link_event"
>> +
>> +echo "  Passed"
>> +}

Well, the test is really intended to ensure both sides of the link see
changes to the link status. If the driver is somehow buggy and the link
never goes down/up when requested there's little I can do here except
block forever. Unless we want to add a timeout to the link_event file
(which I'd rather not).

You'd have the same issue if, when bringing the link up for the first
time, the link does not come back.

Logan


Re: [PATCH v3 09/10] ntb_test: Add a selftest script for the NTB subsystem

2016-06-15 Thread Logan Gunthorpe


On 15/06/16 04:17 PM, Allen Hubbe wrote:
> This test should fail on Intel RP/TB topology (two cpu sharing one ntb).  The 
> link state is the link state of the secondary side pcie bus connected to the 
> secondary side cpu.  The link must be up in order for the secondary side cpu 
> to discover the ntb device, so the driver does not allow the link to be 
> disabled in such topology.

Ok, I wasn't aware of this.  But looking closer I think I have a better
solution:

ntb_link_disable should return -EINVAL if the hardware can't support
bringing the link down. If I add the error check to my tool_link_write
(which I neglected to do) then writing an "N" to $LOC/link will fail and
the test could be skipped. I'll make a v4 spin.

Logan

> A simple thing to do here might be:
> 
> write_file "N" "$LOC/link"
> sleep 1
> read_file "$REM/link"
> 
> You already have my Ack.  This minor issue can be fixed later if anyone 
> cares.  I don't think it is a big deal, just worth pointing out that the 
> script will hang here instead of report a failure.  If it is worth fixing 
> later, at that point we might also want to change this script to continue 
> with other tests instead of exit on the first failure.
> 


Re: [PATCH v3 10/10] ntb_perf: clear link_is_up flag when the link goes down.

2016-06-15 Thread Logan Gunthorpe
Hey,

On 15/06/16 04:24 PM, Jiang, Dave wrote:
> On Wed, 2016-06-15 at 16:20 -0600, Logan Gunthorpe wrote:
>> Hey,
>>
>> Actually, I have to retract this patch. After some more thorough
>> testing
>> I'm finding an issue:
>>
>> When you remove and re-install the ntb_perf module very quickly,
>> ntb_perf will occasionally miss the link up event. This is because
>> the
>> link_cleanup work gets delayed long enough that it gets scheduled
>> after
>> the link up event gets sent. It then cancels the link work that
>> should
>> have occurred. Without this patch, it never happens because
>> link_is_up
>> never returns to false.
>>
>> I think the correct solution is to just remove the link_cleanup work
>> and
>> do those actions immediately on receipt of the event. If there's
>> agreement on this I can re-spin it again.
> 
> I'm ok with that. This is not an issue with ntb_transport?

Looks like I can get something similar to happen in ntb_transport.
However, it's much rarer and takes significantly more tries to get it to
occur. It does appear to correctly set its link_is_up to false when the
link goes down.

I'm not sure I'm quite clear on the flow in ntb_transport and don't have
time right now to study it so I'll have to let that be someone else's
(fairly minor) issue.

Logan


Re: [PATCH v5 00/10] NTB Selftest Script

2016-06-27 Thread Logan Gunthorpe
Great! Thanks Jon.

Logan

On 26/06/16 05:29 PM, Jon Mason wrote:
> On Mon, Jun 20, 2016 at 01:15:03PM -0600, Logan Gunthorpe wrote:
>> Sorry, I thought this was done but I found one more minor issue with
>> these patches so I'm resubmitting them one last time. Besides this isuse,
>> I think I have acks for all the patches and everything is working as I'd
>> like.
> 
> Applying patch 05/10 to my ntb branch (as it is a bug fix and should
> go into 4.7)
> 
> Applying the rest to my ntb-next branch (which should go into 4.8)
> 
> Note: I'm including patch 09/10 in my tree, even though it is for the
> selftest subsystem.  Given that Shuah Khan acked it, I assume this is
> the desired outcome.
> 
> Thanks,
> Jon
> 
>>
>> Changes since v4:
>>
>> 1) In patch 0006, when setting a translation fails in tool_setup_mw,
>> we now properly clean up the peer dma memory. Before that patch, it
>> wasn't required because if tool_setup_mw failed the module would clean
>> up all mws. After this patch, the clean up never happened, so it would
>> return an error to the user but the DMA memory would still be allocated
>> and the peer_trans file would report that it's ready but the debugfs
>> file would not have been created. In v5, after an error, no DMA memory
>> is allocated and it will still report that it's not ready to the user when
>> reading peer_trans.
>>
>> ---
>>
>> Changes since v3:
>>
>> 1) Check the error returned when setting the link in ntb_tool and pass
>> it back to the user.
>>
>> 2) If an error occurs when setting the link down during a link test in
>> ntb_test, just print unsupported and continue on. (For hardware that
>> does not support setting the link down.)
>>
>> 3) Fix a race condition problem introduced by clearing the link is
>> up flag in ntb_perf. We do this by getting rid of the link cleanup
>> work and doing the few actions in the link event handler.
>> Dave Jiang has already ok'd this approach.
>>
>> ---
>>
>> Changes since v2:
>>
>> 1) As per Allan's suggestion, I've added a patch to postpone the
>> peer memory window setup until the user requests it with a 'peer_trans*'
>> file. This pushes the ntb_tool API closer to the kernel API and allows
>> the link status file to return the raw status of the NTB link.
>>
>> 2) Change the link status file to return the value of ntb_link_is_up
>> instead of the local 'memory window ready' value.
>>
>> 3) Change the link_event file to just block on write. As it was in v2,
>> if multiple users attempted to use the link_event file they could
>> corrupt the state another user had set. Reads to this file are no
>> longer permitted.
>>
>> 4) Updates to the ntb_test script to accommodate the above changes.
>>
>> 5) Added some link tests to the ntb_test script. It will bring the link
>> down and check that the other side agrees.
>>
>> 6) Added a minor bug fix (Patch #10) to ntb_perf. During discussions
>> with Allen it was noticed that the link_is_up flag is never cleared if
>> the link goes away.
>>
>> ---
>>
>> Changes since v1:
>>
>> 1) Add a comment to explain the *15 in the buf size calculation,
>> as per Allen's feedback.
>>
>> 2) Clean up the changes to the pingpong client as there were some
>> sloppy copying mistakes.
>>
>> 3) Rework the 'link' file in ntb_tool as per Allen's suggestions.
>> I've added a 'link_event' file the works essentially how he's asked.
>> Though, I found no need to use a completion as suggested and the flow
>> is maybe slightly simpler than he's suggested. Just write a boolean
>> to the event file then read to wait for the link to be either up or
>> down. There's still some discussion on the best interface and it's
>> not much work to make additional minor functional changes.
>>
>> 4) Update the selftest script to use the new 'link_event' file.
>>
>> 5) Minor change to the way the selftest script lists devices thanks to
>> Allen's observation.
>>
>> ---
>>
>> I've written a ntb_test.sh script that would probably be useful if it
>> were included in the kernel. This series ends with that script and
>> includes some useful interface improvements and fixes to the existing ntb
>> test modules. Please see each individual commit for more information.
>> They are mostly independent.
>>
>> The series is based off of v4.6 plus the patches I've submitted that
>> have been accepted into ntb-next. They've been run through checkpatch
>> with --strict t

Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration

2016-06-27 Thread Logan Gunthorpe

Hey,

This version is working for me as well. Thanks.

Logan

On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:

On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:

On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <log...@deltatee.com> wrote:

Hey Rafael,

This patch appears to be working on my laptop. Thanks.


Same for me: resume still works with KASLR in my tests too.


Unfortunately, Boris still sees post-resume memory corruption with the patch
you have tested, but that turns out to be a result of some super-weird
corruption of a pointer on the stack which leads to a store at a wrong address
(and there's no way it can be related to the rest of the patch).

We have verified that it can be avoided by rearranging 
set_up_temporary_text_mapping()
to use fewer local variables and the appended patch contains this modification.

I also went on and changed relocate_restore_code() slightly in a similar 
fashion,
but all of those changes should not affect the behavior (unless there's 
something
insane going on behind the curtains, that is).

Kees, Logan, Boris, please try this one and let me know if it works for you.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaraneed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm=146372852823760=2
Reported-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
---
 arch/x86/power/hibernate_64.c |   90 --
 arch/x86/power/hibernate_asm_64.S |   55 ++-
 2 files changed, 102 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
  * kernel's te

Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration

2016-06-30 Thread Logan Gunthorpe

Hey,

So, I've done a couple hundred hibernate resume cycles with v5 of the 
patch and so far it looks good. I'll let you know if I hit any bugs in 
more typical usage over the coming week.


Thanks,

Logan

On 30/06/16 10:11 AM, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaraneed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm=146372852823760=2
Reported-by: Logan Gunthorpe <log...@deltatee.com>
Tested-by: Kees Cook <keesc...@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
---

v4 -> v5: Replace flush_tlb_all() in relocate_restore_code() with
  __flush_tlb_all() (which still should be sufficient) to avoid
  complaints from smp_call_function_many() about disabled interrupts.

v3 -> v4: The new relocate_restore_code() didn't cover the case when
  the page containing the relocated code was mapped via a huge (1G)
  page and it didn't clear the NX bit in that case, which led to the
  page fault reported by Logan at
  http://marc.info/?l=linux-kernel=146725158621626=4.
  Fix that by making relocate_restore_code() handle the huge page
  mapping case too.

I've retained the Tested-by: from Kees as these changes have nothing to do with
whether or not KASLR works with hibernation after this patch.

Logan, please test this one thoroughly.  I'll give it at least one week in
linux-next going forward.

Boris, please test it on the machine where we saw memory corruption with
the previous versions if poss.

---
 arch/x86/power/hibernate_64.c |   97 +-
 arch/x86/power/hibernate_asm_64.S |   55 +
 2 files changed, 109 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 

 /* Defined in

Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration

2016-06-29 Thread Logan Gunthorpe

Hey Raf,

Sorry to report that although the patch works the majority of the time, 
I just got a suspicious kernel panic during resume.


It said:

"kernel tried to execute NX protected page - exploit attempt? (uid: 0)"

You can find a photo of the panic here:

http://staff.deltatee.com/~logang/panic.jpg

Logan


On 29/06/16 08:48 AM, Kees Cook wrote:

On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <log...@deltatee.com> wrote:

Hey,

This version is working for me as well. Thanks.

Logan

On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:


On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:


On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <log...@deltatee.com>
wrote:


Hey Rafael,

This patch appears to be working on my laptop. Thanks.



Same for me: resume still works with KASLR in my tests too.



Unfortunately, Boris still sees post-resume memory corruption with the
patch
you have tested, but that turns out to be a result of some super-weird
corruption of a pointer on the stack which leads to a store at a wrong
address
(and there's no way it can be related to the rest of the patch).

We have verified that it can be avoided by rearranging
set_up_temporary_text_mapping()
to use fewer local variables and the appended patch contains this
modification.

I also went on and changed relocate_restore_code() slightly in a similar
fashion,
but all of those changes should not affect the behavior (unless there's
something
insane going on behind the curtains, that is).

Kees, Logan, Boris, please try this one and let me know if it works for
you.


Tested-by: Kees Cook <keesc...@chromium.org>

I've done a few KASLR boots, and everything continues to look good to
me. Thanks!

-Kees



Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption
during image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaraneed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm=146372852823760=2
Reported-by: Lo

Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration

2016-06-29 Thread Logan Gunthorpe



On 29/06/16 08:55 PM, Rafael J. Wysocki wrote:


The only thing that comes to mind at this point is that TLBs should be flushed
after page tables changes, so please apply the appended and let me know
if you see this panic any more with it.



Ok, I'll build a new kernel tomorrow. But keep in mind the panic is 
pretty rare as I've only seen it once so far after a couple dozen or so 
hibernates. So it may be hard to get a concrete yes or no on whether it 
fixes the issue.


I've got a script to run a bunch of hibernates in a row. I usually only 
run it for a handful of iterations, but I'll try running it for much 
longer with this patch and let you know in a couple days.


Logan



Thanks,
Rafael


---
 arch/x86/power/hibernate_64.c |   92 +-
 arch/x86/power/hibernate_asm_64.S |   55 +-
 2 files changed, 104 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 

 /* Defined in hibernate_asm_64.S */
 extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;

 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;

 pgd_t *temp_level4_pgt __visible;

-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+   pmd_t *pmd;
+   pud_t *pud;
+
+   /*
+* The new mapping only has to cover the page containing the image
+* kernel's entry point (jump_address_phys), because the switch over to
+* it is carried out by relocated code running from a page allocated
+* specifically for this purpose and covered by the identity mapping, so
+* the temporary kernel text mapping is only needed for the final jump.
+* Moreover, in that mapping the virtual address of the image kernel's
+* entry point must be the same as its virtual address in the image
+* kernel (restore_jump_address), so the image kernel's
+* restore_registers() code doesn't find itself in a different area of
+* the virtual address space after switching over to the original page
+* tables used by the image kernel.
+*/
+   pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+   if (!pud)
+   return -ENOMEM;
+
+   pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+   if (!pmd)
+   return -ENOMEM;
+
+   set_pmd(pmd + pmd_index(restore_jump_address),
+   __pmd((jump_address_phys & PMD_MASK) | 
__PAGE_KERNEL_LARGE_EXEC));
+   set_pud(pud + pud_index(restore_jump_address),
+   __pud(__pa(pmd) | _KERNPG_TABLE));
+   set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+   __pgd(__pa(pud) | _KERNPG_TABLE));
+
+   return 0;
+}

 static void *alloc_pgt_page(void *context)
 {
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;

-   /* It is safe to reuse the original kernel mapping */
-   set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-   init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+   /* Prepare a temporary mapping for the kernel text */
+   result = set_up_temporary_text_mapping();
+   if (result)
+   return result;

/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,45 @@ static int set_up_temporary_mappings(voi
return 0;
 }

+static int relocate_restore_code(void)
+{
+   pgd_t *pgd;
+   pmd_t *pmd;
+
+   relocated_restore_code = get_safe_page(GFP_ATOMIC);
+   if (!relocated_restore_code)
+   return -ENOMEM;
+
+   memcpy((void *)relocated_restore_code, _restore_code, PAGE_SIZE);
+
+   /* Make the page containing the relocated code executable */
+   pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+   pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
+relocated_restore_code);
+   if (pmd_large(*pmd)) {
+   set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+   } else {
+   pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+   }
+   flush_tlb_all();
+
+   return 0;
+}
+
 int swsusp_arch_resume(void)
 {
int error;

/* We have got enough memory and from now on we cannot recover */
-   if ((error = 

Re: [PATCH v2 0/4] Style fixes: open code obfuscating macros

2017-02-01 Thread Logan Gunthorpe
Cool, Thanks.

Let me know if you need anything from me.

Logan

On 01/02/17 01:08 PM, Jon Mason wrote:
> On Thu, Jan 26, 2017 at 3:00 PM, Logan Gunthorpe <log...@deltatee.com> wrote:
>> Hi,
>>
>> It's been a couple weeks... Any thoughts on this?
> 
> My apologies for not responding sooner.  A large rework of the core
> NTB code was done prior to your patches (for IDT NTB support).
> Unfortunately, after those changes this series does not apply cleanly.
> I understand the changes being made, and will fix-up the patches to
> make it apply cleanly on the updated code base.
> 
> Thanks,
> Jon
> 
>>
>> Thanks,
>>
>> Logan
>>
>> On 10/01/17 05:33 PM, Logan Gunthorpe wrote:
>>> Hi,
>>>
>>> Here's an updated version of the style fixes patchset. The differences
>>> from v1 are:
>>>
>>> 1) Rebased onto Jon Mason's current NTB branch
>>>
>>> 2) Added two more patches to convert the print lines to use the ntb
>>> device and not the pci one. This seems more sane, shortens a bunch
>>> of lines and removes the need for temporaries.
>>>
>>> Note with (2): I've tried my best to ensure that statements that print
>>> before the ntb device is registered still use the PCI device. However,
>>> someone should probably review and test this as I don't have access
>>> to all types of hardware to do that.
>>>
>>> Thanks,
>>>
>>> Logan
>>>
>>>
>>>
>>> Logan Gunthorpe (4):
>>>   ntb_hw_amd: Style fixes: open code macros that just obfuscate code
>>>   ntb_hw_intel: Style fixes: open code macros that just obfuscate code
>>>   ntb_hw_amd: Print kernel messages with the ntb device not the pci one
>>>   ntb_hw_intel: Print kernel messages with the ntb device not the pci
>>> one
>>>
>>>  drivers/ntb/hw/amd/ntb_hw_amd.c |  60 +--
>>>  drivers/ntb/hw/amd/ntb_hw_amd.h |   3 -
>>>  drivers/ntb/hw/intel/ntb_hw_intel.c | 192 
>>> +---
>>>  drivers/ntb/hw/intel/ntb_hw_intel.h |   3 -
>>>  4 files changed, 124 insertions(+), 134 deletions(-)
>>>
>>> --
>>> 2.1.4
>>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups 
>> "linux-ntb" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to linux-ntb+unsubscr...@googlegroups.com.
>> To post to this group, send email to linux-...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/linux-ntb/5ed3be6c-c94e-1ade-e4cd-a8e8293296cd%40deltatee.com.
>> For more options, visit https://groups.google.com/d/optout.


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-01-31 Thread Logan Gunthorpe


On 31/01/17 10:26 AM, Greg Kroah-Hartman wrote:
> That's one big patch to review, would you want to do that?

Sorry, will do.

> Can you break it up into smaller parts?  At least put the documentation
> separately, right?

Ha, funny. Last time I sent a patch someone asked for the documentation
to be in the same patch. But I can easily split this up.

> And don't dump a .txt file into Documentation/ anymore, people are
> working to move to the newer format.

Fair. I wasn't sure where a good place to put it was. Any suggestions?

> Also, please rebase against Linus's tree at the least, we can't go back
> in time and apply this to the 4.9 kernel tree.

Will do.

> Why a .h file for a single .c file?

I wanted to keep the hardware defining structs and macros in a separate
file for future expansion. This hardware is also capable of some NTB
functions which may find it's way into the kernel in the future.

> Also, why a whole new directory?

We didn't feel it fit in the pci director which was for standard pci
stuff. We're more than open to other suggestions as to where this code
belongs.

Thanks,

Logan



[PATCH 1/1] MicroSemi Switchtec management interface driver

2017-01-31 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already well
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint with a separate
PCI function address and class code. This endpoint enables some additional
functionality which includes:

 * Packet and Byte Counters
 * Switch Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
PCI driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls. A couple of special IOCTLs are provided to:

* Inform userspace of firmware partition locations
* Pass event counts and allow userspace to wait on events

A short text file is provided which documents the switchtec driver,
outlines the semantics of using the char device and describes the
IOCTLs.

The device also exposes a few read-only sysfs attributes which provide
some device information component names and versions which is provided
by the hardware. These are documented in
Documentation/ABI/testing/sysfs-class-switchtec

A userspace tool and library which utilizes this interface is available
at [1]. This tool takes inspiration (and borrows some code) from
nvme-cli [2]. The tool is largely complete at this time but additional
features may be added in the future.

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1320 +++
 drivers/pci/switch/switchtec.h  |  266 +
 include/uapi/linux/switchtec_ioctl.h|  129 +++
 11 files changed, 1919 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 drivers/pci/switch/switchtec.h
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

diff --git a/Documentation/ABI/testing/sysfs-class-switchtec 
b/Documentation/ABI/testing/sysfs-class-switchtec
new file mode 100644
index 000..4fbc417
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -0,0 +1,96 @@
+switchtec - Microsemi Switchtec PCI Switch Management Endpoint
+
+For details on this subsystem look at Documentation/switchtec.txt.
+
+What:  /sys/class/switchtec
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   The switchtec class subsystem folder.
+   Each registered switchtec driver is represented by a switchtecX
+   subfolder (X being an integer >= 0).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component identifier as stored in the hardware (eg. PM4385)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component revision stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/device_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Device version as stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/fw_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Currently running firmware version (read only)
+Values:integer (in hexadecimal).
+
+
+What:  /sys/class/sw

[PATCH 0/1] DRAFT: New Microsemi PCI Switch Management Driver

2017-01-31 Thread Logan Gunthorpe
Hi,

This is a continuation of the RFC we posted lasted month [1] which
proposes a management driver for Microsemi's Switchtec line of PCI
switches. This hardware is still looking to be used in the Open
Compute Platform

To make this entirely clear: the Switchtec products are compliant
with the PCI specifications and are supported today with the standard
in-kernel driver. However, these devices also expose a management endpoint
on a separate PCI function address which can be used to perform some
advanced operations. This is a driver for that function. See the patch
for more information.

Since the RFC, we've made the changes requested by Greg Kroah-Hartman
and Keith Busch, and we've also fleshed out a number of features. We've
added a couple of IOCTLs and sysfs attributes which are documented in
the patch. Significant work has also been done on the userspace tool
which is available under a GPL license at [2]. We've also had testing
done by some of the interested parties.

We hope to see this work included in either 4.11 or 4.12 assuming a
smooth review process.

The patch is based off of the v4.9 release.

Thanks for your review,

Logan

[1] https://www.spinics.net/lists/linux-pci/msg56897.html
[2] https://github.com/sbates130272/switchtec-user

--

Changes since RFC:

* Fixed incorrect use of the drive model as pointed out by Greg
  Kroah-Hartman
* Used devm functions as suggested by Keith Busch
* Added a handful of sysfs attributes to the switchtec class
* Added a handful of IOCTLs to the switchtec device
* A number of miscelaneous bug fixes

--

Logan Gunthorpe (1):
  MicroSemi Switchtec management interface driver

 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1320 +++
 drivers/pci/switch/switchtec.h  |  266 +
 include/uapi/linux/switchtec_ioctl.h|  129 +++
 11 files changed, 1919 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 drivers/pci/switch/switchtec.h
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

--
2.1.4


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-01-31 Thread Logan Gunthorpe


On 31/01/17 11:57 AM, Greg Kroah-Hartman wrote:
> Sorry, it was probably me :)

Nope, it was Christoph Hellwig. I don't mind changing it. It's just hard
to know what's expected all the time.

> Why do you need this?  Wherever you put it, it should be built as part
> of the online kernel documentation.  Who is the audience for this
> documentation?

Well usually documenting how to use the device is a good thing. Is it
not? The audience would be people wanting to make use of the userspace
interface. Though in fairness, the vast majority of people should use
our library. I also thought it would be useful to reviewers to more
quickly understand the interface we are creating. If maintainers want us
to leave it out, we can, but that seems like an odd request.

> Do future stuff in the future, no need for that now, right?
> 
> Simple is best.

Ok, I can push it all into the C file for v2.

> I'll leave that up to the PCI maintainer, but just a single .c file in a
> subdir seems odd to me.

Thanks,

Logan


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-01-31 Thread Logan Gunthorpe


On 31/01/17 10:49 AM, Jonathan Corbet wrote:
> The good news is that your switchtec.txt file is already 99% in the RST
> format, so there is little or nothing to do there.
> 
> The bad news is that we don't quite have a place for it yet.  This is
> really user-space developer documentation, and we don't have a sub-book
> for that.  I expect that to change pretty soon, and I might even toss
> together a bare beginning for 4.11, but that hasn't happened yet.
> 
> If you would like to learn the ropes and make one, that would be more than
> great :)  Alternatively, I'd suggest just leaving the document as-is, and
> I'll put it toward the top of the list of things to move into place once I
> get that book started.

Thanks Jon. I took a look at the documentation and it doesn't look too
difficult. But I don't really feel qualified to make some of the
decisions necessary to create the new book. (ie names, locations, etc.)
And I don't really want that churn in this patchset. So we'll leave it
where it is for now.

I have cleaned up some of the RST format for v2 though.

Logan


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-02-02 Thread Logan Gunthorpe


On 01/02/17 05:10 AM, Emil Velikov wrote:
> You can keep it roughly as-is if you're ~reasonably certain one won't
> change it in the future.

I've made the change anyway. I think it's better now.

> Some teams frown upon adding new IOCTL(s) where existing ones can be
> made backward/forward compatible.
> I'm not fully aware of the general direction/consensus on the topic,
> so it might be a minority.

Sure, I just don't know what might be needed in the future so it's hard
to add a version or flags ioctl now.

> On the other hand, reading through sysfs for module version in order
> to use IOCTL A or B sounds quite hacky. Do you have an example where
> this is used or pointed out as good approach ?

I don't know of anything doing it that way now. But it sure would be
easy and make a bit of sense. (We'd actually use the module version for
something useful.) Either way, it would really depend on if and how
things change in the future. The point is there are options to expand if
needed.


> Afaict the idea is to not ship/bundle/release userspace until kernel
> parts are in.
> The "do not commit the changes" is implied as [very rarely] distros
> package from "random" git checkouts. Leading to all sorts of fun when
> it is mismatched wrt the kernel parts. Likelihood of doing that here
> is virtually none here, so this is a JFYI inspired by some past
> experiences.

Understood.

> Glad to hear. Then again you already had most of the things nicely done, imho.

Great, thanks.

Logan


[PATCH v2 4/4] switchtec: Add IOCTLs to the Switchtec driver

2017-02-02 Thread Logan Gunthorpe
This patch introduces a couple of special IOCTLs which are provided to:

* Inform userspace of firmware partition locations
* Pass event counts and allow userspace to wait on events
* Translate between PFF numbers used by the switch to
  port numbers.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 Documentation/ioctl/ioctl-number.txt |   1 +
 Documentation/switchtec.txt  |  27 ++
 MAINTAINERS  |   1 +
 drivers/pci/switch/switchtec.c   | 467 +++
 include/uapi/linux/switchtec_ioctl.h | 132 ++
 5 files changed, 628 insertions(+)
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 81c7f2b..032b33c 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -191,6 +191,7 @@ Code  Seq#(hex) Include FileComments
 'W'00-1F   linux/watchdog.hconflict!
 'W'00-1F   linux/wanrouter.h   conflict!   (pre 3.9)
 'W'00-3F   sound/asound.h  conflict!
+'W'40-5F   drivers/pci/switch/switchtec.c
 'X'all fs/xfs/xfs_fs.h conflict!
and fs/xfs/linux-2.6/xfs_ioctl32.h
and include/linux/falloc.h
diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
index 4bced4c..a0a9c7b 100644
--- a/Documentation/switchtec.txt
+++ b/Documentation/switchtec.txt
@@ -51,3 +51,30 @@ The char device has the following semantics:
 
 * The poll call will also be supported for userspace applications that
   need to do other things while waiting for the command to complete.
+
+The following IOCTLs are also supported by the device:
+
+* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
+  of partitions in the device.
+
+* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for
+  any specified partition in flash.
+
+* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps
+  indicating all uncleared events.
+
+* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags
+  for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct
+  with the event_id, index and flags set (index being the partition or PFF
+  number for non-global events). It returns whether the event has
+  occurred, the number of times and any event specific data. The flags
+  can be used to clear the count or enable and disable actions to
+  happen when the event occurs.
+  By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag,
+  you can set an event to trigger a poll command to return with
+  POLLPRI. In this way, userspace can wait for events to occur.
+
+* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert
+  between PCI Function Framework number (used by the event system)
+  and Switchtec Logic Port ID and Partition number (which is more
+  user friendly).
diff --git a/MAINTAINERS b/MAINTAINERS
index d39b7fd..ea461d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9516,6 +9516,7 @@ S:Maintained
 F: Documentation/switchtec.txt
 F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
+F: include/uapi/linux/switchtec_ioctl.h
 
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index 354ddfd..82bfd18 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -13,6 +13,8 @@
  *
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -755,6 +757,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, 
poll_table *wait)
return ret;
 }
 
+static int ioctl_flash_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+
+   info.flash_length = ioread32(>flash_length);
+   info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS;
+
+   if (copy_to_user(uinfo, , sizeof(info)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info,
+struct partition_info __iomem *pi)
+{
+   info->address = ioread32(>address);
+   info->length = ioread32(>length);
+}
+
+static int ioctl_flash_part_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_part_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_part_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+   u32 active_addr = -1;
+
+   if (copy_from_user(, uinfo, sizeof(info)))
+   return -EFAULT;
+
+   switch (info.flash_partition) {
+   case SWITCHTEC_IOCTL

[PATCH v2 2/4] switchtec: Add user interface documentation

2017-02-02 Thread Logan Gunthorpe
This adds standard documentation for the sysfs switchtec attributes and
a RST formatted text file which documents the char device interface.
Jonathan Corbet has indicated he will move this to a new user-space
developer documentation book once it's created.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 Documentation/switchtec.txt | 53 +
 MAINTAINERS |  1 +
 2 files changed, 54 insertions(+)
 create mode 100644 Documentation/switchtec.txt

diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
new file mode 100644
index 000..4bced4c
--- /dev/null
+++ b/Documentation/switchtec.txt
@@ -0,0 +1,53 @@
+
+Linux Switchtec Support
+
+
+Microsemi's "Switchtec" line of PCI switch devices is already
+supported by the kernel with standard PCI switch drivers. However, the
+Switchtec device advertises a special management endpoint which
+enables some additional functionality. This includes:
+
+* Packet and Byte Counters
+* Firmware Upgrades
+* Event and Error logs
+* Querying port link status
+* Custom user firmware commands
+
+The switchtec kernel module implements this functionality.
+
+
+Interface
+=
+
+The primary means of communicating with the Switchtec management firmware is
+through the Memory-mapped Remote Procedure Call (MRPC) interface.
+Commands are submitted to the interface with a 4-byte command
+identifier and up to 1KB of command specific data. The firmware will
+respond with a 4 bytes return code and up to 1KB of command specific
+data. The interface only processes a single command at a time.
+
+
+Userspace Interface
+===
+
+The MRPC interface will be exposed to userspace through a simple char
+device: /dev/switchtec#, one for each management endpoint in the system.
+
+The char device has the following semantics:
+
+* A write must consist of at least 4 bytes and no more than 1028 bytes.
+  The first four bytes will be interpreted as the command to run and
+  the remainder will be used as the input data. A write will send the
+  command to the firmware to begin processing.
+
+* Each write must be followed by exactly one read. Any double write will
+  produce an error and any read that doesn't follow a write will
+  produce an error.
+
+* A read will block until the firmware completes the command and return
+  the four bytes of status plus up to 1024 bytes of output data. (The
+  length will be specified by the size parameter of the read call --
+  reading less than 4 bytes will produce an error.
+
+* The poll call will also be supported for userspace applications that
+  need to do other things while waiting for the command to complete.
diff --git a/MAINTAINERS b/MAINTAINERS
index 9c35198..0ab858d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9513,6 +9513,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com>
 M: Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
+F: Documentation/switchtec.txt
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
-- 
2.1.4



[PATCH v2 3/4] switchtec: Add sysfs attributes to the Switchtec driver

2017-02-02 Thread Logan Gunthorpe
This patch adds a few read-only sysfs attributes which provide
some device information that is exposed from the devices. Primarily
component and device names and versions. These are documented in
Documentation/ABI/testing/sysfs-class-switchtec.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 Documentation/ABI/testing/sysfs-class-switchtec |  96 
 MAINTAINERS |   1 +
 drivers/pci/switch/switchtec.c  | 113 
 3 files changed, 210 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec

diff --git a/Documentation/ABI/testing/sysfs-class-switchtec 
b/Documentation/ABI/testing/sysfs-class-switchtec
new file mode 100644
index 000..48cb4c1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -0,0 +1,96 @@
+switchtec - Microsemi Switchtec PCI Switch Management Endpoint
+
+For details on this subsystem look at Documentation/switchtec.txt.
+
+What:  /sys/class/switchtec
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   The switchtec class subsystem folder.
+   Each registered switchtec driver is represented by a switchtecX
+   subfolder (X being an integer >= 0).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component identifier as stored in the hardware (eg. PM8543)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component revision stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/device_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Device version as stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/fw_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Currently running firmware version (read only)
+Values:integer (in hexadecimal).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Partition number for this device in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition_count
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Total number of partitions in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_id
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product identifier as stored in the hardware (eg. PSX 48XG3)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product revision stored in the hardware (eg. RevB)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
diff --git a/MAINTAINERS b/MAINTAINERS
index 0ab858d..d39b7fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9514,6 +9514,7 @@ M:Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
 F: Documentation/switchtec.txt
+F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index e4a4294..354ddfd 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -485,6 +485,118 @@ static void mrpc

[PATCH v2 1/4] MicroSemi Switchtec management interface driver

2017-02-02 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already well
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint with a separate
PCI function address and class code. This endpoint enables some additional
functionality which includes:

 * Packet and Byte Counters
 * Switch Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
PCI driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls.

A userspace tool and library which utilizes this interface is available
at [1]. This tool takes inspiration (and borrows some code) from
nvme-cli [2]. The tool is largely complete at this time but additional
features may be added in the future.

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 MAINTAINERS|8 +
 drivers/pci/Kconfig|1 +
 drivers/pci/Makefile   |1 +
 drivers/pci/switch/Kconfig |   13 +
 drivers/pci/switch/Makefile|1 +
 drivers/pci/switch/switchtec.c | 1028 
 6 files changed, 1052 insertions(+)
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 5f10c28..9c35198 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9507,6 +9507,14 @@ S:   Maintained
 F: Documentation/devicetree/bindings/pci/aardvark-pci.txt
 F: drivers/pci/host/pci-aardvark.c
 
+PCI DRIVER FOR MICROSEMI SWITCHTEC
+M: Kurt Schwemmer <kurt.schwem...@microsemi.com>
+M: Stephen Bates <stephen.ba...@microsemi.com>
+M: Logan Gunthorpe <log...@deltatee.com>
+L: linux-...@vger.kernel.org
+S: Maintained
+F: drivers/pci/switch/switchtec*
+
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
 L: linux-te...@vger.kernel.org
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 6555eb7..f72e8c5 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -133,3 +133,4 @@ config PCI_HYPERV
 
 source "drivers/pci/hotplug/Kconfig"
 source "drivers/pci/host/Kconfig"
+source "drivers/pci/switch/Kconfig"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 8db5079..15b46dd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG
 
 # PCI host controller drivers
 obj-y += host/
+obj-y += switch/
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
new file mode 100644
index 000..4c49648
--- /dev/null
+++ b/drivers/pci/switch/Kconfig
@@ -0,0 +1,13 @@
+menu "PCI switch controller drivers"
+   depends on PCI
+
+config PCI_SW_SWITCHTEC
+   tristate "MicroSemi Switchtec PCIe Switch Management Driver"
+   help
+Enables support for the management interface for the MicroSemi
+Switchtec series of PCIe switches. Supports userspace access
+to submit MRPC commands to the switch via /dev/switchtecX
+devices. See  for more
+information.
+
+endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
new file mode 100644
index 000..37d8cfb
--- /dev/null
+++ b/drivers/pci/switch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
new file mode 100644
index 000..e4a4294
--- /dev/null
+++ b/drivers/pci/switch/switchtec.c
@@ -0,0 +1,1028 @@
+/*
+ * Microsemi Switchtec(tm) PCIe Management Driver
+ * Copyright (c) 2017, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation");
+
+static int max_devices = 16;
+module_param(max_devices, int, 0644);
+MODULE_PARM_DESC(max_devices, "max number of switchtec device instances");
+
+static dev_t switchtec_devt;
+static stru

[PATCH v2 0/4] New Microsemi PCI Switch Management Driver

2017-02-02 Thread Logan Gunthorpe
Changes since v1:

* Rebased onto 4.10-rc6 (cleanly)
* Split the patch into a few more easily digestible patches (as
  suggested by Greg Kroah-Hartman)
* Folded switchtec.c into switchtec.h (per Greg)
* Fixed a bunch of 32bit build warnings caught by the kbuild test robot
* Fixed some issues in the documentation so it has a proper
  reStructredText format (as noted by Jonathan Corbet)
* Fixed padding and sizes in the IOCTL structures as noticed by Emil
  Velikov and used pahole to verify their consistency across 32 and 64
  bit builds
* Reworked one of the IOCTL interfaces to be more future proof (per
  Emil).

Changes since RFC:

* Fixed incorrect use of the drive model as pointed out by Greg
  Kroah-Hartman
* Used devm functions as suggested by Keith Busch
* Added a handful of sysfs attributes to the switchtec class
* Added a handful of IOCTLs to the switchtec device
* A number of miscellaneous bug fixes

--

Hi,

This is a continuation of the RFC we posted lasted month [1] which
proposes a management driver for Microsemi's Switchtec line of PCI
switches. This hardware is still looking to be used in the Open
Compute Platform

To make this entirely clear: the Switchtec products are compliant
with the PCI specifications and are supported today with the standard
in-kernel driver. However, these devices also expose a management endpoint
on a separate PCI function address which can be used to perform some
advanced operations. This is a driver for that function. See the patch
for more information.

Since the RFC, we've made the changes requested by Greg Kroah-Hartman
and Keith Busch, and we've also fleshed out a number of features. We've
added a couple of IOCTLs and sysfs attributes which are documented in
the patch. Significant work has also been done on the userspace tool
which is available under a GPL license at [2]. We've also had testing
done by some of the interested parties.

We hope to see this work included in either 4.11 or 4.12 assuming a
smooth review process.

The patch is based off of the v4.10-rc6 release.

Thanks for your review,

Logan

[1] https://www.spinics.net/lists/linux-pci/msg56897.html
[2] https://github.com/sbates130272/switchtec-user

--

Logan Gunthorpe (4):
  MicroSemi Switchtec management interface driver
  switchtec: Add user interface documentation
  switchtec: Add sysfs attributes to the Switchtec driver
  switchtec: Add IOCTLs to the Switchtec driver

 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1608 +++
 include/uapi/linux/switchtec_ioctl.h|  132 ++
 10 files changed, 1944 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

--
2.1.4


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-01-31 Thread Logan Gunthorpe
Hi Emil,

Thanks for the feedback.

On 31/01/17 01:48 PM, Emil Velikov wrote:
>> +struct switchtec_ioctl_fw_info {
>> +   __u32 flash_length;
>> +
>> +   struct {
>> +   __u32 address;
>> +   __u32 length;
>> +   __u32 active;
> Something to keep in mind, not sure how likely it is here:
> If you embed structs (partition in this case), you will not be able to
> extend it [the embedded one] it in the future without the risk of ABI
> breakage.

Based on this feedback, I'll rework the fw_info IOCTL so the user only
requests one partition at a time. This will get rid of the embedded
struct and any concerns about size changes. I was originally trying to
make the ioctl RO to make it simpler but that maybe isn't the case.

>> +   } partition[SWITCHTEC_IOCTL_NUM_PARTITIONS];
>> +};
>> +
> Afaict this and most/all other structs [in this patch] are in
> violation of Prerequisites#2 and/or #3.
> Personally I find pahole quite useful - reference all the UABI structs
> in a dummy userspace app, compile with gcc -g -O0 and observe the
> output across 32 and 64bit builds. Members _must_ be at the same
> offset, otherwise things will be broken with 64bit kernel on a 32bit
> userspace. Something which people will eventually end up trying/using,
> even if you don't plan to support it.

I've made some changes to the padding and sizes to accommodate this.
I'll also do some testing with pahole before v2.


> Please check that Basics (#2 and #5 in particular) are also covered.

> then you need a driver feature flag or revision number somewhere.

We don't have this specifically. A new version ioctl could be added
later when and if changes occur; or the userspace code could easily read
the module version through sysfs and we could increment that with ioctl
changes.

> A couple more generic suggestions, which I may have noticed while
> skimming through:
>  - please ensure that all the input is properly sanitised, before,
> going into the actual implementation
> Afaict having the separate step/separation helps clarity and reduces
> chances of you/others getting it wrong down the line.

The inputs are indeed properly checked in the code. Most of the ioctls
that take inputs are so simple it's hard to separate the two steps. ie.
Verifying that the input is right pretty much gives you the answer you
need to send back to userspace.

>  - user provided storage must not be changed when the ioctl fails.

This should already be true.

> Additionally you want to provide a reference to open-source userspace
> [alongside acknowledgement of the maintainers/co-developers about the
> design] which makes use of said IOCTL(s). But please, do _not_ merge
> userspace code before the kernel work has landed.

I'm having trouble understanding what you're asking here. We provided a
reference to the userspace code in the commit message. Currently the
userspace code is completely useless without the kernel module and it is
entirely independent of other projects. I'm also the only committer so
far on the userspace code. So I assume this only applies if we were
merging changes to other existing code?

> Hope that provided you with sufficient good points wrt IOCTL design.

Thanks, this was very helpful.

Logan


Re: [PATCH v2 0/4] Style fixes: open code obfuscating macros

2017-01-26 Thread Logan Gunthorpe
Hi,

It's been a couple weeks... Any thoughts on this?

Thanks,

Logan

On 10/01/17 05:33 PM, Logan Gunthorpe wrote:
> Hi,
> 
> Here's an updated version of the style fixes patchset. The differences
> from v1 are:
> 
> 1) Rebased onto Jon Mason's current NTB branch
> 
> 2) Added two more patches to convert the print lines to use the ntb
> device and not the pci one. This seems more sane, shortens a bunch
> of lines and removes the need for temporaries.
> 
> Note with (2): I've tried my best to ensure that statements that print
> before the ntb device is registered still use the PCI device. However,
> someone should probably review and test this as I don't have access
> to all types of hardware to do that.
> 
> Thanks,
> 
> Logan
> 
> 
> 
> Logan Gunthorpe (4):
>   ntb_hw_amd: Style fixes: open code macros that just obfuscate code
>   ntb_hw_intel: Style fixes: open code macros that just obfuscate code
>   ntb_hw_amd: Print kernel messages with the ntb device not the pci one
>   ntb_hw_intel: Print kernel messages with the ntb device not the pci
> one
> 
>  drivers/ntb/hw/amd/ntb_hw_amd.c |  60 +--
>  drivers/ntb/hw/amd/ntb_hw_amd.h |   3 -
>  drivers/ntb/hw/intel/ntb_hw_intel.c | 192 
> +---
>  drivers/ntb/hw/intel/ntb_hw_intel.h |   3 -
>  4 files changed, 124 insertions(+), 134 deletions(-)
> 
> --
> 2.1.4
> 


Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver

2017-02-17 Thread Logan Gunthorpe
Hi Bjorn,

Can you give us an idea of when you might be able to comment on our
patchset? We've addressed all the outstanding issues and have a couple
of reviewed and tested tags. So we'd like to see this move forward as
soon as possible.

I can do a respin with the tags collected or address any concerns you
may have, just please let us know.

Thanks,

Logan

On 02/02/17 11:05 AM, Logan Gunthorpe wrote:
> Changes since v1:
> 
> * Rebased onto 4.10-rc6 (cleanly)
> * Split the patch into a few more easily digestible patches (as
>   suggested by Greg Kroah-Hartman)
> * Folded switchtec.c into switchtec.h (per Greg)
> * Fixed a bunch of 32bit build warnings caught by the kbuild test robot
> * Fixed some issues in the documentation so it has a proper
>   reStructredText format (as noted by Jonathan Corbet)
> * Fixed padding and sizes in the IOCTL structures as noticed by Emil
>   Velikov and used pahole to verify their consistency across 32 and 64
>   bit builds
> * Reworked one of the IOCTL interfaces to be more future proof (per
>   Emil).
> 
> Changes since RFC:
> 
> * Fixed incorrect use of the drive model as pointed out by Greg
>   Kroah-Hartman
> * Used devm functions as suggested by Keith Busch
> * Added a handful of sysfs attributes to the switchtec class
> * Added a handful of IOCTLs to the switchtec device
> * A number of miscellaneous bug fixes
> 
> --
> 
> Hi,
> 
> This is a continuation of the RFC we posted lasted month [1] which
> proposes a management driver for Microsemi's Switchtec line of PCI
> switches. This hardware is still looking to be used in the Open
> Compute Platform
> 
> To make this entirely clear: the Switchtec products are compliant
> with the PCI specifications and are supported today with the standard
> in-kernel driver. However, these devices also expose a management endpoint
> on a separate PCI function address which can be used to perform some
> advanced operations. This is a driver for that function. See the patch
> for more information.
> 
> Since the RFC, we've made the changes requested by Greg Kroah-Hartman
> and Keith Busch, and we've also fleshed out a number of features. We've
> added a couple of IOCTLs and sysfs attributes which are documented in
> the patch. Significant work has also been done on the userspace tool
> which is available under a GPL license at [2]. We've also had testing
> done by some of the interested parties.
> 
> We hope to see this work included in either 4.11 or 4.12 assuming a
> smooth review process.
> 
> The patch is based off of the v4.10-rc6 release.
> 
> Thanks for your review,
> 
> Logan
> 
> [1] https://www.spinics.net/lists/linux-pci/msg56897.html
> [2] https://github.com/sbates130272/switchtec-user
> 
> --
> 
> Logan Gunthorpe (4):
>   MicroSemi Switchtec management interface driver
>   switchtec: Add user interface documentation
>   switchtec: Add sysfs attributes to the Switchtec driver
>   switchtec: Add IOCTLs to the Switchtec driver
> 
>  Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
>  Documentation/ioctl/ioctl-number.txt|1 +
>  Documentation/switchtec.txt |   80 ++
>  MAINTAINERS |   11 +
>  drivers/pci/Kconfig |1 +
>  drivers/pci/Makefile|1 +
>  drivers/pci/switch/Kconfig  |   13 +
>  drivers/pci/switch/Makefile |1 +
>  drivers/pci/switch/switchtec.c  | 1608 
> +++
>  include/uapi/linux/switchtec_ioctl.h|  132 ++
>  10 files changed, 1944 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
>  create mode 100644 Documentation/switchtec.txt
>  create mode 100644 drivers/pci/switch/Kconfig
>  create mode 100644 drivers/pci/switch/Makefile
>  create mode 100644 drivers/pci/switch/switchtec.c
>  create mode 100644 include/uapi/linux/switchtec_ioctl.h
> 
> --
> 2.1.4
> 


Re: [PATCH v2 3/4] switchtec: Add sysfs attributes to the Switchtec driver

2017-02-23 Thread Logan Gunthorpe


On 23/02/17 03:43 PM, Bjorn Helgaas wrote:
> This path seems a little generic.  I don't see other cases where a
> product brand name ("Switchtec") appears at the top level of
> /sys/class/...

Ok, well we are certainly open to suggestions, but there isn't really a
generic version of this device available so I'm not sure how we would
change that. Per device-type classes aren't that uncommon though, a
quick grep shows things like:

platform/chrome/cros_ec_dev.c:40:static struct class cros_class
s390/char/raw3270.h:94:extern struct class *class3270;
net/ethernet/hisilicon/hns/hnae.c:19:static struct class *hnae_class;
mfd/ucb1x00-core.c:490:static struct class ucb1x00_class

> My question is based on "ls Documentation/ABI/testing/sysfs-class*",
> not on any great knowledge of sysfs, and I see Greg has already given
> a Reviewed-by for this, so maybe this is the right approach.
> 
> It does seem like the path could include a clue that this is related
> to PCI.

I mean, we could change it to pci-switchtec or something if you think
that would be better..?? But I'm not sure how else to accommodate this.

> Is there a link to the switch PCI device itself, e.g., to
> /sys/devices/pci*?  Should these attributes simply be put in a
> subdirectory there, e.g., in
> 
>   /sys/devices/pci:00/:00:00.0/stats/...

Well our device shows up here in the tree:

/sys/devices/pci:00/:00:03.0/:03:00.1/switchtec/switchtec0

(Which userspace can get to by following the link at
/sys/class/switchtec/switchtec0) The switch is then always:

/sys/devices/pci:00/:00:03.0

Thanks,

Logan


Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver

2017-02-23 Thread Logan Gunthorpe
Thanks Bjorn! I understand your busy and we appreciate your time in this
matter.

I'll prepare a v3 with a collected set of tags shortly. We're more than
happy to clean this up to make your job as easy as possible. We were
just looking for direction in how to move this forward.

Logan

On 23/02/17 03:14 PM, Bjorn Helgaas wrote:
> On Thu, Feb 23, 2017 at 01:36:51PM -0700, Logan Gunthorpe wrote:
>> Hello,
>>
>> We're still waiting on any kind of response from Bjorn. (If you're
>> listening please say something!)
>>
>> Does anyone have any suggestions for dealing with an unresponsive
>> maintainer? Or a way for us to move forward with this quickly and get it
>> merged?
> 
> I try to deal with regressions first and other bug fixes second.
> After that, I look at things that add new functionality.  I try to
> look at the new stuff in roughly chronological order, as you would see
> here:
> 
> https://patchwork.ozlabs.org/project/linux-pci/list/?order=date=1
> 
> If other folks have feedback, as they did on your 12/17, 1/31, and
> even the 2/2 posting, I generally let that get sorted out before I
> look at it.  I apologize that I haven't responded to your queries
> about posting a v3 vs updating v2.
> 
> To answer that question, it's much simpler for me to deal with a
> fresh, clean new series than it is to tweak things in an
> already-posted series, partly because a series with discussion other
> than simple acks and reviewed-bys looks more like work-in-progress.
> 
> Bjorn
> 


Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver

2017-02-23 Thread Logan Gunthorpe


On 23/02/17 01:51 PM, Sinan Kaya wrote:
> You'll want to be careful during the merge window (these days) as the
> maintainer is usually busy with code delivery. You can't rush your code in at
> the last minute.

Thanks for the advice, we will continue to wait.

However, I would say we've been very patient. It's been three weeks
since we posted the latest revision, a month since the first version and
almost 3 months since our RFC. I don't think it's too much to expect at
least a response saying that it's in the works or something. That long
with dead silence from the maintainer is a bit much.

Logan



Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver

2017-02-23 Thread Logan Gunthorpe
Hello,

We're still waiting on any kind of response from Bjorn. (If you're
listening please say something!)

Does anyone have any suggestions for dealing with an unresponsive
maintainer? Or a way for us to move forward with this quickly and get it
merged?

ie. Can anyone else pick this up through another route? In the end, it's
just a fairly basic driver and doesn't touch any core PCI functionality
and we've had a fair amount of review from other kernel contributors,
all of which we've addressed.

Thanks,

Logan




On 17/02/17 01:36 PM, Logan Gunthorpe wrote:
> Hi Bjorn,
> 
> Can you give us an idea of when you might be able to comment on our
> patchset? We've addressed all the outstanding issues and have a couple
> of reviewed and tested tags. So we'd like to see this move forward as
> soon as possible.
> 
> I can do a respin with the tags collected or address any concerns you
> may have, just please let us know.
> 
> Thanks,
> 
> Logan
> 
> On 02/02/17 11:05 AM, Logan Gunthorpe wrote:
>> Changes since v1:
>>
>> * Rebased onto 4.10-rc6 (cleanly)
>> * Split the patch into a few more easily digestible patches (as
>>   suggested by Greg Kroah-Hartman)
>> * Folded switchtec.c into switchtec.h (per Greg)
>> * Fixed a bunch of 32bit build warnings caught by the kbuild test robot
>> * Fixed some issues in the documentation so it has a proper
>>   reStructredText format (as noted by Jonathan Corbet)
>> * Fixed padding and sizes in the IOCTL structures as noticed by Emil
>>   Velikov and used pahole to verify their consistency across 32 and 64
>>   bit builds
>> * Reworked one of the IOCTL interfaces to be more future proof (per
>>   Emil).
>>
>> Changes since RFC:
>>
>> * Fixed incorrect use of the drive model as pointed out by Greg
>>   Kroah-Hartman
>> * Used devm functions as suggested by Keith Busch
>> * Added a handful of sysfs attributes to the switchtec class
>> * Added a handful of IOCTLs to the switchtec device
>> * A number of miscellaneous bug fixes
>>
>> --
>>
>> Hi,
>>
>> This is a continuation of the RFC we posted lasted month [1] which
>> proposes a management driver for Microsemi's Switchtec line of PCI
>> switches. This hardware is still looking to be used in the Open
>> Compute Platform
>>
>> To make this entirely clear: the Switchtec products are compliant
>> with the PCI specifications and are supported today with the standard
>> in-kernel driver. However, these devices also expose a management endpoint
>> on a separate PCI function address which can be used to perform some
>> advanced operations. This is a driver for that function. See the patch
>> for more information.
>>
>> Since the RFC, we've made the changes requested by Greg Kroah-Hartman
>> and Keith Busch, and we've also fleshed out a number of features. We've
>> added a couple of IOCTLs and sysfs attributes which are documented in
>> the patch. Significant work has also been done on the userspace tool
>> which is available under a GPL license at [2]. We've also had testing
>> done by some of the interested parties.
>>
>> We hope to see this work included in either 4.11 or 4.12 assuming a
>> smooth review process.
>>
>> The patch is based off of the v4.10-rc6 release.
>>
>> Thanks for your review,
>>
>> Logan
>>
>> [1] https://www.spinics.net/lists/linux-pci/msg56897.html
>> [2] https://github.com/sbates130272/switchtec-user
>>
>> --
>>
>> Logan Gunthorpe (4):
>>   MicroSemi Switchtec management interface driver
>>   switchtec: Add user interface documentation
>>   switchtec: Add sysfs attributes to the Switchtec driver
>>   switchtec: Add IOCTLs to the Switchtec driver
>>
>>  Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
>>  Documentation/ioctl/ioctl-number.txt|1 +
>>  Documentation/switchtec.txt |   80 ++
>>  MAINTAINERS |   11 +
>>  drivers/pci/Kconfig |1 +
>>  drivers/pci/Makefile|1 +
>>  drivers/pci/switch/Kconfig  |   13 +
>>  drivers/pci/switch/Makefile |1 +
>>  drivers/pci/switch/switchtec.c  | 1608 
>> +++
>>  include/uapi/linux/switchtec_ioctl.h|  132 ++
>>  10 files changed, 1944 insertions(+)
>>  create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
>>  create mode 100644 Documentation/switchtec.txt
>>  create mode 100644 drivers/pci/switch/Kconfig
>>  create mode 100644 drivers/pci/switch/Makefile
>>  create mode 100644 drivers/pci/switch/switchtec.c
>>  create mode 100644 include/uapi/linux/switchtec_ioctl.h
>>
>> --
>> 2.1.4
>>


[PATCH v3 4/4] switchtec: Add IOCTLs to the Switchtec driver

2017-02-23 Thread Logan Gunthorpe
This patch introduces a couple of special IOCTLs which are provided to:

* Inform userspace of firmware partition locations
* Pass event counts and allow userspace to wait on events
* Translate between PFF numbers used by the switch to
  port numbers.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/ioctl/ioctl-number.txt |   1 +
 Documentation/switchtec.txt  |  27 ++
 MAINTAINERS  |   1 +
 drivers/pci/switch/switchtec.c   | 467 +++
 include/uapi/linux/switchtec_ioctl.h | 132 ++
 5 files changed, 628 insertions(+)
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 81c7f2b..032b33c 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -191,6 +191,7 @@ Code  Seq#(hex) Include FileComments
 'W'00-1F   linux/watchdog.hconflict!
 'W'00-1F   linux/wanrouter.h   conflict!   (pre 3.9)
 'W'00-3F   sound/asound.h  conflict!
+'W'40-5F   drivers/pci/switch/switchtec.c
 'X'all fs/xfs/xfs_fs.h conflict!
and fs/xfs/linux-2.6/xfs_ioctl32.h
and include/linux/falloc.h
diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
index 4bced4c..a0a9c7b 100644
--- a/Documentation/switchtec.txt
+++ b/Documentation/switchtec.txt
@@ -51,3 +51,30 @@ The char device has the following semantics:
 
 * The poll call will also be supported for userspace applications that
   need to do other things while waiting for the command to complete.
+
+The following IOCTLs are also supported by the device:
+
+* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
+  of partitions in the device.
+
+* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for
+  any specified partition in flash.
+
+* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps
+  indicating all uncleared events.
+
+* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags
+  for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct
+  with the event_id, index and flags set (index being the partition or PFF
+  number for non-global events). It returns whether the event has
+  occurred, the number of times and any event specific data. The flags
+  can be used to clear the count or enable and disable actions to
+  happen when the event occurs.
+  By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag,
+  you can set an event to trigger a poll command to return with
+  POLLPRI. In this way, userspace can wait for events to occur.
+
+* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert
+  between PCI Function Framework number (used by the event system)
+  and Switchtec Logic Port ID and Partition number (which is more
+  user friendly).
diff --git a/MAINTAINERS b/MAINTAINERS
index 6fed938..c1a9a30 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9515,6 +9515,7 @@ S:Maintained
 F: Documentation/switchtec.txt
 F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
+F: include/uapi/linux/switchtec_ioctl.h
 
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index c09d335..04d35f4 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -13,6 +13,8 @@
  *
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -755,6 +757,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, 
poll_table *wait)
return ret;
 }
 
+static int ioctl_flash_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+
+   info.flash_length = ioread32(>flash_length);
+   info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS;
+
+   if (copy_to_user(uinfo, , sizeof(info)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info,
+struct partition_info __iomem *pi)
+{
+   info->address = ioread32(>address);
+   info->length = ioread32(>length);
+}
+
+static int ioctl_flash_part_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_part_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_part_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+   u32 active_addr = -1;
+
+   if (copy

[PATCH v3 0/4] New Microsemi PCI Switch Management Driver

2017-02-23 Thread Logan Gunthorpe
Changes since v2:

* Collected reviewed and tested tags
* Very minor fix to the error path in the create function

Changes since v1:

* Rebased onto 4.10-rc6 (cleanly)
* Split the patch into a few more easily digestible patches (as
  suggested by Greg Kroah-Hartman)
* Folded switchtec.c into switchtec.h (per Greg)
* Fixed a bunch of 32bit build warnings caught by the kbuild test robot
* Fixed some issues in the documentation so it has a proper
  reStructredText format (as noted by Jonathan Corbet)
* Fixed padding and sizes in the IOCTL structures as noticed by Emil
  Velikov and used pahole to verify their consistency across 32 and 64
  bit builds
* Reworked one of the IOCTL interfaces to be more future proof (per
  Emil).

Changes since RFC:

* Fixed incorrect use of the drive model as pointed out by Greg
  Kroah-Hartman
* Used devm functions as suggested by Keith Busch
* Added a handful of sysfs attributes to the switchtec class
* Added a handful of IOCTLs to the switchtec device
* A number of miscellaneous bug fixes

--

Hi,

This is a continuation of the RFC we posted lasted month [1] which
proposes a management driver for Microsemi's Switchtec line of PCI
switches. This hardware is still looking to be used in the Open
Compute Platform

To make this entirely clear: the Switchtec products are compliant
with the PCI specifications and are supported today with the standard
in-kernel driver. However, these devices also expose a management endpoint
on a separate PCI function address which can be used to perform some
advanced operations. This is a driver for that function. See the patch
for more information.

Since the RFC, we've made the changes requested by Greg Kroah-Hartman
and Keith Busch, and we've also fleshed out a number of features. We've
added a couple of IOCTLs and sysfs attributes which are documented in
the patch. Significant work has also been done on the userspace tool
which is available under a GPL license at [2]. We've also had testing
done by some of the interested parties.

We hope to see this work included in either 4.11 or 4.12 assuming a
smooth review process.

The patch is based off of the v4.10-rc6 release.

Thanks for your review,

Logan

[1] https://www.spinics.net/lists/linux-pci/msg56897.html
[2] https://github.com/sbates130272/switchtec-user

--

Logan Gunthorpe (4):
  MicroSemi Switchtec management interface driver
  switchtec: Add user interface documentation
  switchtec: Add sysfs attributes to the Switchtec driver
  switchtec: Add IOCTLs to the Switchtec driver

 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1611 +++
 include/uapi/linux/switchtec_ioctl.h|  132 ++
 10 files changed, 1947 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

--
2.1.4


[PATCH v3 3/4] switchtec: Add sysfs attributes to the Switchtec driver

2017-02-23 Thread Logan Gunthorpe
This patch adds a few read-only sysfs attributes which provide
some device information that is exposed from the devices. Primarily
component and device names and versions. These are documented in
Documentation/ABI/testing/sysfs-class-switchtec.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 Documentation/ABI/testing/sysfs-class-switchtec |  96 
 MAINTAINERS |   1 +
 drivers/pci/switch/switchtec.c  | 113 
 3 files changed, 210 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec

diff --git a/Documentation/ABI/testing/sysfs-class-switchtec 
b/Documentation/ABI/testing/sysfs-class-switchtec
new file mode 100644
index 000..48cb4c1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -0,0 +1,96 @@
+switchtec - Microsemi Switchtec PCI Switch Management Endpoint
+
+For details on this subsystem look at Documentation/switchtec.txt.
+
+What:  /sys/class/switchtec
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   The switchtec class subsystem folder.
+   Each registered switchtec driver is represented by a switchtecX
+   subfolder (X being an integer >= 0).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component identifier as stored in the hardware (eg. PM8543)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component revision stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_vendor
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/device_version
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Device version as stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/fw_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Currently running firmware version (read only)
+Values:integer (in hexadecimal).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Partition number for this device in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition_count
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Total number of partitions in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product identifier as stored in the hardware (eg. PSX 48XG3)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product revision stored in the hardware (eg. RevB)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
diff --git a/MAINTAINERS b/MAINTAINERS
index aa702b0..6fed938 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
 F: Documentation/switchtec.txt
+F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
diff --gi

[PATCH v3 1/4] MicroSemi Switchtec management interface driver

2017-02-23 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already well
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint with a separate
PCI function address and class code. This endpoint enables some additional
functionality which includes:

 * Packet and Byte Counters
 * Switch Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
PCI driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls.

A userspace tool and library which utilizes this interface is available
at [1]. This tool takes inspiration (and borrows some code) from
nvme-cli [2]. The tool is largely complete at this time but additional
features may be added in the future.

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 MAINTAINERS|8 +
 drivers/pci/Kconfig|1 +
 drivers/pci/Makefile   |1 +
 drivers/pci/switch/Kconfig |   13 +
 drivers/pci/switch/Makefile|1 +
 drivers/pci/switch/switchtec.c | 1031 
 6 files changed, 1055 insertions(+)
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d137..a57686f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9506,6 +9506,14 @@ S:   Maintained
 F: Documentation/devicetree/bindings/pci/aardvark-pci.txt
 F: drivers/pci/host/pci-aardvark.c
 
+PCI DRIVER FOR MICROSEMI SWITCHTEC
+M: Kurt Schwemmer <kurt.schwem...@microsemi.com>
+M: Stephen Bates <stephen.ba...@microsemi.com>
+M: Logan Gunthorpe <log...@deltatee.com>
+L: linux-...@vger.kernel.org
+S: Maintained
+F: drivers/pci/switch/switchtec*
+
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
 L: linux-te...@vger.kernel.org
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 6555eb7..f72e8c5 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -133,3 +133,4 @@ config PCI_HYPERV
 
 source "drivers/pci/hotplug/Kconfig"
 source "drivers/pci/host/Kconfig"
+source "drivers/pci/switch/Kconfig"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 8db5079..15b46dd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG
 
 # PCI host controller drivers
 obj-y += host/
+obj-y += switch/
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
new file mode 100644
index 000..4c49648
--- /dev/null
+++ b/drivers/pci/switch/Kconfig
@@ -0,0 +1,13 @@
+menu "PCI switch controller drivers"
+   depends on PCI
+
+config PCI_SW_SWITCHTEC
+   tristate "MicroSemi Switchtec PCIe Switch Management Driver"
+   help
+Enables support for the management interface for the MicroSemi
+Switchtec series of PCIe switches. Supports userspace access
+to submit MRPC commands to the switch via /dev/switchtecX
+devices. See  for more
+information.
+
+endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
new file mode 100644
index 000..37d8cfb
--- /dev/null
+++ b/drivers/pci/switch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
new file mode 100644
index 000..f6bcff1
--- /dev/null
+++ b/drivers/pci/switch/switchtec.c
@@ -0,0 +1,1031 @@
+/*
+ * Microsemi Switchtec(tm) PCIe Management Driver
+ * Copyright (c) 2017, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Cor

[PATCH v3 2/4] switchtec: Add user interface documentation

2017-02-23 Thread Logan Gunthorpe
This adds standard documentation for the sysfs switchtec attributes and
a RST formatted text file which documents the char device interface.
Jonathan Corbet has indicated he will move this to a new user-space
developer documentation book once it's created.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/switchtec.txt | 53 +
 MAINTAINERS |  1 +
 2 files changed, 54 insertions(+)
 create mode 100644 Documentation/switchtec.txt

diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
new file mode 100644
index 000..4bced4c
--- /dev/null
+++ b/Documentation/switchtec.txt
@@ -0,0 +1,53 @@
+
+Linux Switchtec Support
+
+
+Microsemi's "Switchtec" line of PCI switch devices is already
+supported by the kernel with standard PCI switch drivers. However, the
+Switchtec device advertises a special management endpoint which
+enables some additional functionality. This includes:
+
+* Packet and Byte Counters
+* Firmware Upgrades
+* Event and Error logs
+* Querying port link status
+* Custom user firmware commands
+
+The switchtec kernel module implements this functionality.
+
+
+Interface
+=
+
+The primary means of communicating with the Switchtec management firmware is
+through the Memory-mapped Remote Procedure Call (MRPC) interface.
+Commands are submitted to the interface with a 4-byte command
+identifier and up to 1KB of command specific data. The firmware will
+respond with a 4 bytes return code and up to 1KB of command specific
+data. The interface only processes a single command at a time.
+
+
+Userspace Interface
+===
+
+The MRPC interface will be exposed to userspace through a simple char
+device: /dev/switchtec#, one for each management endpoint in the system.
+
+The char device has the following semantics:
+
+* A write must consist of at least 4 bytes and no more than 1028 bytes.
+  The first four bytes will be interpreted as the command to run and
+  the remainder will be used as the input data. A write will send the
+  command to the firmware to begin processing.
+
+* Each write must be followed by exactly one read. Any double write will
+  produce an error and any read that doesn't follow a write will
+  produce an error.
+
+* A read will block until the firmware completes the command and return
+  the four bytes of status plus up to 1024 bytes of output data. (The
+  length will be specified by the size parameter of the read call --
+  reading less than 4 bytes will produce an error.
+
+* The poll call will also be supported for userspace applications that
+  need to do other things while waiting for the command to complete.
diff --git a/MAINTAINERS b/MAINTAINERS
index a57686f..aa702b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com>
 M: Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
+F: Documentation/switchtec.txt
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
-- 
2.1.4



Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver

2017-02-24 Thread Logan Gunthorpe
Hey Bjorn,

Thanks for the thorough review. It definitely helped a lot to make the
code as good as it can be.

I've made all of the changes you suggested. I'm just going to do a bit
more testing and then post a v4. See my responses to all of your
feedback bellow.

Logan

On 23/02/17 05:35 PM, Bjorn Helgaas wrote:
> Would it make any sense to integrate this with the perf tool?  It
> looks like switchtec-user measures bandwidth and latency, which sounds
> sort of perf-ish.

That would be cool but lets file that under potential future work. This
driver also enables more than bandwidth and latency so it's still
required for us.

>> +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver");
> 
> Nit: s/PCI-E/PCIe/
> 

Fixed.

>> +MODULE_VERSION("0.1");
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Microsemi Corporation");
>> +
>> +static int max_devices = 16;
> 
> This static initialization is slightly misleading because
> switchtec_init() immediately sets max_devices to at least 256.

Oops copied that from dax without thinking. I've just removed the max()
call in the init function.


>> +atomic_t event_cnt;
> 
> Why is this atomic_t?  You only do an atomic_set() (in stdev_create())
> and atomic_reads() -- no writes other than the initial one.  So I'm
> not sure what value atomic_t brings.

If you looked at a following patch in the series you'd see that there's
an atomic_inc in the ISR. The splitting of the series wasn't as precise
as it maybe should have been and thus this became a bit confusing.

>> +memcpy_fromio(stuser->data, >mmio_mrpc->output_data,
>> +  sizeof(stuser->data));
> 
> Apparently you always get 1K of data back, no matter what?

Yes, sort of unfortunately. Because these transactions can occur before
the user actually does the read, we don't know how much data the user
wants. This may be a performance improvement in the future (ie. if the
user reads before the MRPC transaction comes back, then only read the
requested amount). But we will leave it this way for now.


>> +if (!stdev_is_alive(stdev))
>> +return -ENXIO;
> 
> What exactly does this protect against?  Is it OK if stdev_is_alive()
> becomes false immediately after we check?

Yup, you're right: that's racy. Turns out cleanup is hard and I've
learned a lot even in the short time since I wrote this code. I've
gotten rid of stdev_is_alive and moved the pci cleanup code into the
device's release function so they won't occur until the last user closes
the cdev. I think that should work better but please let me know if you
see an issue with this.

>> +
>> +if (size < sizeof(stuser->cmd) ||
>> +size > sizeof(stuser->cmd) + SWITCHTEC_MRPC_PAYLOAD_SIZE)
> 
> I think I would use "sizeof(stuser->data)" instead of
> SWITCHTEC_MRPC_PAYLOAD_SIZE so it matches what mrpc_complete_cmd()
> does.  Same below in switchtec_dev_read().

Ok.

> mrpc_mutex doesn't protect the stuser things, does it?  If not, you
> could do this without the gotos.  And I think it's a little easier to
> be confident there are no buffer overruns:

I was using the mutex to protect stuser->state as well. But I've made
your changes and just adjusted stuser->state to be atomic and I think
this should be just as good.

>> +static int switchtec_init_msix_isr(struct switchtec_dev *stdev)
>> +{
>> +struct pci_dev *pdev = stdev->pdev;
>> +int rc, i, msix_count, node;
>> +
>> +node = dev_to_node(>dev);
> 
> Unused?

Yup, I've removed it.

>> +for (i = 0; i < ARRAY_SIZE(stdev->msix); ++i)
>> +stdev->msix[i].entry = i;
>> +
>> +msix_count = pci_enable_msix_range(pdev, stdev->msix, 1,
>> +   ARRAY_SIZE(stdev->msix));
>> +if (msix_count < 0)
>> +return msix_count;
> 
> Maybe this could be simplified by using pci_alloc_irq_vectors()?

Yup! I wasn't aware of that function. It's a _huge_ improvement. Thanks.
Still I'd really appreciate a quick re-review after I post v4 just to
make sure I got it correct.

>> +stdev->event_irq = ioread32(>mmio_part_cfg->vep_vector_number);
>> +if (stdev->event_irq < 0 || stdev->event_irq >= msix_count) {
>> +rc = -EFAULT;
>> +goto err_msix_request;
>> +}
> 
> Not sure what this is doing, but you immediately overwrite
> stdev->event_irq below.  If you're using it as just a temporary above
> for doing some validation, I would just use a local variable to avoid
> the connection with stdev->event_irq.

Yes, I made this temporary.

>> +stdev->event_irq = stdev->msix[stdev->event_irq].vector;
> 
> Oh, I guess you allocate several MSI-X vectors, but you only actually
> use one of them?  Why is that?  I was confused about why you allocate
> several vectors, but there's only one devm_request_irq() call below.

The event_irq is configurable in hardware and won't necessarily be the
first irq available. (It should always be in the first four.) As I
understand it, we need to allocate all of 

[PATCH v4 4/4] switchtec: Add IOCTLs to the Switchtec driver

2017-02-24 Thread Logan Gunthorpe
This patch introduces a couple of special IOCTLs which are provided to:

* Inform userspace of firmware partition locations
* Pass event counts and allow userspace to wait on events
* Translate between PFF numbers used by the switch to
  port numbers.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/ioctl/ioctl-number.txt |   1 +
 Documentation/switchtec.txt  |  27 ++
 MAINTAINERS  |   1 +
 drivers/pci/switch/switchtec.c   | 467 +++
 include/uapi/linux/switchtec_ioctl.h | 132 ++
 5 files changed, 628 insertions(+)
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 81c7f2b..032b33c 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -191,6 +191,7 @@ Code  Seq#(hex) Include FileComments
 'W'00-1F   linux/watchdog.hconflict!
 'W'00-1F   linux/wanrouter.h   conflict!   (pre 3.9)
 'W'00-3F   sound/asound.h  conflict!
+'W'40-5F   drivers/pci/switch/switchtec.c
 'X'all fs/xfs/xfs_fs.h conflict!
and fs/xfs/linux-2.6/xfs_ioctl32.h
and include/linux/falloc.h
diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
index 4bced4c..a0a9c7b 100644
--- a/Documentation/switchtec.txt
+++ b/Documentation/switchtec.txt
@@ -51,3 +51,30 @@ The char device has the following semantics:
 
 * The poll call will also be supported for userspace applications that
   need to do other things while waiting for the command to complete.
+
+The following IOCTLs are also supported by the device:
+
+* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
+  of partitions in the device.
+
+* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for
+  any specified partition in flash.
+
+* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps
+  indicating all uncleared events.
+
+* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags
+  for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct
+  with the event_id, index and flags set (index being the partition or PFF
+  number for non-global events). It returns whether the event has
+  occurred, the number of times and any event specific data. The flags
+  can be used to clear the count or enable and disable actions to
+  happen when the event occurs.
+  By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag,
+  you can set an event to trigger a poll command to return with
+  POLLPRI. In this way, userspace can wait for events to occur.
+
+* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert
+  between PCI Function Framework number (used by the event system)
+  and Switchtec Logic Port ID and Partition number (which is more
+  user friendly).
diff --git a/MAINTAINERS b/MAINTAINERS
index 6fed938..c1a9a30 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9515,6 +9515,7 @@ S:Maintained
 F: Documentation/switchtec.txt
 F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
+F: include/uapi/linux/switchtec_ioctl.h
 
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index 2e3a45b..18286d3 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -13,6 +13,8 @@
  *
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -717,6 +719,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, 
poll_table *wait)
return ret;
 }
 
+static int ioctl_flash_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+
+   info.flash_length = ioread32(>flash_length);
+   info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS;
+
+   if (copy_to_user(uinfo, , sizeof(info)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info,
+struct partition_info __iomem *pi)
+{
+   info->address = ioread32(>address);
+   info->length = ioread32(>length);
+}
+
+static int ioctl_flash_part_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_part_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_part_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+   u32 active_addr = -1;
+
+   if (copy

[PATCH v4 0/4] New Microsemi PCI Switch Management Driver

2017-02-24 Thread Logan Gunthorpe
Changes since v3:

* Removed stdev_is_alive and moved pci deinit functions
  into the device release function which only occurs after all
  cdev instances are closed. (per Bjorn)
* Reduced locking in read/write functions (per Bjorn)
* Use pci_alloc_irq_vectors to vastly improve ISR initialization (Bjorn)
* A few other minor nits and cleanup as noticed by Bjorn

Changes since v2:

* Collected reviewed and tested tags
* Very minor fix to the error path in the create function

Changes since v1:

* Rebased onto 4.10-rc6 (cleanly)
* Split the patch into a few more easily digestible patches (as
  suggested by Greg Kroah-Hartman)
* Folded switchtec.c into switchtec.h (per Greg)
* Fixed a bunch of 32bit build warnings caught by the kbuild test robot
* Fixed some issues in the documentation so it has a proper
  reStructredText format (as noted by Jonathan Corbet)
* Fixed padding and sizes in the IOCTL structures as noticed by Emil
  Velikov and used pahole to verify their consistency across 32 and 64
  bit builds
* Reworked one of the IOCTL interfaces to be more future proof (per
  Emil).

Changes since RFC:

* Fixed incorrect use of the drive model as pointed out by Greg
  Kroah-Hartman
* Used devm functions as suggested by Keith Busch
* Added a handful of sysfs attributes to the switchtec class
* Added a handful of IOCTLs to the switchtec device
* A number of miscellaneous bug fixes

--

Hi,

This is a continuation of the RFC we posted lasted month [1] which
proposes a management driver for Microsemi's Switchtec line of PCI
switches. This hardware is still looking to be used in the Open
Compute Platform

To make this entirely clear: the Switchtec products are compliant
with the PCI specifications and are supported today with the standard
in-kernel driver. However, these devices also expose a management endpoint
on a separate PCI function address which can be used to perform some
advanced operations. This is a driver for that function. See the patch
for more information.

Since the RFC, we've made the changes requested by Greg Kroah-Hartman
and Keith Busch, and we've also fleshed out a number of features. We've
added a couple of IOCTLs and sysfs attributes which are documented in
the patch. Significant work has also been done on the userspace tool
which is available under a GPL license at [2]. We've also had testing
done by some of the interested parties.

We hope to see this work included in either 4.11 or 4.12 assuming a
smooth review process.

The patch is based off of the v4.10-rc6 release.

Thanks for your review,

Logan

[1] https://www.spinics.net/lists/linux-pci/msg56897.html
[2] https://github.com/sbates130272/switchtec-user

Logan Gunthorpe (4):
  MicroSemi Switchtec management interface driver
  switchtec: Add user interface documentation
  switchtec: Add sysfs attributes to the Switchtec driver
  switchtec: Add IOCTLs to the Switchtec driver

 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1503 +++
 include/uapi/linux/switchtec_ioctl.h|  132 ++
 10 files changed, 1839 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

--
2.1.4


[PATCH v4 3/4] switchtec: Add sysfs attributes to the Switchtec driver

2017-02-24 Thread Logan Gunthorpe
This patch adds a few read-only sysfs attributes which provide
some device information that is exposed from the devices. Primarily
component and device names and versions. These are documented in
Documentation/ABI/testing/sysfs-class-switchtec.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 Documentation/ABI/testing/sysfs-class-switchtec |  96 
 MAINTAINERS |   1 +
 drivers/pci/switch/switchtec.c  | 113 
 3 files changed, 210 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec

diff --git a/Documentation/ABI/testing/sysfs-class-switchtec 
b/Documentation/ABI/testing/sysfs-class-switchtec
new file mode 100644
index 000..48cb4c1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -0,0 +1,96 @@
+switchtec - Microsemi Switchtec PCI Switch Management Endpoint
+
+For details on this subsystem look at Documentation/switchtec.txt.
+
+What:  /sys/class/switchtec
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   The switchtec class subsystem folder.
+   Each registered switchtec driver is represented by a switchtecX
+   subfolder (X being an integer >= 0).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component identifier as stored in the hardware (eg. PM8543)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component revision stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_vendor
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/device_version
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Device version as stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/fw_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Currently running firmware version (read only)
+Values:integer (in hexadecimal).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Partition number for this device in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition_count
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Total number of partitions in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product identifier as stored in the hardware (eg. PSX 48XG3)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product revision stored in the hardware (eg. RevB)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
diff --git a/MAINTAINERS b/MAINTAINERS
index aa702b0..6fed938 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
 F: Documentation/switchtec.txt
+F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
diff --gi

[PATCH v4 1/4] MicroSemi Switchtec management interface driver

2017-02-24 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already well
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint with a separate
PCI function address and class code. This endpoint enables some additional
functionality which includes:

 * Packet and Byte Counters
 * Switch Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
PCI driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls.

A userspace tool and library which utilizes this interface is available
at [1]. This tool takes inspiration (and borrows some code) from
nvme-cli [2]. The tool is largely complete at this time but additional
features may be added in the future.

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 MAINTAINERS|   8 +
 drivers/pci/Kconfig|   1 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/switch/Kconfig |  13 +
 drivers/pci/switch/Makefile|   1 +
 drivers/pci/switch/switchtec.c | 923 +
 6 files changed, 947 insertions(+)
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d137..a57686f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9506,6 +9506,14 @@ S:   Maintained
 F: Documentation/devicetree/bindings/pci/aardvark-pci.txt
 F: drivers/pci/host/pci-aardvark.c
 
+PCI DRIVER FOR MICROSEMI SWITCHTEC
+M: Kurt Schwemmer <kurt.schwem...@microsemi.com>
+M: Stephen Bates <stephen.ba...@microsemi.com>
+M: Logan Gunthorpe <log...@deltatee.com>
+L: linux-...@vger.kernel.org
+S: Maintained
+F: drivers/pci/switch/switchtec*
+
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
 L: linux-te...@vger.kernel.org
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 6555eb7..f72e8c5 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -133,3 +133,4 @@ config PCI_HYPERV
 
 source "drivers/pci/hotplug/Kconfig"
 source "drivers/pci/host/Kconfig"
+source "drivers/pci/switch/Kconfig"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 8db5079..15b46dd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG
 
 # PCI host controller drivers
 obj-y += host/
+obj-y += switch/
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
new file mode 100644
index 000..4c49648
--- /dev/null
+++ b/drivers/pci/switch/Kconfig
@@ -0,0 +1,13 @@
+menu "PCI switch controller drivers"
+   depends on PCI
+
+config PCI_SW_SWITCHTEC
+   tristate "MicroSemi Switchtec PCIe Switch Management Driver"
+   help
+Enables support for the management interface for the MicroSemi
+Switchtec series of PCIe switches. Supports userspace access
+to submit MRPC commands to the switch via /dev/switchtecX
+devices. See  for more
+information.
+
+endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
new file mode 100644
index 000..37d8cfb
--- /dev/null
+++ b/drivers/pci/switch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
new file mode 100644
index 000..f5813e0
--- /dev/null
+++ b/drivers/pci/switch/switchtec.c
@@ -0,0 +1,923 @@
+/*
+ * Microsemi Switchtec(tm) PCIe Management Driver
+ * Copyright (c) 2017, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCIe Management Driver");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation"

[PATCH v4 2/4] switchtec: Add user interface documentation

2017-02-24 Thread Logan Gunthorpe
This adds standard documentation for the sysfs switchtec attributes and
a RST formatted text file which documents the char device interface.
Jonathan Corbet has indicated he will move this to a new user-space
developer documentation book once it's created.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/switchtec.txt | 53 +
 MAINTAINERS |  1 +
 2 files changed, 54 insertions(+)
 create mode 100644 Documentation/switchtec.txt

diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
new file mode 100644
index 000..4bced4c
--- /dev/null
+++ b/Documentation/switchtec.txt
@@ -0,0 +1,53 @@
+
+Linux Switchtec Support
+
+
+Microsemi's "Switchtec" line of PCI switch devices is already
+supported by the kernel with standard PCI switch drivers. However, the
+Switchtec device advertises a special management endpoint which
+enables some additional functionality. This includes:
+
+* Packet and Byte Counters
+* Firmware Upgrades
+* Event and Error logs
+* Querying port link status
+* Custom user firmware commands
+
+The switchtec kernel module implements this functionality.
+
+
+Interface
+=
+
+The primary means of communicating with the Switchtec management firmware is
+through the Memory-mapped Remote Procedure Call (MRPC) interface.
+Commands are submitted to the interface with a 4-byte command
+identifier and up to 1KB of command specific data. The firmware will
+respond with a 4 bytes return code and up to 1KB of command specific
+data. The interface only processes a single command at a time.
+
+
+Userspace Interface
+===
+
+The MRPC interface will be exposed to userspace through a simple char
+device: /dev/switchtec#, one for each management endpoint in the system.
+
+The char device has the following semantics:
+
+* A write must consist of at least 4 bytes and no more than 1028 bytes.
+  The first four bytes will be interpreted as the command to run and
+  the remainder will be used as the input data. A write will send the
+  command to the firmware to begin processing.
+
+* Each write must be followed by exactly one read. Any double write will
+  produce an error and any read that doesn't follow a write will
+  produce an error.
+
+* A read will block until the firmware completes the command and return
+  the four bytes of status plus up to 1024 bytes of output data. (The
+  length will be specified by the size parameter of the read call --
+  reading less than 4 bytes will produce an error.
+
+* The poll call will also be supported for userspace applications that
+  need to do other things while waiting for the command to complete.
diff --git a/MAINTAINERS b/MAINTAINERS
index a57686f..aa702b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com>
 M: Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
+F: Documentation/switchtec.txt
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
-- 
2.1.4



Re: [PATCH] switchtec: cleanup cdev init

2017-02-18 Thread Logan Gunthorpe
Hi,

Please don't apply this patch and instead apply the switchtec driver as
we submitted in v2. As per the discussion in [1], not using the cdev's
kobj parent results in incorrect reference counting and a possible use
of the cdev after its containing structure is freed.

I've also done a quick review around the kernel and found the pattern in
question to be quite prevalent. It's used in, at least, these drivers:

drivers/dax/dax.c:708
drivers/input/evdev.c:1419
drivers/input/joydev.c:908
drivers/input/mousedev.c:904
drivers/gpio/gpiolib.c:1039
drivers/char/tpm/tpm-chip.c:190
drivers/platform/chrome/cros_ec_dev.c:411
drivers/infiniband/hw/hfi1/device.c:72
drivers/infiniband/core/uverbs_main.c:1168
drivers/infiniband/core/user_mad.c:1186
drivers/infiniband/core/user_mad.c:1205
drivers/iio/industrialio-core.c:1721
drivers/media/cec/cec-core.c:140
drivers/media/media-devnode.c:258

Dan Williams has proposed a helper API in [2] that could make this code
appear significantly less suspect which I think would be a good idea.
However, for now, I feel the switchtec code should also follow this
pattern (ie. the way it was submitted in v2) and can be changed or
cleaned up when/if the above numerous examples are fixed up.

Thanks,

Logan

[1] https://lkml.org/lkml/2017/2/10/589
[2] https://lkml.org/lkml/2017/2/13/700




On 10/02/17 10:57 AM, Logan Gunthorpe wrote:
> Per a suggestion from Greg Kroah-Hartman [1], don't set the cdev's
> kobject parent. This allows us to use device_register instead of init
> and add.
> 
> [1] https://lkml.org/lkml/2017/2/10/370
> ---
>  drivers/pci/switch/switchtec.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
> index 82bfd18..014eaec 100644
> --- a/drivers/pci/switch/switchtec.c
> +++ b/drivers/pci/switch/switchtec.c
> @@ -1222,24 +1222,23 @@ static struct switchtec_dev *stdev_create(struct 
> pci_dev *pdev)
>   return ERR_PTR(minor);
>  
>   dev = >dev;
> - device_initialize(dev);
>   dev->devt = MKDEV(MAJOR(switchtec_devt), minor);
> - dev->class = switchtec_class;
> - dev->parent = >dev;
> - dev->groups = switchtec_device_groups;
> - dev->release = stdev_release;
> - dev_set_name(dev, "switchtec%d", minor);
>  
>   cdev = >cdev;
>   cdev_init(cdev, _fops);
>   cdev->owner = THIS_MODULE;
> - cdev->kobj.parent = >kobj;
>  
>   rc = cdev_add(>cdev, dev->devt, 1);
>   if (rc)
>   goto err_cdev;
>  
> - rc = device_add(dev);
> + dev->class = switchtec_class;
> + dev->parent = >dev;
> + dev->groups = switchtec_device_groups;
> + dev->release = stdev_release;
> + dev_set_name(dev, "switchtec%d", minor);
> +
> + rc = device_register(dev);
>   if (rc) {
>   cdev_del(>cdev);
>   put_device(dev);
> 


Re: [PATCH] switchtec: cleanup cdev init

2017-02-19 Thread Logan Gunthorpe


On 19/02/17 02:43 PM, Dan Williams wrote:
> Is this race present for all file operations? I've only seen it with
> mmap() and late faults. So if these other drivers do not support mmap
> it's not clear they can trigger the failure.

I don't see how it's related to mmap only. I think there's a number of
variations on this but the race I see happens if you try to take a
reference to the device in the open/close handlers of a char device file.

If a driver puts the last remaining reference in the release handler,
then the device structure will be freed, and on returning from the
release handler, the char_dev code will try to put the last reference to
the cdev structure which may have already been free'd. There needs to be
a way for the cdev structure to take a reference on the device structure
so it doesn't get freed and the kobj.parent trick seems to accomplish
this fairly well.

Really, in any situation where there's a cdev and a device in the same
structure, the life cycles of the two become linked but their reference
counts are not and that is the problem here.

I feel like a number of developers have made the same realization
independently so this would probably be a good thing to clean up.

See these commits for examples spanning around 5 years:

4a215aa Input: fix use-after-free introduced with dynamic minor changes
0d5b7da iio: Prevent race between IIO chardev opening and IIO device
ba0ef85 tpm: Fix initialization of the cdev

Also, seems both my brother and I have independently looked into this
exact issue:

https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html

So, I think it would be a good idea to clean it up with Dan's proposed
API cleanup. If I have some time tomorrow, I may throw together a patch
set with that patch and the changes to the instances around the kernel.

Logan


Re: [PATCH] device-dax: don't set kobj parent during cdev init

2017-02-11 Thread Logan Gunthorpe

On 11/02/17 11:27 AM, Dan Williams wrote:
> Why, when the lifetime of the cdev is already correct?

Well, it's only correct if you use the kobj parent trick which Greg is
arguing against. As someone reviewing/copying code that trick is
unclear, undocumented and it looks rather odd messing with internal
kobjects. Taking the explicit reference would be very clear, very
standard and only net one additional line.

> See commit ba09c01d2fa8 "dax: convert to the cdev api". I used to take
> explicit references like you suggest, but cdev made it cleaner.

I agree that, on the whole, that patch makes things a good deal cleaner.
I'm not so sure that this one small aspect is an improvement.

In any case, it's up to you. If you'd like I can certainly submit a v2
patch that adds the get/put.

Thanks,

Logan


Re: [PATCH] device-dax: don't set kobj parent during cdev init

2017-02-13 Thread Logan Gunthorpe
Hey,

I like the interface. It's just Greg that needs to comment on whether
using the kobj.parent for this purpose is actually sane. That was his
argument from the beginning.

Logan

On 13/02/17 01:47 PM, Dan Williams wrote:
> On Sat, Feb 11, 2017 at 9:42 PM, Logan Gunthorpe <log...@deltatee.com> wrote:
>> On 11/02/17 11:58 AM, Dan Williams wrote:
>>> Also when using an embedded cdev how would you recommend avoiding this 
>>> problem?
>>
>> I don't know. Hopefully, Greg has a good idea. But it sounds like a
>> general problem that a lot of cdev's actually suffer from without
>> realizing. Perhaps we need a more general solution. Some way for the
>> cdev to reference its containing structure in a way that it's designed
>> for such that anyone writing a driver will do the right thing without
>> needing to dive into the kobjects.
>>
> 
> How about something like the below? I.e. hide the details with a new
> helper api so that all driver writers need to worry about is the
> parent device and cdev_del(). This is similar to the device_add_disk()
> and del_gendisk() pairing that we have for block-device drivers.
> 
> diff --git a/fs/char_dev.c b/fs/char_dev.c
> index 3bc97002c86f..4c5d51cdd353 100644
> --- a/fs/char_dev.c
> +++ b/fs/char_dev.c
> @@ -471,6 +471,12 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count)
> return 0;
>  }
> 
> +int device_add_cdev(struct device *dev, struct cdev *p)
> +{
> +   p->kobj.parent = >kobj;
> +   return cdev_add(p, dev->devt, 1);
> +}
> +
>  static void cdev_unmap(dev_t dev, unsigned count)
>  {
> kobj_unmap(cdev_map, dev, count);
> diff --git a/include/linux/cdev.h b/include/linux/cdev.h
> index f8763615a5f2..043168df1d8f 100644
> --- a/include/linux/cdev.h
> +++ b/include/linux/cdev.h
> @@ -25,6 +25,7 @@ struct cdev *cdev_alloc(void);
>  void cdev_put(struct cdev *p);
> 
>  int cdev_add(struct cdev *, dev_t, unsigned);
> +int device_add_cdev(struct device *dev, struct cdev *);
> 
>  void cdev_del(struct cdev *);
> 


Re: [PATCH] device-dax: don't set kobj parent during cdev init

2017-02-11 Thread Logan Gunthorpe
On 11/02/17 11:58 AM, Dan Williams wrote:
> Also when using an embedded cdev how would you recommend avoiding this 
> problem?

I don't know. Hopefully, Greg has a good idea. But it sounds like a
general problem that a lot of cdev's actually suffer from without
realizing. Perhaps we need a more general solution. Some way for the
cdev to reference its containing structure in a way that it's designed
for such that anyone writing a driver will do the right thing without
needing to dive into the kobjects.

Logan


Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver

2017-02-10 Thread Logan Gunthorpe
Hey Greg,

Thanks so much for the review.

On 10/02/17 07:51 AM, Greg Kroah-Hartman wrote:
> On Thu, Feb 02, 2017 at 11:06:00AM -0700, Logan Gunthorpe wrote:
>> +cdev = >cdev;
>> +cdev_init(cdev, _fops);
>> +cdev->owner = THIS_MODULE;
>> +cdev->kobj.parent = >kobj;
> 
> Minor nit, the kobject in a cdev is unlike any other kobject you have
> ever seen, don't mess with it, it's not doing anything like you think it
> is doing.  So no need to set the parent field.

Ok, that makes sense. I'll do a v3 shortly.

I copied this from drivers/dax/dax.c so when I have a spare moment I'll
submit a patch to remove it from there as well.

Just to make sure I get this right without extra churn: does this look
correct?


cdev = >cdev;
cdev_init(cdev, _fops);
cdev->owner = THIS_MODULE;

rc = cdev_add(>cdev, dev->devt, 1);
if (rc)
goto err_cdev;

dev = >dev;
dev->devt = MKDEV(MAJOR(switchtec_devt), minor);
dev->class = switchtec_class;
dev->parent = >dev;
dev->groups = switchtec_device_groups;
dev->release = stdev_release;
dev_set_name(dev, "switchtec%d", minor);

rc = device_register(dev);
if (rc) {
cdev_del(>cdev);
put_device(dev);
return ERR_PTR(rc);
}


Thanks,

Logan


[PATCH] device-dax: don't set kobj parent during cdev init

2017-02-10 Thread Logan Gunthorpe
I copied this code and per feedback from Greg Kroah-Hartman [1] the
cdev's kobject's parent should not be set to the related device.
This should have minor consequences but isn't doing what anyone
expects it to.

This patch then fixes device-dax so it doesn't make the same mistake.

[1] https://lkml.org/lkml/2017/2/10/370

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 drivers/dax/dax.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index ed758b7..24e53b7 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -699,13 +699,9 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
*dax_region,
goto err_inode;
}
 
-   /* device_initialize() so cdev can reference kobj parent */
-   device_initialize(dev);
-
cdev = _dev->cdev;
cdev_init(cdev, _fops);
cdev->owner = parent->driver->owner;
-   cdev->kobj.parent = >kobj;
rc = cdev_add(_dev->cdev, dev_t, 1);
if (rc)
goto err_cdev;
@@ -722,7 +718,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
*dax_region,
dev->groups = dax_attribute_groups;
dev->release = dax_dev_release;
dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id);
-   rc = device_add(dev);
+   rc = device_register(dev);
if (rc) {
put_device(dev);
return ERR_PTR(rc);
-- 
2.1.4



Re: [PATCH] device-dax: don't set kobj parent during cdev init

2017-02-10 Thread Logan Gunthorpe
Hey,

Also on the subject of very minor fixes: I noticed drivers/dax is not in
the maintainers file. I just assumed the nvdimm list should have been
included with those from get_maintainers.

Thanks,

Logan

On 10/02/17 12:19 PM, Logan Gunthorpe wrote:
> I copied this code and per feedback from Greg Kroah-Hartman [1] the
> cdev's kobject's parent should not be set to the related device.
> This should have minor consequences but isn't doing what anyone
> expects it to.
> 
> This patch then fixes device-dax so it doesn't make the same mistake.
> 
> [1] https://lkml.org/lkml/2017/2/10/370
> 
> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
> ---
>  drivers/dax/dax.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> index ed758b7..24e53b7 100644
> --- a/drivers/dax/dax.c
> +++ b/drivers/dax/dax.c
> @@ -699,13 +699,9 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
> *dax_region,
>   goto err_inode;
>   }
>  
> - /* device_initialize() so cdev can reference kobj parent */
> - device_initialize(dev);
> -
>   cdev = _dev->cdev;
>   cdev_init(cdev, _fops);
>   cdev->owner = parent->driver->owner;
> - cdev->kobj.parent = >kobj;
>   rc = cdev_add(_dev->cdev, dev_t, 1);
>   if (rc)
>   goto err_cdev;
> @@ -722,7 +718,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
> *dax_region,
>   dev->groups = dax_attribute_groups;
>   dev->release = dax_dev_release;
>   dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id);
> - rc = device_add(dev);
> + rc = device_register(dev);
>   if (rc) {
>   put_device(dev);
>   return ERR_PTR(rc);
> 


[PATCH] switchtec: cleanup cdev init

2017-02-10 Thread Logan Gunthorpe
Per a suggestion from Greg Kroah-Hartman [1], don't set the cdev's
kobject parent. This allows us to use device_register instead of init
and add.

[1] https://lkml.org/lkml/2017/2/10/370
---
 drivers/pci/switch/switchtec.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index 82bfd18..014eaec 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -1222,24 +1222,23 @@ static struct switchtec_dev *stdev_create(struct 
pci_dev *pdev)
return ERR_PTR(minor);
 
dev = >dev;
-   device_initialize(dev);
dev->devt = MKDEV(MAJOR(switchtec_devt), minor);
-   dev->class = switchtec_class;
-   dev->parent = >dev;
-   dev->groups = switchtec_device_groups;
-   dev->release = stdev_release;
-   dev_set_name(dev, "switchtec%d", minor);
 
cdev = >cdev;
cdev_init(cdev, _fops);
cdev->owner = THIS_MODULE;
-   cdev->kobj.parent = >kobj;
 
rc = cdev_add(>cdev, dev->devt, 1);
if (rc)
goto err_cdev;
 
-   rc = device_add(dev);
+   dev->class = switchtec_class;
+   dev->parent = >dev;
+   dev->groups = switchtec_device_groups;
+   dev->release = stdev_release;
+   dev_set_name(dev, "switchtec%d", minor);
+
+   rc = device_register(dev);
if (rc) {
cdev_del(>cdev);
put_device(dev);
-- 
2.1.4



Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver

2017-02-10 Thread Logan Gunthorpe


On 10/02/17 10:09 AM, Greg Kroah-Hartman wrote:
> Sure, or just wait for these to be applied to the PCI tree and then send
> a follow-on patch.  It's up to Bjorn really, as to what he wants.

Ok, I sent a working follow-on patch to this thread.

@Bjorn: I'm happy to send the patches however you like. Just let me
know. If at all possible, we'd really like to see this in for the 4.11
merge window.

Thanks,

Logan


Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver

2017-02-10 Thread Logan Gunthorpe


On 10/02/17 09:55 AM, Greg Kroah-Hartman wrote:
> Yes, but try it yourself to verify it really is correct :)

Yes, of course... turns out it wasn't. I accidentally refereed to dev
before I assigned it. I had mainly just wanted your feedback to ensure
that switching to device_register made sense.

> And it can just be an add-on patch, no need to respin a whole new
> version for just that simple change, it doesn't hurt anything as-is,
> it's just "not needed".

Ok... How should I do that exactly? Attempt to reply to the thread with
another patch?

Thanks,

Logan


Re: [PATCH] device-dax: don't set kobj parent during cdev init

2017-02-11 Thread Logan Gunthorpe

On 11/02/17 01:56 AM, Dan Williams wrote:

When the device is unregistered it invalidates all existing mappings,
but the driver may continue to service vm fault requests until the
final put of the cdev. Until that time the fault handler needs to be
able to check dax_dev->alive. Since the final cdev put is handled by
the vfs I use the cdev's kobject to keep the struct dax_dev instance
alive.


I'm just taking a wild stab at this, but would it not make sense to just 
take a reference to the dax_dev device in dax_open and put it back it in 
dax_release? (Or perhaps, in the open/close of the vm_ops.) That way the 
structure won't be free'd until there are no users and alive will always 
be accessible.


It would also be a bit more clear as to what's going on because you are 
actually making a reference in filp->private_data.


Logan


[PATCH v5 3/4] switchtec: Add sysfs attributes to the Switchtec driver

2017-02-25 Thread Logan Gunthorpe
This patch adds a few read-only sysfs attributes which provide
some device information that is exposed from the devices. Primarily
component and device names and versions. These are documented in
Documentation/ABI/testing/sysfs-class-switchtec.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 Documentation/ABI/testing/sysfs-class-switchtec |  96 
 MAINTAINERS |   1 +
 drivers/pci/switch/switchtec.c  | 113 
 3 files changed, 210 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec

diff --git a/Documentation/ABI/testing/sysfs-class-switchtec 
b/Documentation/ABI/testing/sysfs-class-switchtec
new file mode 100644
index 000..48cb4c1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -0,0 +1,96 @@
+switchtec - Microsemi Switchtec PCI Switch Management Endpoint
+
+For details on this subsystem look at Documentation/switchtec.txt.
+
+What:  /sys/class/switchtec
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   The switchtec class subsystem folder.
+   Each registered switchtec driver is represented by a switchtecX
+   subfolder (X being an integer >= 0).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component identifier as stored in the hardware (eg. PM8543)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component revision stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/component_vendor
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Component vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/device_version
+Date:      05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Device version as stored in the hardware (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/fw_version
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Currently running firmware version (read only)
+Values:integer (in hexadecimal).
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Partition number for this device in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/partition_count
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Total number of partitions in the switch (read only)
+Values:integer.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_id
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product identifier as stored in the hardware (eg. PSX 48XG3)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_revision
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product revision stored in the hardware (eg. RevB)
+   (read only)
+Values:arbitrary string.
+
+
+What:  /sys/class/switchtec/switchtec[0-9]+/product_vendor
+Date:  05-Jan-2017
+KernelVersion: v4.11
+Contact:   Logan Gunthorpe <log...@deltatee.com>
+Description:   Product vendor as stored in the hardware (eg. MICROSEM)
+   (read only)
+Values:arbitrary string.
diff --git a/MAINTAINERS b/MAINTAINERS
index aa702b0..6fed938 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
 F: Documentation/switchtec.txt
+F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
diff --gi

[PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-02-25 Thread Logan Gunthorpe
Changes since v4:

* Turns out pushing the pci release code into the device release
  function didn't work as I would have liked. If you try to unbind the
  device with an instance open, then you hit a kernel bug at
  drivers/pci/msi.c:371. (This didn't occur in v3.)

  To solve this, we've moved the pci release code back into the
  unregister function and reintroduced an alive flag. This time,
  however, the alive flag is protected by mrpc_mutex and we're very
  careful about what happens to devices still in use (they should
  all be released through the timeout path and an ENODEV error
  returned to userspace; while new commands are blocked with the
  same error).

Changes since v3:

* Removed stdev_is_alive and moved pci deinit functions
  into the device release function which only occurs after all
  cdev instances are closed. (per Bjorn)
* Reduced locking in read/write functions (per Bjorn)
* Use pci_alloc_irq_vectors to vastly improve ISR initialization (Bjorn)
* A few other minor nits and cleanup as noticed by Bjorn

Changes since v2:

* Collected reviewed and tested tags
* Very minor fix to the error path in the create function

Changes since v1:

* Rebased onto 4.10-rc6 (cleanly)
* Split the patch into a few more easily digestible patches (as
  suggested by Greg Kroah-Hartman)
* Folded switchtec.c into switchtec.h (per Greg)
* Fixed a bunch of 32bit build warnings caught by the kbuild test robot
* Fixed some issues in the documentation so it has a proper
  reStructredText format (as noted by Jonathan Corbet)
* Fixed padding and sizes in the IOCTL structures as noticed by Emil
  Velikov and used pahole to verify their consistency across 32 and 64
  bit builds
* Reworked one of the IOCTL interfaces to be more future proof (per
  Emil).

Changes since RFC:

* Fixed incorrect use of the drive model as pointed out by Greg
  Kroah-Hartman
* Used devm functions as suggested by Keith Busch
* Added a handful of sysfs attributes to the switchtec class
* Added a handful of IOCTLs to the switchtec device
* A number of miscellaneous bug fixes

--

Hi,

This is a continuation of the RFC we posted lasted month [1] which
proposes a management driver for Microsemi's Switchtec line of PCI
switches. This hardware is still looking to be used in the Open
Compute Platform

To make this entirely clear: the Switchtec products are compliant
with the PCI specifications and are supported today with the standard
in-kernel driver. However, these devices also expose a management endpoint
on a separate PCI function address which can be used to perform some
advanced operations. This is a driver for that function. See the patch
for more information.

Since the RFC, we've made the changes requested by Greg Kroah-Hartman
and Keith Busch, and we've also fleshed out a number of features. We've
added a couple of IOCTLs and sysfs attributes which are documented in
the patch. Significant work has also been done on the userspace tool
which is available under a GPL license at [2]. We've also had testing
done by some of the interested parties.

We hope to see this work included in either 4.11 or 4.12 assuming a
smooth review process.

The patch is based off of the v4.10-rc6 release.

Thanks for your review,

Logan

[1] https://www.spinics.net/lists/linux-pci/msg56897.html
[2] https://github.com/sbates130272/switchtec-user

Logan Gunthorpe (4):
  MicroSemi Switchtec management interface driver
  switchtec: Add user interface documentation
  switchtec: Add sysfs attributes to the Switchtec driver
  switchtec: Add IOCTLs to the Switchtec driver

 Documentation/ABI/testing/sysfs-class-switchtec |   96 ++
 Documentation/ioctl/ioctl-number.txt|1 +
 Documentation/switchtec.txt |   80 ++
 MAINTAINERS |   11 +
 drivers/pci/Kconfig |1 +
 drivers/pci/Makefile|1 +
 drivers/pci/switch/Kconfig  |   13 +
 drivers/pci/switch/Makefile |1 +
 drivers/pci/switch/switchtec.c  | 1535 +++
 include/uapi/linux/switchtec_ioctl.h|  132 ++
 10 files changed, 1871 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

--
2.1.4


[PATCH v5 1/4] MicroSemi Switchtec management interface driver

2017-02-25 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already well
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint with a separate
PCI function address and class code. This endpoint enables some additional
functionality which includes:

 * Packet and Byte Counters
 * Switch Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
PCI driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls.

A userspace tool and library which utilizes this interface is available
at [1]. This tool takes inspiration (and borrows some code) from
nvme-cli [2]. The tool is largely complete at this time but additional
features may be added in the future.

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 MAINTAINERS|   8 +
 drivers/pci/Kconfig|   1 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/switch/Kconfig |  13 +
 drivers/pci/switch/Makefile|   1 +
 drivers/pci/switch/switchtec.c | 955 +
 6 files changed, 979 insertions(+)
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d137..a57686f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9506,6 +9506,14 @@ S:   Maintained
 F: Documentation/devicetree/bindings/pci/aardvark-pci.txt
 F: drivers/pci/host/pci-aardvark.c
 
+PCI DRIVER FOR MICROSEMI SWITCHTEC
+M: Kurt Schwemmer <kurt.schwem...@microsemi.com>
+M: Stephen Bates <stephen.ba...@microsemi.com>
+M: Logan Gunthorpe <log...@deltatee.com>
+L: linux-...@vger.kernel.org
+S: Maintained
+F: drivers/pci/switch/switchtec*
+
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
 L: linux-te...@vger.kernel.org
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 6555eb7..f72e8c5 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -133,3 +133,4 @@ config PCI_HYPERV
 
 source "drivers/pci/hotplug/Kconfig"
 source "drivers/pci/host/Kconfig"
+source "drivers/pci/switch/Kconfig"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 8db5079..15b46dd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG
 
 # PCI host controller drivers
 obj-y += host/
+obj-y += switch/
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
new file mode 100644
index 000..4c49648
--- /dev/null
+++ b/drivers/pci/switch/Kconfig
@@ -0,0 +1,13 @@
+menu "PCI switch controller drivers"
+   depends on PCI
+
+config PCI_SW_SWITCHTEC
+   tristate "MicroSemi Switchtec PCIe Switch Management Driver"
+   help
+Enables support for the management interface for the MicroSemi
+Switchtec series of PCIe switches. Supports userspace access
+to submit MRPC commands to the switch via /dev/switchtecX
+devices. See  for more
+information.
+
+endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
new file mode 100644
index 000..37d8cfb
--- /dev/null
+++ b/drivers/pci/switch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
new file mode 100644
index 000..4aaf89b
--- /dev/null
+++ b/drivers/pci/switch/switchtec.c
@@ -0,0 +1,955 @@
+/*
+ * Microsemi Switchtec(tm) PCIe Management Driver
+ * Copyright (c) 2017, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCIe Management Driver");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation"

[PATCH v5 4/4] switchtec: Add IOCTLs to the Switchtec driver

2017-02-25 Thread Logan Gunthorpe
This patch introduces a couple of special IOCTLs which are provided to:

* Inform userspace of firmware partition locations
* Pass event counts and allow userspace to wait on events
* Translate between PFF numbers used by the switch to
  port numbers.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/ioctl/ioctl-number.txt |   1 +
 Documentation/switchtec.txt  |  27 ++
 MAINTAINERS  |   1 +
 drivers/pci/switch/switchtec.c   | 467 +++
 include/uapi/linux/switchtec_ioctl.h | 132 ++
 5 files changed, 628 insertions(+)
 create mode 100644 include/uapi/linux/switchtec_ioctl.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 81c7f2b..032b33c 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -191,6 +191,7 @@ Code  Seq#(hex) Include FileComments
 'W'00-1F   linux/watchdog.hconflict!
 'W'00-1F   linux/wanrouter.h   conflict!   (pre 3.9)
 'W'00-3F   sound/asound.h  conflict!
+'W'40-5F   drivers/pci/switch/switchtec.c
 'X'all fs/xfs/xfs_fs.h conflict!
and fs/xfs/linux-2.6/xfs_ioctl32.h
and include/linux/falloc.h
diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
index 4bced4c..a0a9c7b 100644
--- a/Documentation/switchtec.txt
+++ b/Documentation/switchtec.txt
@@ -51,3 +51,30 @@ The char device has the following semantics:
 
 * The poll call will also be supported for userspace applications that
   need to do other things while waiting for the command to complete.
+
+The following IOCTLs are also supported by the device:
+
+* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
+  of partitions in the device.
+
+* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for
+  any specified partition in flash.
+
+* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps
+  indicating all uncleared events.
+
+* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags
+  for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct
+  with the event_id, index and flags set (index being the partition or PFF
+  number for non-global events). It returns whether the event has
+  occurred, the number of times and any event specific data. The flags
+  can be used to clear the count or enable and disable actions to
+  happen when the event occurs.
+  By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag,
+  you can set an event to trigger a poll command to return with
+  POLLPRI. In this way, userspace can wait for events to occur.
+
+* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert
+  between PCI Function Framework number (used by the event system)
+  and Switchtec Logic Port ID and Partition number (which is more
+  user friendly).
diff --git a/MAINTAINERS b/MAINTAINERS
index 6fed938..c1a9a30 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9515,6 +9515,7 @@ S:Maintained
 F: Documentation/switchtec.txt
 F: Documentation/ABI/testing/sysfs-class-switchtec
 F: drivers/pci/switch/switchtec*
+F: include/uapi/linux/switchtec_ioctl.h
 
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index 2e47d79..94384e7 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -13,6 +13,8 @@
  *
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -740,6 +742,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, 
poll_table *wait)
return ret;
 }
 
+static int ioctl_flash_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+
+   info.flash_length = ioread32(>flash_length);
+   info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS;
+
+   if (copy_to_user(uinfo, , sizeof(info)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info,
+struct partition_info __iomem *pi)
+{
+   info->address = ioread32(>address);
+   info->length = ioread32(>length);
+}
+
+static int ioctl_flash_part_info(struct switchtec_dev *stdev,
+   struct switchtec_ioctl_flash_part_info __user *uinfo)
+{
+   struct switchtec_ioctl_flash_part_info info = {0};
+   struct flash_info_regs __iomem *fi = stdev->mmio_flash_info;
+   u32 active_addr = -1;
+
+   if (copy

[PATCH v5 2/4] switchtec: Add user interface documentation

2017-02-25 Thread Logan Gunthorpe
This adds standard documentation for the sysfs switchtec attributes and
a RST formatted text file which documents the char device interface.
Jonathan Corbet has indicated he will move this to a new user-space
developer documentation book once it's created.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
Tested-by: Krishna Dhulipala <krish...@fb.com>
Reviewed-by: Wei Zhang <wzh...@fb.com>
Reviewed-by: Jens Axboe <ax...@fb.com>
---
 Documentation/switchtec.txt | 53 +
 MAINTAINERS |  1 +
 2 files changed, 54 insertions(+)
 create mode 100644 Documentation/switchtec.txt

diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
new file mode 100644
index 000..4bced4c
--- /dev/null
+++ b/Documentation/switchtec.txt
@@ -0,0 +1,53 @@
+
+Linux Switchtec Support
+
+
+Microsemi's "Switchtec" line of PCI switch devices is already
+supported by the kernel with standard PCI switch drivers. However, the
+Switchtec device advertises a special management endpoint which
+enables some additional functionality. This includes:
+
+* Packet and Byte Counters
+* Firmware Upgrades
+* Event and Error logs
+* Querying port link status
+* Custom user firmware commands
+
+The switchtec kernel module implements this functionality.
+
+
+Interface
+=
+
+The primary means of communicating with the Switchtec management firmware is
+through the Memory-mapped Remote Procedure Call (MRPC) interface.
+Commands are submitted to the interface with a 4-byte command
+identifier and up to 1KB of command specific data. The firmware will
+respond with a 4 bytes return code and up to 1KB of command specific
+data. The interface only processes a single command at a time.
+
+
+Userspace Interface
+===
+
+The MRPC interface will be exposed to userspace through a simple char
+device: /dev/switchtec#, one for each management endpoint in the system.
+
+The char device has the following semantics:
+
+* A write must consist of at least 4 bytes and no more than 1028 bytes.
+  The first four bytes will be interpreted as the command to run and
+  the remainder will be used as the input data. A write will send the
+  command to the firmware to begin processing.
+
+* Each write must be followed by exactly one read. Any double write will
+  produce an error and any read that doesn't follow a write will
+  produce an error.
+
+* A read will block until the firmware completes the command and return
+  the four bytes of status plus up to 1024 bytes of output data. (The
+  length will be specified by the size parameter of the read call --
+  reading less than 4 bytes will produce an error.
+
+* The poll call will also be supported for userspace applications that
+  need to do other things while waiting for the command to complete.
diff --git a/MAINTAINERS b/MAINTAINERS
index a57686f..aa702b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com>
 M: Logan Gunthorpe <log...@deltatee.com>
 L: linux-...@vger.kernel.org
 S: Maintained
+F: Documentation/switchtec.txt
 F: drivers/pci/switch/switchtec*
 
 PCI DRIVER FOR NVIDIA TEGRA
-- 
2.1.4



Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-02-28 Thread Logan Gunthorpe

> This driver doesn't have anything to do with the PCI core, other than
> using the pci_register_driver() interface (just like all other drivers
> for PCI-connected devices), so drivers/pci doesn't really feel like
> the right place for it.  Putting it in drivers/pci leads to the sort
> of confusion you mentioned above ("To make this entirely clear ...").
> 
> Would drivers/perf or drivers/misc/switchtec/ be possible places for
> it?

I made a similar argument when we made the decision of where to put the
code. In the end, the device _is_ a PCI Switch and someone going through
menuconfig or the source tree would probably look there first.

As for drivers/perf, our device does a fair bit more than performance
counters so it doesn't seem like it really fits in there. drivers/misc
just seems like a dumping ground which we'd prefer not to contribute to.
We also considered drivers/char (seeing it exposes a char device), but
that also seems like a dumping ground with stuff that belongs and other
stuff that just ended up stuck between the cracks.

If you still feel strongly about this we can move it into misc, but I
think from an organizational perspective pci/switch makes a bit more sense.

In any case, I also wish we could have had this discussion 3 months ago
when we posted the RFC and not when I have people pushing to get this
merged.

Thanks,

Logan


Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 05:32 PM, Bjorn Helgaas wrote:
> It's not a perfect fit in drivers/pci because it's not bus
> infrastructure and I don't want to be the default maintainer of it,
> but I agree there's not really an alternative that's clearly better,
> so let's leave it where it is for now.

Sounds good, thanks.

It would be nice if it could stay where it is for organization but go
through other maintainers trees (though I'm not sure who). I understand
why you wouldn't want to take on the maintenance of it. Hopefully,
myself and the other Microsemi developers will be able to do the bulk of
the maintenance work for the driver.

Thanks,

Logan


Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 04:58 PM, Jason Gunthorpe wrote:
> On Wed, Mar 01, 2017 at 03:49:04PM -0700, Logan Gunthorpe wrote:
> 
>> Seems to me like an elegant solution would be to implement a 'cdev_kill'
>> function which could kill all the processes using a cdev. Thus, during
>> an unbind, a driver could call it and be sure that there are no users
>> left and it can safely allow the devres unwind to continue. Then no
>> difficult and racy 'alive' flags would be necessary and it would be much
>> easier on drivers.
> 
> That could help, but this would mean cdev would have to insert a shim
> to grab locks around the various file ops.

Hmm, I was hoping something more along the lines of actually killing the
processes instead of just shimming away fops.

> AFAIK TPM is correct and has been robustly tested now. We have a 'vtpm'
> driver that agressively uses hot-unplug.

Ah, thanks for the explanation of how that works. I didn't notice the
semaphore usage.

Switchtec is a bit more tricky because a) there's no upper level driver
to handle things and b) userspace may be inside a wait_for_completion
(via read or poll) that needs to be completed. If a so called
'cdev_kill' could actually just kill these processes it would be a bit
easier.

Currently, in the Switchtec code, there's a timeout if the interrupt
doesn't fire (which occurs if the pci device has been torn down) and the
code will check an alive bit (under mutex protection) and error out if
it's not alive.

Because of how poll works, I don't see how I can just hold a semaphore
inside every fops call like the tpm code does.

Logan


Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 05:23 PM, Logan Gunthorpe wrote:
> Because of how poll works, I don't see how I can just hold a semaphore
> inside every fops call like the tpm code does.

On 01/03/17 04:58 PM, Jason Gunthorpe wrote:
> Both infiniband and TPM have the 'detach' flavour of unplug where the
> user is not forced to close their open cdevs. For simpler cases you
> can just wait for all cdevs to close with a simple rwsem, much like
> sysfs does already during device_del.

Though, after reading your email again, a hard fence on close sounds
like a good idea. Tomorrow I'll post a v6 that uses that approach.


Thanks,

Logan


Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 03:26 PM, Keith Busch wrote:
> I think this is from using the managed device resource API to request the
> irq actions. The scope of the resource used to be tied to the pci_dev's
> dev, but now it's the new switchec class dev, which has a different
> lifetime while open references exist, so it's not releasing the irq's.

The scope of the IRQ was originally tied to the pci_dev. Then in v4 I
tied it to the switchtec device in order to try and keep using the pci
device after unbind. This didn't work, so I switched it back to using
the pci_dev. (This seems to be the way most drivers work anyway.)


> One thing about the BUG_ON that is confusing me is how it's getting
> to free_msi_irq's BUG in v4 or v5. I don't see any part releasing the
> allocated ones. Maybe the devres API is harder to use than having the
> driver manage all the resources...

free_msi_irqs seems to be called via pci_disable_device in pcim_release
which devres will call during release of the PCI device and before all
the references to the pci_dev are freed (I tried adding an extra
get_device which gets put in the child devices release -- this didn't work):

 [ 1079.845616] Call Trace:
 [ 1079.845652]  ? pcim_release+0x35/0x96
 [ 1079.845691]  ? release_nodes+0x15b/0x17c
 [ 1079.845730]  ? device_release_driver_internal+0x12d/0x1cb
 [ 1079.845771]  ? unbind_store+0x59/0x89
 [ 1079.845809]  ? kernfs_fop_write+0xe7/0x129
 [ 1079.845847]  ? __vfs_write+0x1c/0xa2
 [ 1079.845885]  ? kmem_cache_alloc+0xc5/0x131
 [ 1079.845923]  ? fput+0xd/0x7d
 [ 1079.845958]  ? filp_close+0x5a/0x61
 [ 1079.845993]  ? vfs_write+0xa2/0xe4
 [ 1079.846028]  ? SyS_write+0x48/0x73
 [ 1079.846066]  ? entry_SYSCALL_64_fastpath+0x13/0x94

v5 is correct because it registers the irqs against the pci_dev (with
devm_request_irq) and thus they get freed in time as part of the devres
unwind.

Logan




Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 02:41 PM, Bjorn Helgaas wrote:
> I don't think this is indicating a bug in the PCI core (although I do
> think a BUG_ON() here is an excessive response).  I think it's an
> indication that the driver didn't disconnect its ISR.  Without more
> details of the failure it's hard to tell if the BUG_ON is a symptom of
> a problem in the driver or what.

Yes, my assumption was that when you force an unbind on the PCI core,
it's designed to stop using the PCI device right away even if there are
users using it. Thus it becomes the drivers responsibility to handle
this situation.

> An "alive" flag feels racy, and I can't tell if it's really the best
> way to deal with this, or if it's just avoiding the issue.  There must
> be other drivers with the same cleanup issue -- do they handle it the
> same way?

I haven't done a comprehensive search, but it's very common for people
to use (and this is what I've adopted again in v5):

devm_request_irq(>dev, ...)

In this way, the IRQs are released with the pci_dev (or often platform)
and thus the BUG_ON never hits. However, it means any user space program
waiting on an IRQ (like via a cdev call) will hang unless handled with
other means. Exactly what those means are seems driver specific and not
always obvious. I wouldn't be surprised if a lot of drivers get this
aspect wrong.

A couple examples I've looked at:

1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or
anything. So I don't know if it's racy or perhaps correct for other reasons.

2) drivers/char/hw_random has a drop_current_rng that looks like it
could easily be racy with the get_current_rng in the userspace flow.

3) A couple of drivers drivers/char/tpm doesn't seem to have any
protection at all and appears like they would continue to use io
operations even after the they may get unmapped because the char device
persists.

So I'm not sure where you'd find a driver that does it correctly and in
a simpler way..

Another thing: based on comments in [1], a lot of people don't seem to
realize that cdev instances can persist long after cdev_del so it's
probably very common for drivers to get this wrong.

Logan

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html



>>   To solve this, we've moved the pci release code back into the
>>   unregister function and reintroduced an alive flag. This time,
>>   however, the alive flag is protected by mrpc_mutex and we're very
>>   careful about what happens to devices still in use (they should
>>   all be released through the timeout path and an ENODEV error
>>   returned to userspace; while new commands are blocked with the
>>   same error).


Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe
Hey,

Seems to me like an elegant solution would be to implement a 'cdev_kill'
function which could kill all the processes using a cdev. Thus, during
an unbind, a driver could call it and be sure that there are no users
left and it can safely allow the devres unwind to continue. Then no
difficult and racy 'alive' flags would be necessary and it would be much
easier on drivers.

However, I don't think any such thing exists at the moment and it's not
likely to be done in the near term. I'm reasonably confident in the
correctness of v5 of my driver (especially when compared to other
drivers) and unless someone can describe how it's wrong or a better
solution I'd rather see it merged as is. If and when a better approach
arrives I'd happily patch it to improve the situation.

Logan

On 01/03/17 03:24 PM, Logan Gunthorpe wrote:
> 
> 
> On 01/03/17 02:41 PM, Bjorn Helgaas wrote:
>> I don't think this is indicating a bug in the PCI core (although I do
>> think a BUG_ON() here is an excessive response).  I think it's an
>> indication that the driver didn't disconnect its ISR.  Without more
>> details of the failure it's hard to tell if the BUG_ON is a symptom of
>> a problem in the driver or what.
> 
> Yes, my assumption was that when you force an unbind on the PCI core,
> it's designed to stop using the PCI device right away even if there are
> users using it. Thus it becomes the drivers responsibility to handle
> this situation.
> 
>> An "alive" flag feels racy, and I can't tell if it's really the best
>> way to deal with this, or if it's just avoiding the issue.  There must
>> be other drivers with the same cleanup issue -- do they handle it the
>> same way?
> 
> I haven't done a comprehensive search, but it's very common for people
> to use (and this is what I've adopted again in v5):
> 
> devm_request_irq(>dev, ...)
> 
> In this way, the IRQs are released with the pci_dev (or often platform)
> and thus the BUG_ON never hits. However, it means any user space program
> waiting on an IRQ (like via a cdev call) will hang unless handled with
> other means. Exactly what those means are seems driver specific and not
> always obvious. I wouldn't be surprised if a lot of drivers get this
> aspect wrong.
> 
> A couple examples I've looked at:
> 
> 1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or
> anything. So I don't know if it's racy or perhaps correct for other reasons.
> 
> 2) drivers/char/hw_random has a drop_current_rng that looks like it
> could easily be racy with the get_current_rng in the userspace flow.
> 
> 3) A couple of drivers drivers/char/tpm doesn't seem to have any
> protection at all and appears like they would continue to use io
> operations even after the they may get unmapped because the char device
> persists.
> 
> So I'm not sure where you'd find a driver that does it correctly and in
> a simpler way..
> 
> Another thing: based on comments in [1], a lot of people don't seem to
> realize that cdev instances can persist long after cdev_del so it's
> probably very common for drivers to get this wrong.
> 
> Logan
> 
> [1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html




Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

2017-03-01 Thread Logan Gunthorpe


On 01/03/17 03:59 PM, Keith Busch wrote:
> Okay, I see. Was mistakenliy looking at v4. The v5 looks right.

Great! Thanks for reviewing that for me.

Logan



Re: [PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.

2016-10-28 Thread Logan Gunthorpe
Hi Christoph,

Thanks so much for the detailed review of the code! Even though by the
sounds of things we will be moving to device dax and most of this is
moot. Still, it's great to get some feedback and learn a few things.

I've given some responses below.

On 28/10/16 12:45 AM, Christoph Hellwig wrote:
>> + * This driver is heavily based on drivers/block/pmem.c.
>> + * Copyright (c) 2014, Intel Corporation.
>> + * Copyright (C) 2007 Nick Piggin
>> + * Copyright (C) 2007 Novell Inc.
> 
> Is there anything left of it actually?  I didn't spot anything
> obvious.  Nevermind that we don't have a file with that name anymore :)

Yes, actually there's still a lot of similarities with the current
pmem.c. Though, yes, the path was on oversight. Some of this code is
getting pretty old (it started from an out-of-tree version of pmem.c)
and we've tried our best to track as many of the changes to the pmem.c
as possible. This proved to be difficult. Note: this is now the nvdimm
pmem and not the dax pmem (drivers/nvdimm/pmem.c)

>> +  /*
>> +   * We can only access the iopmem device with full 32-bit word
>> +   * accesses which cannot be gaurantee'd by the regular memcpy
>> +   */
> 
> Odd comment formatting. 

Oops. I'm surprised check_patch didn't pick up on that.

> 
>> +static void memcpy_from_iopmem(void *dst, const void *src, size_t sz)
>> +{
>> +u64 *wdst = dst;
>> +const u64 *wsrc = src;
>> +u64 tmp;
>> +
>> +while (sz >= sizeof(*wdst)) {
>> +*wdst++ = *wsrc++;
>> +sz -= sizeof(*wdst);
>> +}
>> +
>> +if (!sz)
>> +return;
>> +
>> +tmp = *wsrc;
>> +memcpy(wdst, , sz);
>> +}
> 
> And then we dod a memcpy here anyway.  And no volatile whatsover, so
> the compiler could do anything to it.  I defintively feel a bit uneasy
> about having this in the driver as well.  Can we define the exact
> semantics for this and define it by the system, possibly in an arch
> specific way?

Yeah, you're right. We should have reviewed this function a bit more.
Anyway, I'd be interested in learning a better approach to forcing a
copy from a mapped BAR with larger widths.


>> +static void iopmem_do_bvec(struct iopmem_device *iopmem, struct page *page,
>> +   unsigned int len, unsigned int off, bool is_write,
>> +   sector_t sector)
>> +{
>> +phys_addr_t iopmem_off = sector * 512;
>> +void *iopmem_addr = iopmem->virt_addr + iopmem_off;
>> +
>> +if (!is_write) {
>> +read_iopmem(page, off, iopmem_addr, len);
>> +flush_dcache_page(page);
>> +} else {
>> +flush_dcache_page(page);
>> +write_iopmem(iopmem_addr, page, off, len);
>> +}
> 
> How about moving the  address and offset calculation as well as the
> cache flushing into read_iopmem/write_iopmem and removing this function?

Could do. This was copied from the existing pmem.c and once the bad_pmem
stuff was stripped out this function became relatively simple.


> 
>> +static blk_qc_t iopmem_make_request(struct request_queue *q, struct bio 
>> *bio)
>> +{
>> +struct iopmem_device *iopmem = q->queuedata;
>> +struct bio_vec bvec;
>> +struct bvec_iter iter;
>> +
>> +bio_for_each_segment(bvec, bio, iter) {
>> +iopmem_do_bvec(iopmem, bvec.bv_page, bvec.bv_len,
>> +bvec.bv_offset, op_is_write(bio_op(bio)),
>> +iter.bi_sector);
> 
> op_is_write just checks the data direction.  I'd feel much more
> comfortable with a switch on the op, e.g.

That makes sense. This was also copied from pmem.c, so this same change
may make sense there too.


>> +static long iopmem_direct_access(struct block_device *bdev, sector_t sector,
>> +   void **kaddr, pfn_t *pfn, long size)
>> +{
>> +struct iopmem_device *iopmem = bdev->bd_queue->queuedata;
>> +resource_size_t offset = sector * 512;
>> +
>> +if (!iopmem)
>> +return -ENODEV;
> 
> I don't think this can ever happen, can it?

Yes, I think now that's the case. This is probably a holdover from a
previous version.

> Just use ida_simple_get/ida_simple_remove instead to take care
> of the locking and preloading, and get rid of these two functions.

Thanks, noted. That would be much better. I never found a simple example
of that when I was looking, though I expected there should have been.

> 
>> +static int iopmem_attach_disk(struct iopmem_device *iopmem)
>> +{
>> +struct gendisk *disk;
>> +int nid = dev_to_node(iopmem->dev);
>> +struct request_queue *q = iopmem->queue;
>> +
>> +blk_queue_write_cache(q, true, true);
> 
> You don't handle flush commands or the fua bit in make_request, so
> this setting seems wrong.

Yup, ok. I'm afraid this is a case of copying without complete
comprehension.

> 
>> +int err = 0;
>> +int nid = dev_to_node(>dev);
>> +
>> +if (pci_enable_device_mem(pdev) < 0) {
> 
> propagate the actual error code, please.

Hmm, 

Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Logan Gunthorpe


On 25/11/16 06:06 AM, Christian König wrote:
> Well Serguei send me a couple of documents about QPI when we started to
> discuss this internally as well and that's exactly one of the cases I
> had in mind when writing this.
> 
> If I understood it correctly for such systems P2P is technical possible,
> but not necessary a good idea. Usually it is faster to just use a
> bouncing buffer when the peers are a bit "father" apart.
> 
> That this problem is solved on newer hardware is good, but doesn't helps
> us at all if we at want to support at least systems from the last five
> years or so.

Well we have been testing with Sandy Bridge, I think the problem was
supposed to be fixed in Ivy but we never tested it so I can't say what
the performance turned out to be. Ivy is nearly 5 years old. I expect
this is something that will be improved more and more with subsequent
generations.

A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Logan Gunthorpe


On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>> works at the page table level. We do not need special handling for
>>> RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
> 
> How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB
hardware that doesn't support ODP or other hardware  that doesn't do
similar things (like an NVMe drive).

It makes sense for hardware that supports ODP to allow MRs to not pin
the underlying memory and provide for migrations that the hardware can
follow. But most DMA engines will require the memory to be pinned and
any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

>> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
>> really the same option. iopmem is really just one way to get BAR
>> addresses to user-space while inside the kernel it's ZONE_DEVICE.
> 
> Seems fine for RDMA?

Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe bars to userspace to be used as MRs and
such.


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Logan Gunthorpe
Hey,

On 22/11/16 11:59 AM, Serguei Sagalovitch wrote:
> -  How well we will be able to handle case when we need to "move"/"evict"
>memory/data to the new location so CPU pointer should point to the
> new physical location/address
> (and may be not in PCI device memory at all)?

IMO any memory that has been registered for a P2P transaction should be
locked from being evicted. So if there's a get_user_pages call it needs
to be pinned until the put_page. The main issue being with the RDMA
case: handling an eviction when a chunk of memory has been registered as
an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application. This seems like a
lot of burden to place on applications and may be subject to timing
issues. Either that or all RDMA applications need to be written with the
assumption that their target memory could go away at any time.

More generally, if you tell one PCI device to do a DMA transfer to
another PCI device's BAR space, and the target memory gets evicted then
DMA transaction needs to be aborted which means every driver doing the
transfer would need special support for this. If the memory can be
relied on to not be evicted than existing drivers should work unmodified
(ie O_DIRECT to/from an NVMe card would just work).

I feel the better approach is to pin memory subject to P2P transactions
as is typically done with DMA transfers to main memory.

Logan



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Logan Gunthorpe


On 23/11/16 01:33 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
> 
>>We do not want to have "highly" dynamic translation due to
>>performance cost.  We need to support "overcommit" but would
>>like to minimize impact.  To support RDMA MRs for GPU/VRAM/PCIe
>>device memory (which is must) we need either globally force
>>pinning for the scope of "get_user_pages() / "put_pages" or have
>>special handling for RDMA MRs and similar cases.
> 
> As I said, there is no possible special handling. Standard IB hardware
> does not support changing the DMA address once a MR is created. Forget
> about doing that.

Yeah, that's essentially the point I was trying to make. Not to mention
all the other unrelated hardware that can't DMA to an address that might
disappear mid-transfer.

> Only ODP hardware allows changing the DMA address on the fly, and it
> works at the page table level. We do not need special handling for
> RDMA.

I am aware of ODP but, noted by others, it doesn't provide a general
solution to the points above.

> Like I said, this is the direction the industry seems to be moving in,
> so any solution here should focus on VMAs/page tables as the way to link
> the peer-peer devices.

Yes, this was the appeal to us of using ZONE_DEVICE.

> To me this means at least items #1 and #3 should be removed from
> Alexander's list.

It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
really the same option. iopmem is really just one way to get BAR
addresses to user-space while inside the kernel it's ZONE_DEVICE.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Logan Gunthorpe


On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>> user-space and the GPU not to migrate it. If they do, the MR gets
>> destroyed immediately.
> 
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? I would not support that
> upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping. It just meant the application would fail or need complicated
logic to detect this and redo just about everything. And given that it
was a reasonably rare occurrence during development it probably means
not a lot of applications will be developed to handle it and most would
end up being randomly broken in environments with memory pressure.

Logan



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Logan Gunthorpe

On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
> As soon as PeerDirect mapping is called then GPU must not "move" the
> such memory.  It is by PeerDirect design. It is similar how it is works
> with system memory and RDMA MR: when "get_user_pages" is called then the
> memory is pinned.

We haven't touch this in a long time and perhaps it changed, but there
definitely was a call back in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

Logan



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Logan Gunthorpe
Hey,

On 24/11/16 02:45 AM, Christian König wrote:
> E.g. it can happen that PCI device A exports it's BAR using ZONE_DEVICE.
> Not PCI device B (a SATA device) can directly read/write to it because
> it is on the same bus segment, but PCI device C (a network card for
> example) can't because it is on a different bus segment and the bridge
> can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)

It may just be older hardware that has this issue. I expect that as long
as a failed transfer can be handled gracefully by the initiator I don't
see a need to predetermine whether a device can see another devices memory.


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Logan Gunthorpe


On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
>  - Coherent long lived page table mirroring (RDMA ODP MR)
>  - Non-coherent long lived page table mirroring (RDMA MR)
>  - Short lived DMA mapping (everything else)
> 
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pinned the pages (despite being long lived)
that's why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP like solutions get implemented.

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure which application gets what
memeroy.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think your transfer to out from under you. Though I
understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backed memory would be
non-volatile which adds some of this complexity. If the memory were
volatile the kernel would just need to do some relatively straight
forward allocation to user-space when asked. For example, with NVMe, the
kernel could give chunks of the CMB buffer to userspace via an mmap call
to /dev/nvmeX. Though I think there's been some push back against things
like that as well.

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do P2P
transfers.

I don't think there was a hard objection to uncachable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Logan Gunthorpe
Hey,

On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
>>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
>>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
>>> device-dax instance under the nvme device, or if you already have the nvme
>>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>>
>> Personally I think mapping the dax resource in the sysfs tree is a nice
>> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> 
> It is still not at all clear to me what userpsace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

The flow is pretty simple. For example to write to NVMe from an RDMA device:

1) Obtain a chunk of the CMB to use as a buffer(either by mmaping
/dev/nvmx, the device dax char device or through a block layer interface
(which sounds like a good suggestion from Christoph, but I'm not really
sure how it would look).

2) Create an MR with the buffer and use an RDMA function to fill it with
data from a remote host. This will cause the RDMA hardware to write
directly to the memory in the NVMe card.

3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
When the address reaches hardware the NVMe will recognize it as local
memory and copy it directly there.

Thus we are able to transfer data to any file on an NVMe device without
going through system memory. This has benefits on systems with lots of
activity in system memory but step 3 is likely to be slowish due to the
need to pin/unpin the memory for every transaction.

Logan



[RFC 0/1] New PCI Switch Management Driver

2016-12-17 Thread Logan Gunthorpe
Hi,

[Appologies: this is a resend for some people. Due to a configuration
error the original email was rejected by the mailing lists. I hope
this one makes it!]

We're looking to get some initial feedback on a new driver for
a line of PCIe switches produced and produced and sold by Microsemi.
The goal is to get the process moving to get this code included in
upstream hopefully for 4.11. Facebook is currently gearing up to
use this hardware in its Open Compute Platform and is pushing to
have this driver in the upstream kernel.

The following patch briefly describes the hardware and provides
the first draft of driver code. Currently, the driver works and
has been tested but is not feature complete. Thus, we are not looking
to get it merged immediately. However we would like some early review,
specifically on the interfaces and core concepts so that we don't
do a lot of work down a path the community would reject. Barring any
objections to this RFC, we will flesh out all the features
and provide a completed patch for inclusion in the coming weeks.

Work on a userspace tool, that utilizes this driver, is also being
done at [1]. The tool is currently also a bit of a skeleton and
will be fleshed out assuming there are no serious objections to our
userspace interface. In the end, the tool will be released with a
GPL license.

The patch is based off of the v4.9 release.

Thanks for your review,

Logan

[1] https://github.com/sbates130272/switchtec-user

Logan Gunthorpe (1):
  MicroSemi Switchtec management interface driver

 Documentation/switchtec.txt|  54 +++
 MAINTAINERS|   9 +
 drivers/pci/Kconfig|   1 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/switch/Kconfig |  13 +
 drivers/pci/switch/Makefile|   1 +
 drivers/pci/switch/switchtec.c | 824 +
 drivers/pci/switch/switchtec.h | 119 ++
 8 files changed, 1022 insertions(+)
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 drivers/pci/switch/switchtec.h

--
2.1.4


[RFC 1/1] MicroSemi Switchtec management interface driver

2016-12-17 Thread Logan Gunthorpe
Microsemi's "Switchtec" line of PCI switch devices is already
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint which
enables some additional functionality. This includes:

 * Packet and Byte Counters
 * Firmware Upgrades
 * Event and Error logs
 * Querying port link status
 * Custom user firmware commands

This patch introduces the switchtec kernel module which provides
pci driver that exposes a char device. The char device provides
userspace access to this interface through read, write and (optionally)
poll calls. Currently no ioctls have been implemented but a couple
may be added in a later revision.

A short text file is provided which documents the switchtec driver
and outlines the semantics of using the char device.

A WIP userspace tool which utilizes this interface is available
at [1]. This tool takes
inspiration (and borrows some code) from nvme-cli [2].

[1] https://github.com/sbates130272/switchtec-user
[2] https://github.com/linux-nvme/nvme-cli

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com>
---
 Documentation/switchtec.txt|  54 +++
 MAINTAINERS|   9 +
 drivers/pci/Kconfig|   1 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/switch/Kconfig |  13 +
 drivers/pci/switch/Makefile|   1 +
 drivers/pci/switch/switchtec.c | 824 +
 drivers/pci/switch/switchtec.h | 119 ++
 8 files changed, 1022 insertions(+)
 create mode 100644 Documentation/switchtec.txt
 create mode 100644 drivers/pci/switch/Kconfig
 create mode 100644 drivers/pci/switch/Makefile
 create mode 100644 drivers/pci/switch/switchtec.c
 create mode 100644 drivers/pci/switch/switchtec.h

diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt
new file mode 100644
index 000..04657ce
--- /dev/null
+++ b/Documentation/switchtec.txt
@@ -0,0 +1,54 @@
+
+Linux Switchtec Support
+
+
+Microsemi's "Switchtec" line of PCI switch devices is already
+supported by the kernel with standard PCI switch drivers. However, the
+Switchtec device advertises a special management endpoint which
+enables some additional functionality. This includes:
+
+ * Packet and Byte Counters
+ * Firmware Upgrades
+ * Event and Error logs
+ * Querying port link status
+ * Custom user firmware commands
+
+The switchtec kernel module implements this functionality.
+
+
+
+Interface
+=
+
+The primary means of communicating with the Switchtec management firmware is
+through the Memory-mapped Remote Procedure Call (MRPC) interface.
+Commands are submitted to the interface with a 4-byte command
+identifier and up to 1KB of command specific data. The firmware will
+respond with a 4 bytes return code and up to 1KB of command specific
+data. The interface only processes a single command at a time.
+
+
+Userspace Interface
+===
+
+The MRPC interface will be exposed to userspace through a simple char
+device: /dev/switchtec#, one for each management endpoint in the system.
+
+The char device has the following semantics:
+
+ * A write must consist of at least 4 bytes and no more than 1028 bytes.
+   The first four bytes will be interpreted as the command to run and
+   the remainder will be used as the input data. A write will send the
+   command to the firmware to begin processing.
+
+ * Each write must be followed by exactly one read. Any double write will
+   produce an error and any read that doesn't follow a write will
+   produce an error.
+
+ * A read will block until the firmware completes the command and return
+   the four bytes of status plus up to 1024 bytes of output data. (The
+   length will be specified by the size parameter of the read call --
+   reading less than 4 bytes will produce an error.
+
+ * The poll call will also be supported for userspace applications that
+   need to do other things while waiting for the command to complete.
diff --git a/MAINTAINERS b/MAINTAINERS
index 63cefa6..1e21505 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9288,6 +9288,15 @@ S:   Maintained
 F: Documentation/devicetree/bindings/pci/aardvark-pci.txt
 F: drivers/pci/host/pci-aardvark.c
 
+PCI DRIVER FOR MICROSEMI SWITCHTEC
+M: Kurt Schwemmer <kurt.schwem...@microsemi.com>
+M: Stephen Bates <stephen.ba...@microsemi.com>
+M: Logan Gunthorpe <log...@deltatee.com>
+L: linux-...@vger.kernel.org
+S: Maintained
+F: Documentation/switchtec.txt
+F: drivers/pci/switch/switchtec*
+
 PCI DRIVER FOR NVIDIA TEGRA
 M: Thierry Reding <thierry.red...@gmail.com>
 L: linux-te...@vger.kernel.org
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 6555eb7..f72e8c5 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -133,3 +133,4 @@ config PCI_HYPERV
 
 source "drivers/pci/hotplug

[PATCH 2/2] ntb_hw_intel: Style fixes: open code macros that just obfuscate code

2017-01-10 Thread Logan Gunthorpe
As per a comments in [1] by Greg Kroah-Hartman, the ndev_* macros should
be cleaned up. This makes it more clear what's actually going on when
reading the code.

[1] http://www.spinics.net/lists/linux-pci/msg56904.html

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 drivers/ntb/hw/intel/ntb_hw_intel.c | 145 ++--
 drivers/ntb/hw/intel/ntb_hw_intel.h |   3 -
 2 files changed, 73 insertions(+), 75 deletions(-)

diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c 
b/drivers/ntb/hw/intel/ntb_hw_intel.c
index 7310a26..3c8ef1d 100644
--- a/drivers/ntb/hw/intel/ntb_hw_intel.c
+++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
@@ -254,12 +254,12 @@ static inline int ndev_db_addr(struct intel_ntb_dev *ndev,
 
if (db_addr) {
*db_addr = reg_addr + reg;
-   dev_dbg(ndev_dev(ndev), "Peer db addr %llx\n", *db_addr);
+   dev_dbg(>ntb.pdev->dev, "Peer db addr %llx\n", *db_addr);
}
 
if (db_size) {
*db_size = ndev->reg->db_size;
-   dev_dbg(ndev_dev(ndev), "Peer db size %llx\n", *db_size);
+   dev_dbg(>ntb.pdev->dev, "Peer db size %llx\n", *db_size);
}
 
return 0;
@@ -352,7 +352,8 @@ static inline int ndev_spad_addr(struct intel_ntb_dev 
*ndev, int idx,
 
if (spad_addr) {
*spad_addr = reg_addr + reg + (idx << 2);
-   dev_dbg(ndev_dev(ndev), "Peer spad addr %llx\n", *spad_addr);
+   dev_dbg(>ntb.pdev->dev, "Peer spad addr %llx\n",
+   *spad_addr);
}
 
return 0;
@@ -390,7 +391,7 @@ static irqreturn_t ndev_interrupt(struct intel_ntb_dev 
*ndev, int vec)
 
vec_mask = ndev_vec_mask(ndev, vec);
 
-   dev_dbg(ndev_dev(ndev), "vec %d vec_mask %llx\n", vec, vec_mask);
+   dev_dbg(>ntb.pdev->dev, "vec %d vec_mask %llx\n", vec, vec_mask);
 
ndev->last_ts = jiffies;
 
@@ -416,7 +417,7 @@ static irqreturn_t ndev_irq_isr(int irq, void *dev)
 {
struct intel_ntb_dev *ndev = dev;
 
-   return ndev_interrupt(ndev, irq - ndev_pdev(ndev)->irq);
+   return ndev_interrupt(ndev, irq - ndev->ntb.pdev->irq);
 }
 
 static int ndev_init_isr(struct intel_ntb_dev *ndev,
@@ -426,7 +427,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev,
struct pci_dev *pdev;
int rc, i, msix_count, node;
 
-   pdev = ndev_pdev(ndev);
+   pdev = ndev->ntb.pdev;
 
node = dev_to_node(>dev);
 
@@ -465,7 +466,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev,
goto err_msix_request;
}
 
-   dev_dbg(ndev_dev(ndev), "Using msix interrupts\n");
+   dev_dbg(>dev, "Using msix interrupts\n");
ndev->db_vec_count = msix_count;
ndev->db_vec_shift = msix_shift;
return 0;
@@ -493,7 +494,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev,
if (rc)
goto err_msi_request;
 
-   dev_dbg(ndev_dev(ndev), "Using msi interrupts\n");
+   dev_dbg(>dev, "Using msi interrupts\n");
ndev->db_vec_count = 1;
ndev->db_vec_shift = total_shift;
return 0;
@@ -511,7 +512,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev,
if (rc)
goto err_intx_request;
 
-   dev_dbg(ndev_dev(ndev), "Using intx interrupts\n");
+   dev_dbg(>dev, "Using intx interrupts\n");
ndev->db_vec_count = 1;
ndev->db_vec_shift = total_shift;
return 0;
@@ -525,7 +526,7 @@ static void ndev_deinit_isr(struct intel_ntb_dev *ndev)
struct pci_dev *pdev;
int i;
 
-   pdev = ndev_pdev(ndev);
+   pdev = ndev->ntb.pdev;
 
/* Mask all doorbell interrupts */
ndev->db_mask = ndev->db_valid_mask;
@@ -559,7 +560,7 @@ static ssize_t ndev_debugfs_read(struct file *filp, char 
__user *ubuf,
union { u64 v64; u32 v32; u16 v16; u8 v8; } u;
 
ndev = filp->private_data;
-   pdev = ndev_pdev(ndev);
+   pdev = ndev->ntb.pdev;
mmio = ndev->self_mmio;
 
buf_size = min(count, 0x800ul);
@@ -820,7 +821,8 @@ static void ndev_init_debugfs(struct intel_ntb_dev *ndev)
ndev->debugfs_info = NULL;
} else {
ndev->debugfs_dir =
-   debugfs_create_dir(ndev_name(ndev), debugfs_dir);
+   debugfs_create_dir(pci_name(ndev->ntb.pdev),
+  debugfs_dir);
if (!ndev->debugfs_dir)
ndev->debugfs_info = NULL;
else
@@ -1007,13 +1009,13 @@ static int intel_ntb_link_enable(struct ntb_dev *ntb,
if (ndev->ntb.topo == NTB_TOPO_SEC)
return -EINVAL;
 
-  

[PATCH 1/2] ntb_hw_amd: Style fixes: open code macros that just obfuscate code

2017-01-10 Thread Logan Gunthorpe
As per a comments in [1] by Greg Kroah-Hartman, the ndev_* macros should
be cleaned up. This makes it more clear what's actually going on when
reading the code.

[1] http://www.spinics.net/lists/linux-pci/msg56904.html

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 drivers/ntb/hw/amd/ntb_hw_amd.c | 59 ++---
 drivers/ntb/hw/amd/ntb_hw_amd.h |  3 ---
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
index 6ccba0d..85a9a4f 100644
--- a/drivers/ntb/hw/amd/ntb_hw_amd.c
+++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
@@ -98,10 +98,10 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int 
idx,
return bar;
 
if (base)
-   *base = pci_resource_start(ndev->ntb.pdev, bar);
+   *base = pci_resource_start(ntb->pdev, bar);
 
if (size)
-   *size = pci_resource_len(ndev->ntb.pdev, bar);
+   *size = pci_resource_len(ntb->pdev, bar);
 
if (align)
*align = SZ_4K;
@@ -126,7 +126,7 @@ static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int 
idx,
if (bar < 0)
return bar;
 
-   mw_size = pci_resource_len(ndev->ntb.pdev, bar);
+   mw_size = pci_resource_len(ntb->pdev, bar);
 
/* make sure the range fits in the usable mw size */
if (size > mw_size)
@@ -135,7 +135,7 @@ static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int 
idx,
mmio = ndev->self_mmio;
peer_mmio = ndev->peer_mmio;
 
-   base_addr = pci_resource_start(ndev->ntb.pdev, bar);
+   base_addr = pci_resource_start(ntb->pdev, bar);
 
if (bar != 1) {
xlat_reg = AMD_BAR23XLAT_OFFSET + ((bar - 2) << 3);
@@ -226,7 +226,7 @@ static int amd_ntb_link_is_up(struct ntb_dev *ntb,
if (width)
*width = NTB_LNK_STA_WIDTH(ndev->lnk_sta);
 
-   dev_dbg(ndev_dev(ndev), "link is up.\n");
+   dev_dbg(>pdev->dev, "link is up.\n");
 
ret = 1;
} else {
@@ -235,7 +235,7 @@ static int amd_ntb_link_is_up(struct ntb_dev *ntb,
if (width)
*width = NTB_WIDTH_NONE;
 
-   dev_dbg(ndev_dev(ndev), "link is down.\n");
+   dev_dbg(>pdev->dev, "link is down.\n");
}
 
return ret;
@@ -255,7 +255,7 @@ static int amd_ntb_link_enable(struct ntb_dev *ntb,
 
if (ndev->ntb.topo == NTB_TOPO_SEC)
return -EINVAL;
-   dev_dbg(ndev_dev(ndev), "Enabling Link.\n");
+   dev_dbg(>pdev->dev, "Enabling Link.\n");
 
ntb_ctl = readl(mmio + AMD_CNTL_OFFSET);
ntb_ctl |= (PMM_REG_CTL | SMM_REG_CTL);
@@ -276,7 +276,7 @@ static int amd_ntb_link_disable(struct ntb_dev *ntb)
 
if (ndev->ntb.topo == NTB_TOPO_SEC)
return -EINVAL;
-   dev_dbg(ndev_dev(ndev), "Enabling Link.\n");
+   dev_dbg(>pdev->dev, "Enabling Link.\n");
 
ntb_ctl = readl(mmio + AMD_CNTL_OFFSET);
ntb_ctl &= ~(PMM_REG_CTL | SMM_REG_CTL);
@@ -467,18 +467,19 @@ static void amd_ack_smu(struct amd_ntb_dev *ndev, u32 bit)
 static void amd_handle_event(struct amd_ntb_dev *ndev, int vec)
 {
void __iomem *mmio = ndev->self_mmio;
+   struct device *dev = >ntb.pdev->dev;
u32 status;
 
status = readl(mmio + AMD_INTSTAT_OFFSET);
if (!(status & AMD_EVENT_INTMASK))
return;
 
-   dev_dbg(ndev_dev(ndev), "status = 0x%x and vec = %d\n", status, vec);
+   dev_dbg(dev, "status = 0x%x and vec = %d\n", status, vec);
 
status &= AMD_EVENT_INTMASK;
switch (status) {
case AMD_PEER_FLUSH_EVENT:
-   dev_info(ndev_dev(ndev), "Flush is done.\n");
+   dev_info(dev, "Flush is done.\n");
break;
case AMD_PEER_RESET_EVENT:
amd_ack_smu(ndev, AMD_PEER_RESET_EVENT);
@@ -502,7 +503,7 @@ static void amd_handle_event(struct amd_ntb_dev *ndev, int 
vec)
status = readl(mmio + AMD_PMESTAT_OFFSET);
/* check if this is WAKEUP event */
if (status & 0x1)
-   dev_info(ndev_dev(ndev), "Wakeup is done.\n");
+   dev_info(dev, "Wakeup is done.\n");
 
amd_ack_smu(ndev, AMD_PEER_D0_EVENT);
 
@@ -511,14 +512,14 @@ static void amd_handle_event(struct amd_ntb_dev *ndev, 
int vec)
  AMD_LINK_HB_TIMEOUT);
break;
default:
-   dev_info(ndev_dev(ndev), "event status = 0x%x.\n", status);
+   dev_info(dev, "event status = 0x%x.\n", status);
   

  1   2   3   4   5   6   7   8   9   10   >