[PATCH] Added another USB product ID for ELAN touchscreen quirks.
I've had the same issue as described in commit c68929f75dfcb6354918862b91b5778585de1fa5 Except my touchscreen's ID is ID 04f3:0125 Elan Microelectronics Corp. Signed-off-by: Logan Gunthorpe log...@deltatee.com --- drivers/usb/core/quirks.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c index 41e510a..d85abfe 100644 --- a/drivers/usb/core/quirks.c +++ b/drivers/usb/core/quirks.c @@ -106,6 +106,9 @@ static const struct usb_device_id usb_quirk_list[] = { { USB_DEVICE(0x04f3, 0x010c), .driver_info = USB_QUIRK_DEVICE_QUALIFIER }, + { USB_DEVICE(0x04f3, 0x0125), .driver_info = + USB_QUIRK_DEVICE_QUALIFIER }, + { USB_DEVICE(0x04f3, 0x016f), .driver_info = USB_QUIRK_DEVICE_QUALIFIER }, -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings
Hi Dan, We've uncovered another issue during testing with these patches. We get a kernel panic sometimes just while using a DAX filesystem. I've traced the issue back to this patch. (There's a stack trace at the end of this email.) On 22/09/15 10:42 PM, Dan Williams wrote: +static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr) +{ + struct block_device *bdev = bh->b_bdev; + struct request_queue *q = bdev->bd_queue; + + if (IS_ERR(addr)) + return; + blk_dax_put(q); } @@ -127,9 +159,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, if (pos == bh_max) { bh->b_size = PAGE_ALIGN(end - pos); bh->b_state = 0; - retval = get_block(inode, block, bh, - iov_iter_rw(iter) == WRITE); - if (retval) + rc = get_block(inode, block, bh, rw == WRITE); + if (rc) break; if (!buffer_size_valid(bh)) bh->b_size = 1 << blkbits; @@ -178,8 +213,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, if (need_wmb) wmb_pmem(); + dax_unmap_bh(bh, kmap); - return (pos == start) ? retval : pos - start; + return (pos == start) ? rc : pos - start; } The problem is if get_block fails and returns an error code, it will still call dax_unmap_bh which tries to dereference bh->b_bdev. However, seeing get_block failed, that pointer is NULL. Maybe a null check in dax_unmap_bh would be sufficient? Thanks, Logan -- [ 35.391790] BUG: unable to handle kernel NULL pointer dereference at 00a0 [ 35.393306] IP: [] dax_unmap_bh+0x41/0x70 [ 35.394253] PGD 7c7ed067 PUD 7c01f067 PMD 0 [ 35.395020] Oops: [#1] SMP [ 35.395597] Modules linked in: mtramonb(O) [ 35.396320] CPU: 0 PID: 1501 Comm: dd Tainted: G O4.3.0-rc2+ #51 [ 35.397500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 35.399728] task: 88007bc6d600 ti: 88007925 task.ti: 88007925 [ 35.401006] RIP: 0010:[] [] dax_unmap_bh+0x41/0x70 [ 35.402402] RSP: 0018:880079253ab8 EFLAGS: 00010246 [ 35.403321] RAX: RBX: RCX: 0012 [ 35.404550] RDX: 0007 RSI: 0246 RDI: 8181f630 [ 35.405783] RBP: 880079253ac8 R08: 000a R09: fffe [ 35.407011] R10: 88007fc1ad90 R11: 6abc R12: fffb [ 35.408237] R13: 0038e000 R14: 0038e000 R15: 0038e000 [ 35.409466] FS: 7f0e81476700() GS:88007fc0() knlGS: [ 35.410830] CS: 0010 DS: ES: CR0: 80050033 [ 35.411796] CR2: 00a0 CR3: 7926f000 CR4: 06f0 [ 35.412990] Stack: [ 35.413340] ffe4 880079253cf0 880079253be0 811d2259 [ 35.414670] 0246 81203120 880079253e68 [ 35.415981] 00017c394800 0038e000 0001 fffb [ 35.417298] Call Trace: [ 35.417727] [] dax_do_io+0x199/0x700 [ 35.418615] [] ? _ext4_get_block+0x200/0x200 [ 35.419819] [] ? jbd2_journal_stop+0x60/0x390 [ 35.420886] [] ext4_ind_direct_IO+0x8d/0x410 [ 35.421908] [] ext4_direct_IO+0x2da/0x540 [ 35.422859] [] ? ext4_dirty_inode+0x5c/0x70 [ 35.423841] [] generic_file_direct_write+0xaa/0x170 [ 35.424931] [] __generic_file_write_iter+0xc2/0x1f0 [ 35.426027] [] ext4_file_write_iter+0x13b/0x420 [ 35.427066] [] ? pick_next_entity+0xb2/0x190 [ 35.428061] [] __vfs_write+0xa7/0xf0 [ 35.428940] [] vfs_write+0xa9/0x190 [ 35.429810] [] ? vfs_read+0x114/0x130 [ 35.430706] [] SyS_write+0x46/0xa0 [ 35.431561] [] entry_SYSCALL_64_fastpath+0x12/0x71 [ 35.432637] Code: fe 48 c7 c7 4c 63 7e 81 e8 11 ec f5 ff 48 8b 5b 30 48 c7 c7 52 63 7e 81 31 c0 48 89 de e8 fc eb f5 ff 31 c0 48 c7 c7 30 f6 81 81 <48> 8b 9b a0 00 00 00 e8 e7 eb f5 ff 49 81 fc 00 f0 ff ff 77 08 [ 35.437029] RIP [] dax_unmap_bh+0x41/0x70 [ 35.438005] RSP [ 35.438608] CR2: 00a0 [ 35.439194] ---[ end trace 323695b29b46dd96 ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
Hi Dan, We've been doing some experimenting and testing with this patchset. Specifically, we are trying to use you're ZONE_DEVICE work to enable peer to peer PCIe transfers. This is actually working pretty well (though we're still testing and working through some things). However, we've found a couple of issues: On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote: > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 3d6baa7d4534..20097e7b679a 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -49,12 +49,16 @@ struct page { >* updated asynchronously */ > union { > struct address_space *mapping; /* If low bit clear, points to > - * inode address_space, or NULL. > + * inode address_space, unless > + * the page is in ZONE_DEVICE > + * then it points to its parent > + * dev_pagemap, otherwise NULL. >* If page mapped as anonymous >* memory, low bit is set, and >* it points to anon_vma object: >* see PAGE_MAPPING_ANON below. >*/ > + struct dev_pagemap *pgmap; > void *s_mem;/* slab first object */ > }; When you add to this union and overide the mapping value, we see bugs in calls to set_page_dirty when it tries to dereference mapping. I believe a change to page_mapping is required such as the patch that's at the end of this email. > diff --git a/mm/gup.c b/mm/gup.c > index a798293fc648..1064e9a489a4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -98,7 +98,16 @@ retry: > } > > page = vm_normal_page(vma, address, pte); > - if (unlikely(!page)) { > + if (!page && pte_devmap(pte) && (flags & FOLL_GET)) { > + /* > + * Only return device mapping pages in the FOLL_GET case since > + * they are only valid while holding the pgmap reference. > + */ > + if (get_dev_pagemap(pte_pfn(pte), NULL)) > + page = pte_page(pte); > + else > + goto no_page; > + } else if (unlikely(!page)) { I've found that if a driver creates a ZONE_DEVICE mapping but doesn't create the pagemap (using devm_register_pagemap) then the get_user_pages code will go into an infinite loop. I'm not really sure if this as an issue or not but it seems a bit undesirable for a buggy driver to be able to cause this. My thoughts are that either devm_register_pagemap needs to be done by devm_memremap_pages so a driver cannot use one without the other, or the GUP code needs to return EFAULT if no pagemap was registered so it doesn't loop forever. Thanks! Logan diff --git a/mm/util.c b/mm/util.c index 68ff8a5..19af683 100644 --- a/mm/util.c +++ b/mm/util.c @@ -368,6 +368,9 @@ struct address_space *page_mapping(struct page *page) return swap_address_space(entry); } + if (unlikely(is_zone_device_page(page))) + return NULL; + mapping = (unsigned long)page->mapping; if (mapping & PAGE_MAPPING_FLAGS) return NULL; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
Hi Dan, Good to know you've already addressed the struct page issue. We'll watch out for an updated patchset to try. On 02/10/15 03:53 PM, Dan Williams wrote: Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but the test report is welcome nonetheless. The definition of dma_addr_t is the device view of host memory, not necessarily the device view of a peer device's memory range, so I expect you'll run into issues with IOMMUs and other parts of the kernel that assume this definition. Yeah, we've actually been doing this with a number of more "hacky" techniques for some time. ZONE_DEVICE just provides us with a much cleaner way to set this up that doesn't require patching around get_user_pages in various places in the kernel. We've never had any issues with the IOMMU getting in the way (at least on Intel x86). My understanding always was that the IOMMU sits between a PCI card and main memory; it doesn't get in the way of peer-to-peer transfers. Though admittedly, I don't have a complete understanding of how the IOMMU works in the kernel. I'm just speaking from experimental experience. We've never actually tried this on other architectures. Thanks, Logan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
On 02/10/15 03:53 PM, Dan Williams wrote: Yes, this location for dev_pagemap will not work. I've since moved it to a union with the lru list_head since ZONE_DEVICE pages memory should always have an elevated page count and never land on a slab allocator lru. Oh, also, I was actually hoping to make use of the lru list_head in the future with ZONE_DEVICE memory. One thought I had was once we have a PCIe device with a BAR space, we'd then need to have a way of allocating these buffers when user space needs them. The simple way I was thinking was to just use the lru list head to store lists of used and unused pages -- though there are probably other solutions to this that don't require using struct pages. Thanks, Logan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] ntb_transport: Check the number of spads the hardware supports
Hi Jon, Thanks for the feedback. I'll send an updated patch in a moment. On 04/06/16 09:40 AM, Jon Mason wrote: > Nit, please add spaces around '*' (per checkpatch) I'll change this, but I did run it through checkpatch and it did not warn about this. > Please explicitly point out that this is being modified in the commit > message. I don't see them being used, so probably not a big deal > (unless Dave Jiang has something queued that will use it). Done. I feel like he can always add them back in when he adds the functionality. This way, when he does, MAX_SPAD will be updated and the check will still be correct. > Move this check above the dev_to_node assignment above. Done. Logan
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
Hey Rafael, Awesome, this patch fixes the problem! Nice work. Thanks, Logan On 12/06/16 08:31 AM, Rafael J. Wysocki wrote: On Saturday, June 11, 2016 10:48:08 PM Logan Gunthorpe wrote: Hey, Hi, On 11/06/16 07:05 PM, Rafael J. Wysocki wrote: 1) Commit ab76f7b4ab only extends the NX bit between __ex_table and rodata; which, by my understanding, shouldn't be used by anything. And __ex_table and rodata are fixed by the kernel's binary so both symbols should be the same in both the image kernel and the boot kernel given that both are running from the same binary. Well, what if the kernel is relocated? Ah, I'm sure I don't fully grasp the implications of that but I would assume that if the image kernel were located somewhere else it would still be far away from the boot kernel's ex_table/rodata boundary. Probably. 2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though, when it's in place, it only works some of the time. Given that commit is only extending the NX region a bit, if there is some random mismatch, why does it never reach rodata? In other words, why is rodata a magic line that seems to work all the time -- why doesn't this random mismatch ever extend into the rodata region? rodata isn't _that_ far away from the end of ex_table. That's a very good question. :-) Yeah, I guess if we knew the answer we'd understand what was going on and have a fix. Right. Appended is one more patch to try. It actually fixes a theoretical problem, so I'll need to add comments to it (as it is far from obvious IMO) and a changelog etc and post it as a proper submission. So the concern is that the page copying done in the loop in core_restore_code() may corrupt the kernel text part of the temporary memory mapping used at that time, because that's the original kernel text mapping from the boot kernel. That doesn't matter for the loop itself (as that code runs from a "safe" page guaranteed not to be overwritten), but it (quite theoretically) may matter for the final jump to the image kernel's restore_registers(). [I realized that it might be a problem only after I had started to think about the problem you reported.] As a bonus, the patch also eliminates the possible concern about the executability of the memory mapped via the kernel text mapping in the boot kernel, so IMO it's worth to give it a shot. I've tested it lightly on one machine, but I guess it would just crash right away if there were any problems in it. Thanks, Rafael --- arch/x86/power/hibernate_64.c | 46 ++ arch/x86/power/hibernate_asm_64.S | 28 +-- 2 files changed, 58 insertions(+), 16 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_ * Address to jump to in the last phase of restore in order to get to the image * kernel's text (this value is passed in the image header). */ -unsigned long restore_jump_address __visible; +void *restore_jump_address __visible; +unsigned long jump_address_phys; /* * Value of the cr3 register from before the hibernation (this value is passed @@ -37,6 +38,9 @@ unsigned long restore_cr3 __visible; pgd_t *temp_level4_pgt __visible; +void *restore_pgd_addr __visible; +pgd_t restore_pgd __visible; + void *relocated_restore_code __visible; static void *alloc_pgt_page(void *context) @@ -44,6 +48,33 @@ static void *alloc_pgt_page(void *contex return (void *)get_safe_page(GFP_ATOMIC); } +static int prepare_temporary_text_mapping(void) +{ + unsigned long vaddr = (unsigned long)restore_jump_address; + unsigned long paddr = jump_address_phys & PMD_MASK; + pmd_t *pmd; + pud_t *pud; + + pud = (pud_t *)get_safe_page(GFP_ATOMIC); + if (!pud) + return -ENOMEM; + + restore_pgd = __pgd(__pa(pud) | _KERNPG_TABLE); + + pud += pud_index(vaddr); + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC); + if (!pmd) + return -ENOMEM; + + set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); + + pmd += pmd_index(vaddr); + set_pmd(pmd, __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC)); + + restore_pgd_addr = temp_level4_pgt + pgd_index(vaddr); + return 0; +} + static int set_up_temporary_mappings(void) { struct x86_mapping_info info = { @@ -63,6 +94,10 @@ static int set_up_temporary_mappings(voi set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map), init_level4_pgt[pgd_index(__START_KERNEL_map)]); + result = prepare_temporary_text_mapping(); + if (result) + return result; + /* Set up the direct mapping from scratch */ for (i = 0; i < nr_pfn_mapped; i++) { mstart = pf
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
Hey, On 10/06/16 12:09 PM, Kees Cook wrote: >> restore_code: 880157c3b000 >> jump_addr: 81446be0 >> >> >> diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c >> index 009947d..6efedb7 100644 >> --- a/arch/x86/power/hibernate_64.c >> +++ b/arch/x86/power/hibernate_64.c >> @@ -92,6 +92,9 @@ int swsusp_arch_resume(void) >> memcpy(relocated_restore_code, _restore_code, >>_registers - _restore_code); >> >> + pr_info("restore_code: %p\n", relocated_restore_code); >> + pr_info("jump_addr: %lx\n", restore_jump_address); >> + > > Also interesting would be the "relocated_restore_code" address, as > well as a dump of /sys/kernel/debug/kernel_page_tables (from > CONFIG_X86_PTDUMP). Is that not what I printed? If not, can you give me a better hint as to what you're looking for so I can spin another kernel? I'll also provide the kernel_page_tables once I do that. > I'm baffled by the problem, but the best I can understand is the the > relocated_restore_code range isn't executable (which should be visible > from finding it in /sys/kernel/debug/kernel_page_tables), but I don't > see how to solve that since my original patch didn't work. Yeah this is definitely a baffling problem. Thanks, Logan
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
On 10/06/16 04:29 PM, Rafael J. Wysocki wrote: > OK, I have a theory, but I need a bit of help. > > This may be a dumb question, but I don't quite remember the answer readily. > > Given a physical address, how do I get the corresponding virtual one under > the kernel identity mapping? Is that not just 'phys_to_virt(x)'?? Logan
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
Hey Rafael, Thank for looking into this. I tried the patch below applied to v4.6 and I still got a lockup on resume. Additionally there was a kernel warning at arch/x86/mm/pageattr.c:1414 change_page_attr_set_clr+0x2bb/0x440 and another one right afterwards at kernel/smp.c:416 smp_call_function_many+0xb5/0x250. Regardless, Ill try the other patch you sent shortly. Logan On 11/06/16 05:48 AM, Rafael J. Wysocki wrote: On Saturday, June 11, 2016 03:47:59 AM Rafael J. Wysocki wrote: On Saturday, June 11, 2016 02:13:45 AM Rafael J. Wysocki wrote: On Saturday, June 11, 2016 12:33:31 AM Rafael J. Wysocki wrote: On Friday, June 10, 2016 04:28:01 PM Logan Gunthorpe wrote: On 10/06/16 04:29 PM, Rafael J. Wysocki wrote: OK, I have a theory, but I need a bit of help. This may be a dumb question, but I don't quite remember the answer readily. Given a physical address, how do I get the corresponding virtual one under the kernel identity mapping? Is that not just 'phys_to_virt(x)'?? Ah that. Thanks! So first, appended is a cleanup of the 64-bit hibernate/restore code that hopefully will make it a bit clearer what happens there. In particular, it follows from it quite clearly that the restore will only work if the virtual address of the trampoline (_registers) in the image kernel happens to match the virtual address of the same chunk of memory in the boot kernel (and after it has switched to the temporary page tables). That appears to be the case most of the time, or hibernation would randomly break for people all over, but I'm not actually sure if that's guaranteed to happen. If not, that very well may explain unexpected failures in there. If it is not guaranteed to happen, the solution would be to pass the physical address of that memory from the image kernel to the boot kernel instead of its virtual address, OK, the appended patch does the trick for me. Logan, can you please try it? It shouldn't break things, but I'm wondering if the problem you're having after commit ab76f7b4ab is still reproducible with it. No, it won't help, and I think I know what's going on. The temporary page tables set up by set_up_temporary_mappings() re-use the original boot kernel's text mapping, so if the trampoline code address falls into a non-executable page in that mapping, the boot kernel can't pass control to the image kernel. If that theory is correct, the patch below should help, but IMO it should be applied regardless to eliminate this particular failure mode. I'll prepare another patch to pass the physical address on top of this one. Thanks, Rafael --- From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Subject: [PATCH] x86 / hibernate: Ensure that 64-bit trampoline code is executable During resume from hibernation, the boot kernel has to jump to the image kernel's trampoline code to pass control to it. On x86_64 the address of that code is passed from the image kernel to the boot kernel in the image header. That address is interpreted with respect to the original boot kernel's kernel text memory mapping, so if the page containing it is not executable in that mapping, the jump to it will crash the kernel, so prevent that from happening. While at it, clean up the way in which the trampoline code address is saved by the image kernel (assembly code is involved in that unnecessarily etc). Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com> --- arch/x86/power/hibernate_64.c | 15 --- arch/x86/power/hibernate_asm_64.S |3 --- 2 files changed, 12 insertions(+), 6 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -12,6 +12,7 @@ #include #include +#include #include #include #include @@ -27,7 +28,7 @@ extern asmlinkage __visible int restore_ * Address to jump to in the last phase of restore in order to get to the image * kernel's text (this value is passed in the image header). */ -unsigned long restore_jump_address __visible; +void *restore_jump_address __visible; /* * Value of the cr3 register from before the hibernation (this value is passed @@ -91,6 +92,14 @@ int swsusp_arch_resume(void) return -ENOMEM; memcpy(relocated_restore_code, _restore_code, _registers - _restore_code); + /* +* The temporary memory mapping set up by set_up_temporary_mappings() +* above re-uses the original kernel text mapping. If the page with +* restore_jump_address is not executable in that mapping, it won't be +* possible to pass control to the image kernel, so prevent that from +* happening. +*/ + set_memory_x((unsigned long)restore_jump_address, 1); restore_image(); return 0; @@ -108,7 +117,7 @@ int pfn_is_nosave(unsigned long pfn)
Re: [PATCH 5/8] ntb_tool: BUG: Ensure the buffer size is large enough to return all spads
Ok, I'll add a comment. Logan On 10/06/16 08:35 PM, Allen Hubbe wrote: On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote: On hardware with 32 scratchpad registers the spad field in ntb tool could chop off the end. The maximum buffer size is increased from 256 to 15 times the number or scratchpads. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> It could be marginally better if there was an explanation to accompany the magic number 15, but it's not a big deal. One might guess it has something to do with the expected length of the formatted string. Acked-by: Allen Hubbe <allen.hu...@emc.com> --- drivers/ntb/test/ntb_tool.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c index 4c01057..954e1d5 100644 --- a/drivers/ntb/test/ntb_tool.c +++ b/drivers/ntb/test/ntb_tool.c @@ -368,7 +368,9 @@ static ssize_t tool_spadfn_read(struct tool_ctx *tc, char __user *ubuf, if (!spad_read_fn) return -EINVAL; - buf_size = min_t(size_t, size, 0x100); + spad_count = ntb_spad_count(tc->ntb); + + buf_size = min_t(size_t, size, spad_count * 15); buf = kmalloc(buf_size, GFP_KERNEL); if (!buf) @@ -376,7 +378,6 @@ static ssize_t tool_spadfn_read(struct tool_ctx *tc, char __user *ubuf, pos = 0; - spad_count = ntb_spad_count(tc->ntb); for (i = 0; i < spad_count; ++i) { pos += scnprintf(buf + pos, buf_size - pos, "%d\t%#x\n", i, spad_read_fn(tc->ntb, i)); -- 2.1.4 -- You received this message because you are subscribed to the Google Groups "linux-ntb" group. To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscr...@googlegroups.com. To post to this group, send email to linux-...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/d9488f2c946644c2b1258a78929d3543747283ec.1465598632.git.logang%40deltatee.com. For more options, visit https://groups.google.com/d/optout.
Re: [PATCH 6/8] ntb_tool: Add link status file to debugfs
Hey Allen, Thanks for the feedback it's a bit more complicated but I don't object to that. I'll work something up on Monday. I was trying to avoid adding link controls, but if we do, would you say the module should still enable the link when it's installed? Or would we have the user explicitly have to enable the link before using it? Thanks, Logan On 10/06/16 08:27 PM, Allen Hubbe wrote: On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote: In order to more successfully script with ntb_tool it's useful to have a link file to check the link status so that the script doesn't use the other files until the link is up. This commit adds a 'link' file to the debugfs directory which reads 0 or 1 depending on the link status. For scripting convenience, writing will block until the link is up (discarding anything that was written). Signed-off-by: Logan Gunthorpe <log...@deltatee.com> --- drivers/ntb/test/ntb_tool.c | 45 + 1 file changed, 45 insertions(+) diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c index 954e1d5..116352e 100644 --- a/drivers/ntb/test/ntb_tool.c +++ b/drivers/ntb/test/ntb_tool.c @@ -59,6 +59,12 @@ * * Eg: check if clearing the doorbell mask generates an interrupt. * + * # Check the link status + * root@self# cat $DBG_DIR/link + * + * # Block until the link is up + * root@self# echo > $DBG_DIR/link I think a file to get and set the link status is a good idea, but the way it is done as proposed here is not in a similar style to other ntb_tool operations. Other operations simply read a register and format the value, or scan a value and write a register. Similarly, I think the link status could be done in the same way: use the read file operation to get the current status with ntb_link_is_up(), and use the file write operation to enable or disable the link with ntb_link_enable() and ntb_link_disable(). Waiting for link status is an interesting concept, too. Really, one might be interested in a change in link status, whether up or down. What about a link event file that supports write to arm the event, and read to block for the event. Consider an implementation based on . It would be used in combination with the link status file, above, as follows. 1: Write 1 to the event file. This arms the event. - The event will be disarmed by the next tool_link_event(). 2: The application may read the link status file if it is interested in waiting for a particular event. 3. The application may wait for an event by reading the event file - The application will wait as long as the event is still armed. - If the event was disarmed before waiting, the application will not block. 4. The application should read the link status again. In any case, I think it would be more expected and natural to block while reading a file versus writing it. + * * # Set the doorbell mask * root@self# echo 's 1' > $DBG_DIR/mask * @@ -127,6 +133,7 @@ struct tool_ctx { struct work_struct link_cleanup; bool link_is_up; struct delayed_work link_work; + wait_queue_head_t link_wq; int mw_count; struct tool_mw mws[MAX_MWS]; }; @@ -237,6 +244,7 @@ static void tool_link_work(struct work_struct *work) "Error setting up memory windows: %d\n", rc); tc->link_is_up = true; + wake_up(>link_wq); } static void tool_link_cleanup(struct work_struct *work) @@ -573,6 +581,39 @@ static TOOL_FOPS_RDWR(tool_peer_spad_fops, tool_peer_spad_read, tool_peer_spad_write); +static ssize_t tool_link_read(struct file *filep, char __user *ubuf, + size_t size, loff_t *offp) +{ + struct tool_ctx *tc = filep->private_data; + char *buf; + ssize_t pos, rc; + + buf = kmalloc(64, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + pos = scnprintf(buf, 64, "%d\n", tc->link_is_up); + rc = simple_read_from_buffer(ubuf, size, offp, buf, pos); + + kfree(buf); + + return rc; +} + +static ssize_t tool_link_write(struct file *filep, const char __user *ubuf, + size_t size, loff_t *offp) +{ + struct tool_ctx *tc = filep->private_data; + + if (wait_event_interruptible(tc->link_wq, tc->link_is_up)) + return -ERESTART; + + return size; +} + +static TOOL_FOPS_RDWR(tool_link_fops, + tool_link_read, + tool_link_write); static ssize_t tool_mw_read(struct file *filep, char __user *ubuf, size_t size, loff_t *offp) @@ -708,6 +749,9 @@ static void tool_setup_dbgfs(struct tool_ctx *tc) debugfs_create_file("peer_spad", S_IRUSR | S_IWUSR, tc->dbgfs, tc, _peer_spad_
Re: [PATCH 7/8] ntb_pingpong: Add a debugfs file to get the ping count
The pp_debugfs_dir is already initialized by the module init function. If it doesn't exist here, I think we should just return instead of trying again. It's also worth noting, though it is probably no harm, the code here does not check debugfs_initialized(). +static int __init tool_init(void) This should be pp_init() not tool_init(). Yup, this is just sloppy copying on my part. I copied from two different places. I'll fix these oversights. Thanks, Logan
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
Hey Rafael, I tried this patch as well and there was no change. I have a couple tentative observations to make though. None of this is 100% clear to me so please correct me if I'm wrong anywhere: 1) Commit ab76f7b4ab only extends the NX bit between __ex_table and rodata; which, by my understanding, shouldn't be used by anything. And __ex_table and rodata are fixed by the kernel's binary so both symbols should be the same in both the image kernel and the boot kernel given that both are running from the same binary. 2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though, when it's in place, it only works some of the time. Given that commit is only extending the NX region a bit, if there is some random mismatch, why does it never reach rodata? In other words, why is rodata a magic line that seems to work all the time -- why doesn't this random mismatch ever extend into the rodata region? rodata isn't _that_ far away from the end of ex_table. Anyway, thanks again for looking into this. Logan On 10/06/16 07:47 PM, Rafael J. Wysocki wrote: On Saturday, June 11, 2016 02:13:45 AM Rafael J. Wysocki wrote: On Saturday, June 11, 2016 12:33:31 AM Rafael J. Wysocki wrote: On Friday, June 10, 2016 04:28:01 PM Logan Gunthorpe wrote: On 10/06/16 04:29 PM, Rafael J. Wysocki wrote: OK, I have a theory, but I need a bit of help. This may be a dumb question, but I don't quite remember the answer readily. Given a physical address, how do I get the corresponding virtual one under the kernel identity mapping? Is that not just 'phys_to_virt(x)'?? Ah that. Thanks! So first, appended is a cleanup of the 64-bit hibernate/restore code that hopefully will make it a bit clearer what happens there. In particular, it follows from it quite clearly that the restore will only work if the virtual address of the trampoline (_registers) in the image kernel happens to match the virtual address of the same chunk of memory in the boot kernel (and after it has switched to the temporary page tables). That appears to be the case most of the time, or hibernation would randomly break for people all over, but I'm not actually sure if that's guaranteed to happen. If not, that very well may explain unexpected failures in there. If it is not guaranteed to happen, the solution would be to pass the physical address of that memory from the image kernel to the boot kernel instead of its virtual address, OK, the appended patch does the trick for me. Logan, can you please try it? It shouldn't break things, but I'm wondering if the problem you're having after commit ab76f7b4ab is still reproducible with it. Thanks, Rafael --- From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Subject: [PATCH] x86 / hibernate: Pass physical address of the trampoline During resume from hibernation, the boot kernel has to jump to the image kernel's trampoline code to pass control to it. On x86_64 the address of that code is passed from the image kernel to the boot kernel in the image header. Currently, the way in which that address is saved by the image kernel is not entirely straightforward (assembly code is involved in that unnecessarily etc). Moreover, the current code passes the virtual address of the trampoline with the assumption that the image and boot kernels will always use the same virtual addresses for the chunk of physical address space occupied by the trampoline code. In case that assumption is not valid, pass the physical address of the trampoline code from the image kernel to the boot kernel and update RESTORE_MAGIC to avoid situations in which the boot kernel may use a different trampoline address passing convention from the one used by the image kernel. Also drop the assembly code chunk updating restore_jump_address in swsusp_arch_suspend() as it is not necessary any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com> --- arch/x86/power/hibernate_64.c |8 arch/x86/power/hibernate_asm_64.S |3 --- 2 files changed, 4 insertions(+), 7 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -27,7 +27,7 @@ extern asmlinkage __visible int restore_ * Address to jump to in the last phase of restore in order to get to the image * kernel's text (this value is passed in the image header). */ -unsigned long restore_jump_address __visible; +void *restore_jump_address __visible; /* * Value of the cr3 register from before the hibernation (this value is passed @@ -113,7 +113,7 @@ struct restore_data_record { unsigned long magic; }; -#define RESTORE_MAGIC 0x0123456789ABCDEFUL +#define RESTORE_MAGIC 0x0123456789ABCDF0UL /** *arch_hibernation_header_save - populate the architecture specific part @@ -126,7 +126,7 @@ int arch_hibernation_head
Re: PROBLEM: Resume form hibernate broken by setting NX on gap
Hey, On 11/06/16 07:05 PM, Rafael J. Wysocki wrote: 1) Commit ab76f7b4ab only extends the NX bit between __ex_table and rodata; which, by my understanding, shouldn't be used by anything. And __ex_table and rodata are fixed by the kernel's binary so both symbols should be the same in both the image kernel and the boot kernel given that both are running from the same binary. Well, what if the kernel is relocated? Ah, I'm sure I don't fully grasp the implications of that but I would assume that if the image kernel were located somewhere else it would still be far away from the boot kernel's ex_table/rodata boundary. 2) When ab76f7b4ab is reverted, hibernation seems to work 100%. Though, when it's in place, it only works some of the time. Given that commit is only extending the NX region a bit, if there is some random mismatch, why does it never reach rodata? In other words, why is rodata a magic line that seems to work all the time -- why doesn't this random mismatch ever extend into the rodata region? rodata isn't _that_ far away from the end of ex_table. That's a very good question. :-) Yeah, I guess if we knew the answer we'd understand what was going on and have a fix. Can you please check if the patch below makes any difference? I'm afraid it's no different. The kernel still freezes on resume. Though, no warnings with this one. Thanks, Logan
Re: [PATCH 6/8] ntb_tool: Add link status file to debugfs
On 14/06/16 09:45 AM, Allen Hubbe wrote: > > Feel free to disregard my suggestion above. I hope my comment has not cost > you too much time. > > The way you have written it already, and used it in the self-test script is > much more concise. > >>> + * root@self# echo > $DBG_DIR/link > > Acked-by: allen.hu...@emc.com > > > > Eventually, I think it would be useful to let ntb_tool enable and disable the > link. In that case, it might also be useful in a test script to wait for > link down, not just link up. > > What about this: > > # Wait for the link to be up or down > root@self# echo 1 > $DBG_DIR/link > root@self# echo 0 > $DBG_DIR/link > > It need not be a part of this patch, but eventually: > > # Enable or disable the link > root@self# echo 1 > $DBG_DIR/link_ctrl > root@self# echo 0 > $DBG_DIR/link_ctrl > > # Reading the link_ctrl file can also give the link status > root@self# cat $DBG_DIR/link_ctrl > > Finally, I wonder if the file called "link" in this patch should be called > "link_wait" or similar, so its purpose is obviously not for enabling and > disabling the link. > Actually I've already implemented something similar to your original suggestion. I'll be submitting a v2 of this set shortly. Logan
Re: [PATCH 8/8] ntb_test: Add a selftest script for the NTB subsystem
On 14/06/16 09:47 AM, Allen Hubbe wrote: > From: Logan Gunthorpe >> This script automates testing doorbells, scratchpads and memory windows >> for an NTB device. It can be run locally, with the NTB looped >> back to the same host or use SSH to remotely control the second host. >> >> In the single host case, the script just needs to be passed two >> arguments: a PCI ID for each side of the link. In the two host case >> the -r option must be used to specify the remote hostname (which must >> be SSH accessible and should probably have ssh-keys exchanged). >> >> A sample run looks like this: >> >> $ sudo ./ntb_test.sh :03:00.1 :83:00.1 -p 29 >> Starting ntb_tool tests... >> Running db tests on: :03:00.1 / :83:00.1 >> Passed >> Running db tests on: :83:00.1 / :03:00.1 >> Passed >> Running spad tests on: :03:00.1 / :83:00.1 >> Passed >> Running spad tests on: :83:00.1 / :03:00.1 >> Passed >> Running mw0 tests on: :03:00.1 / :83:00.1 >> Passed >> Running mw0 tests on: :83:00.1 / :03:00.1 >> Passed >> Running mw1 tests on: :03:00.1 / :83:00.1 >> Passed >> Running mw1 tests on: :83:00.1 / :03:00.1 >> Passed >> >> Starting ntb_pingpong tests... >> Running ping pong tests on: :03:00.1 / :83:00.1 >> Passed >> >> Starting ntb_perf tests... >> Running local perf test without DMA >> 0: copied 536870912 bytes in 238205 usecs, 2253 MBytes/s >> Passed >> Running remote perf test without DMA >> 0: copied 536870912 bytes in 238205 usecs, 2253 MBytes/s >> Passed >> >> Signed-off-by: Logan Gunthorpe <log...@deltatee.com> >> --- >> MAINTAINERS | 1 + >> tools/testing/selftests/ntb/ntb_test.sh | 386 >> >> 2 files changed, 387 insertions(+) >> create mode 100755 tools/testing/selftests/ntb/ntb_test.sh >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 9c567a4..f178e7e 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -7846,6 +7846,7 @@ F: drivers/ntb/ >> F: drivers/net/ntb_netdev.c >> F: include/linux/ntb.h >> F: include/linux/ntb_transport.h >> +F: tools/testing/selftests/ntb/ >> >> NTB INTEL DRIVER >> M: Jon Mason <jdma...@kudzu.us> >> diff --git a/tools/testing/selftests/ntb/ntb_test.sh >> b/tools/testing/selftests/ntb/ntb_test.sh >> new file mode 100755 >> index 000..e4a89e9 >> --- /dev/null >> +++ b/tools/testing/selftests/ntb/ntb_test.sh >> @@ -0,0 +1,386 @@ >> +#!/bin/bash >> +# Copyright (c) 2016 Microsemi. All Rights Reserved. >> +# >> +# This program is free software; you can redistribute it and/or >> +# modify it under the terms of the GNU General Public License as >> +# published by the Free Software Foundation; either version 2 of >> +# the License, or (at your option) any later version. >> +# >> +# This program is distributed in the hope that it would be useful, >> +# but WITHOUT ANY WARRANTY; without even the implied warranty of >> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> +# GNU General Public License for more details. >> +# >> +# Author: Logan Gunthorpe <log...@deltatee.com> >> + >> +REMOTE_HOST= >> +LIST_DEVS=FALSE >> + >> +DEBUGFS=${DEBUGFS-/sys/kernel/debug} >> + >> +PERF_RUN_ORDER=32 >> +MAX_MW_SIZE=0 >> +RUN_DMA_TESTS= >> +DONT_CLEANUP= >> + >> +function show_help() >> +{ >> +echo "Usage: $0 [OPTIONS] LOCAL_DEV REMOTE_DEV" >> +echo "Run tests on a pair of NTB endpoints." >> +echo >> +echo "If the NTB device loops back to the same host then," >> +echo "just specifying the two PCI ids on the command line is" >> +echo "sufficient. Otherwise, if the NTB link spans two hosts" >> +echo "use the -r option to specify the hostname for the remote" >> +echo "device. SSH will then be used to test the remote side." >> +echo "An SSH key between the root users of the host would then" >> +echo "be highly recommended." >> +echo >> +echo "Options:" >> +echo " -C don't cleanup ntb modules on exit" >> +echo " -d run dma tests" >> +echo " -h show this help message" >> +echo " -l
Re: [PATCH 8/8] ntb_test: Add a selftest script for the NTB subsystem
On 14/06/16 08:16 AM, Shuah Khan wrote: > On 06/14/2016 08:06 AM, Jon Mason wrote: >> On Fri, Jun 10, 2016 at 6:54 PM, Logan Gunthorpe <log...@deltatee.com> wrote: >>> This script automates testing doorbells, scratchpads and memory windows >>> for an NTB device. It can be run locally, with the NTB looped >>> back to the same host or use SSH to remotely control the second host. >>> >>> In the single host case, the script just needs to be passed two >>> arguments: a PCI ID for each side of the link. In the two host case >>> the -r option must be used to specify the remote hostname (which must >>> be SSH accessible and should probably have ssh-keys exchanged). >> >> I appreciate the work that you are putting in here, but test shell >> scripts are not accepted into the kernel source. Yeah, I wasn't aware of this rule. I previously did some work on a very similar shell script for NVM fabrics. Though that hasn't made it to upstream yet. > I don't see any reason for this script to be not part of kernel selftests. > I think it will be a good addition. We probably don't want to include it in > the auto run of the selftest suite. > Jon! I you would like to take this script through your ntb tree, here is > my ack for the script for kselftest part. > > Acked-by: Shuah Khan <shua...@osg.samsung.com> > Thanks Shauh! >> I think a better place for this to be shared would be on the github >> account wiki, https://github.com/jonmason/ntb/wiki >> In fact, I'd really like for someone to add some pages there on using >> the ntb tools and testing. If you are willing, I'd be most >> appreciative. I can probably make some time for this later in the week. Logan
Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs
On 14/06/16 01:33 PM, Allen Hubbe wrote: >> diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c >> index cba31fd..9bebd0d 100644 >> --- a/drivers/ntb/test/ntb_tool.c >> +++ b/drivers/ntb/test/ntb_tool.c >> @@ -59,6 +59,13 @@ >> * >> * Eg: check if clearing the doorbell mask generates an interrupt. >> * >> + * # Check the link status >> + * root@self# cat $DBG_DIR/link >> + * >> + * # Block until the link is up >> + * root@self# echo Y > $DBG_DIR/link_event >> + * root@self# cat $DBG_DIR/link_event >> + * >> * # Set the doorbell mask >> * root@self# echo 's 1' > $DBG_DIR/mask >> * >> @@ -126,7 +133,9 @@ struct tool_ctx { >> struct dentry *dbgfs; >> struct work_struct link_cleanup; >> bool link_is_up; > > Really, link_is_up means "memory windows are configured." This comes from > your earlier patch that introduced memory windows to ntb_tool. Yes, this is technically true. However, I don't think the distinction is necessary. The user only really cares whether everything is up and usable -- not whether the link is just physically up or not. >> +bool link_event; >> struct delayed_work link_work; >> +wait_queue_head_t link_wq; >> int mw_count; >> struct tool_mw mws[MAX_MWS]; >> }; >> @@ -237,6 +246,7 @@ static void tool_link_work(struct work_struct *work) >> "Error setting up memory windows: %d\n", rc); >> >> tc->link_is_up = true; > > In other words, "memory windows are configured" = true. Technically, yes. >> +wake_up(>link_wq); >> } >> >> static void tool_link_cleanup(struct work_struct *work) >> @@ -246,6 +256,9 @@ static void tool_link_cleanup(struct work_struct *work) >> >> if (!tc->link_is_up) >> cancel_delayed_work_sync(>link_work); >> + >> +tc->link_is_up = false; > > If this was never set false anywhere in the patch that added memory windows, > I wonder if there is a bug. Yup, this looks like an oversight on my part. However, I don't think it resulted in any noticeable bug seeing, at the time, the only way to bring the link back down was to remove the module or the device. It is only strictly necessary now that we have the 'link' file which can control the link. >> +wake_up(>link_wq); >> } >> >> static void tool_link_event(void *ctx) >> @@ -578,6 +591,95 @@ static TOOL_FOPS_RDWR(tool_peer_spad_fops, >>tool_peer_spad_read, >>tool_peer_spad_write); >> >> +static ssize_t tool_link_read(struct file *filep, char __user *ubuf, >> + size_t size, loff_t *offp) >> +{ >> +struct tool_ctx *tc = filep->private_data; >> +char buf[3]; >> + >> +buf[0] = tc->link_is_up ? 'Y' : 'N'; > > I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb). I disagree. Bad things will happen if the user waits on the event and then immediately uses the memory windows. It will just be buggy and racy. I can't see a situation where the user would want to wait for the link to come up and not have everything in ntb_tool ready and usable. >> +buf[1] = '\n'; >> +buf[2] = '\0'; >> + >> +return simple_read_from_buffer(ubuf, size, offp, buf, 2); >> +} >> + >> +static ssize_t tool_link_write(struct file *filep, const char __user *ubuf, >> + size_t size, loff_t *offp) >> +{ >> +struct tool_ctx *tc = filep->private_data; >> +char buf[32]; >> +size_t buf_size; >> +bool val; >> +int rc; >> + >> +buf_size = min(size, (sizeof(buf) - 1)); >> +if (copy_from_user(buf, ubuf, buf_size)) >> +return -EFAULT; >> + >> +buf[buf_size] = '\0'; >> + >> +rc = strtobool(buf, ); >> +if (rc) >> +return rc; >> + >> +if (val) >> +ntb_link_enable(tc->ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO); >> +else >> +ntb_link_disable(tc->ntb); >> + >> +return size; >> +} >> + >> +static TOOL_FOPS_RDWR(tool_link_fops, >> + tool_link_read, >> + tool_link_write); >> + >> +static ssize_t tool_link_event_read(struct file *filep, char __user *ubuf, >> +size_t size, loff_t *offp) >> +{ >> +struct tool_ctx *tc = filep->private_data; >> +char buf[3]; >> + >> +if (wait_event_interruptible(tc->link_wq, >> + tc->link_is_up == tc->link_event)) > > I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb). See above. >> +return -ERESTART; >> + >> +buf[0] = tc->link_is_up ? 'Y' : 'N'; >> +buf[1] = '\n'; >> +buf[2] = '\0'; >> + >> +return simple_read_from_buffer(ubuf, size, offp, buf, 2); >> +} >> + >> +static ssize_t tool_link_event_write(struct file *filep, >> + const char __user *ubuf, >> + size_t size, loff_t *offp) >> +{ >> +struct tool_ctx *tc = filep->private_data; >> +char buf[32]; >> +size_t buf_size; >> +
Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs
On 15/06/16 10:02 AM, Allen Hubbe wrote: >> My understanding is that ntb_tool is really just a test client to verify >> the API and the hardware. I personally would not recommend it for any >> real applications. As such, I don't think this philosophical argument >> really matches that goal. > > The purpose is to "verify the API and the hardware", not to support "real > applications." > > The link status reported by the tool should be the link status reported by > "the API and the hardware," and not something else that might be convenient > for "my ntb_test script" or "anyone else trying to script with ntb_tool." > The primary purpose of ntb_tool is api-level debugging of hardware and > drivers, not scripting. > > The problem with races in ntb_tool is due to auto-configuration of memory > windows in ntb_tool. Instead of having ntb_tool setup the memory windows > automatically, maybe instead it should provide a file to control the memory > windows via debugfs. Reading the file can format what is returned by > ntb_mw_get_range(), and writing the file can allocate a buffer and call > ntb_mw_set_trans(), or ntb_mw_clear_trans() and free the buffer. Then, the > test script can wait for link up, then setup the memory windows, and then > finally proceed with the rest of the tests, and there would be no race. > There would be no confusion about what "link up" means, and ntb_tool would > more closely resemble the ntb.h api for memory windows. Ok, well this is a good deal more complicated than it is now but I can live with it. I'll work something up shortly. > That's interesting about the hardware. Maybe the driver for that particular > hardware should make sure that any translation register programming happens > before reporting link up to its clients. Otherwise, ntb_transport will be > broken on that hardware. The ntb_transport driver configures memory windows > the first time the link comes up, and only ever again if a different memory > window size is negotiated (unlikely). > > There are two reasons for doing the configuration after link up in > ntb_transport. First, it avoids consuming memory resources if the link never > comes up. Second, ntb_transport negotiates with its peer how much of the > memory window will actually be used. The ntb_perf tool is similar. Hmm, yes I didn't notice that in ntb_transport. We'll have to make some considerations for that in our driver. Logan
Re: [PATCH v2 6/8] ntb_tool: Add link status and files to debugfs
On 14/06/16 03:46 PM, Allen Hubbe wrote: > The ntb_tool is intended to be a simple low level access to the ntb.h api. > As much as possible, I think ntb_tool should directly expose the ntb.h api > through debugfs, and not invent higher level concepts. I really think practical concerns should override this. If we do it that way then my ntb_test script wouldn't necessarily work reliably and we'd just be asking for race conditions. (Especially if I moved the memory window tests earlier.) Anyone else trying to script with ntb_tool would run into the same problem. Additionally, the link is up _and_ the hardware is configured/usable isn't really that high level a concept or anything a user wouldn't expect already. My understanding is that ntb_tool is really just a test client to verify the API and the hardware. I personally would not recommend it for any real applications. As such, I don't think this philosophical argument really matches that goal. >>> If this was never set false anywhere in the patch that added memory >>> windows, I wonder if >> there is a bug. >> >> Yup, this looks like an oversight on my part. However, I don't think it >> resulted in any noticeable bug seeing, at the time, the only way to >> bring the link back down was to remove the module or the device. It is >> only strictly necessary now that we have the 'link' file which can >> control the link. > > Even without a file to control the link, any one side could be unloaded and > reloaded. That also affects the link state on the side that stays loaded. > The side that stays loaded still needs to be sane when the link comes back up. Yup, you're correct. If the other side of link goes down then tc->link_is_up would be incorrect. So, yes, there may be a corner case bug there. Though, seeing tc-link_is_up was only previously used to cancel potentially queued delayed work it's probably pretty minor. This was copied from ntb_perf which looks like it has the same issue. I'll make a patch for that in v3. >>> I think tc->link_is_up should instead be ntb_link_is_up(tc->ntb). >> >> I disagree. Bad things will happen if the user waits on the event and >> then immediately uses the memory windows. It will just be buggy and >> racy. I can't see a situation where the user would want to wait for the >> link to come up and not have everything in ntb_tool ready and usable. > > The memory windows can be configured prior to link up. They can be > configured when probing the device instead of waiting for link up. Doing > memory window configuration in probe would simplify the driver, and there > would be no race. I'm not sure this is true, especially considering all possible hardware. It's certainly not true with the hardware I'm working with and I'd assume that all the existing NTB clients configured their memory windows on link up and not in probe for a good reason. Logan
Re: ktime_get_ts64() splat during resume
Hey Rafael, This patch appears to be working on my laptop. Thanks. Logan On 20/06/16 07:22 PM, Rafael J. Wysocki wrote: On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote: On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <r...@rjwysocki.net> wrote: Overall, we seem to be heading towards the "really weird" territory here. So the whole commit that Boris bisected down to is weird to me. Why isn't the temporary text mapping just set up unconditionally in the temp_level4_pgt? Why does it have that insane "let's leave the temp_level4_pgt alone until we actually switch to it, and save away restore_pgd_addr and the restore_pgd, to then be set up at restore time"? All the other temporary mappings are set up statically in the temp_level4_pgt, why not that one? The text mapping in temp_level4_pgt has to map the image kernel's physical entry address to the same virtual address that the image kernel had for it, because the image kernel will switch over to its own page tables first and it will use its own kernel text mapping from that point on. That may not be the same as the text mapping of the (currently running) restore (or "boot") kernel. So if we set up the text mapping in temp_level4_pgt upfront, we basically can't reference the original kernel text (or do any addressing relative to it) any more after switching over to temp_level4_pgt. For some reason I thought that was not doable, but now that I look at the code it looks like it can be done. I'll try doing that. I recalled what the problem was. :-) In principle, the kernel text mapping in the image kernel may be different from the kernel text mapping in the restore ("boot") kernel, but the patch I posted a couple of hours ago actually assumed them to be the same, because it switched to temp_level4_pgt before jumping to the relocated code. To get rid of that implicit assumption it has to switch to temp_level4_pgt from the relocated code itself, but for that to work, the page containing the relocated code must be executable in the original page tables (it isn't usually). So updated patch is appended. Again, it works for me, but I'm wondering about everybody else. Thanks, Rafael --- From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the im
Re: [PATCH v3 10/10] ntb_perf: clear link_is_up flag when the link goes down.
Hey, Actually, I have to retract this patch. After some more thorough testing I'm finding an issue: When you remove and re-install the ntb_perf module very quickly, ntb_perf will occasionally miss the link up event. This is because the link_cleanup work gets delayed long enough that it gets scheduled after the link up event gets sent. It then cancels the link work that should have occurred. Without this patch, it never happens because link_is_up never returns to false. I think the correct solution is to just remove the link_cleanup work and do those actions immediately on receipt of the event. If there's agreement on this I can re-spin it again. Thanks, Logan On 15/06/16 03:33 PM, Jiang, Dave wrote: > On Wed, 2016-06-15 at 15:26 -0600, Logan Gunthorpe wrote: >> When the link goes down, the link_is_up flag did not return to >> false. This could have caused some subtle corner case bugs >> when the link goes up and down quickly. >> >> Signed-off-by: Logan Gunthorpe <log...@deltatee.com> > > Acked-by: Dave Jiang <dave.ji...@intel.com> > > And all the other ntb_perf patches since there were no additional > changes. > >> --- >> drivers/ntb/test/ntb_perf.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/ntb/test/ntb_perf.c >> b/drivers/ntb/test/ntb_perf.c >> index f0784e5..ae9d1b2 100644 >> --- a/drivers/ntb/test/ntb_perf.c >> +++ b/drivers/ntb/test/ntb_perf.c >> @@ -557,6 +557,8 @@ static void perf_link_cleanup(struct work_struct >> *work) >> >> if (!perf->link_is_up) >> cancel_delayed_work_sync(>link_work); >> + >> +perf->link_is_up = false; >> } >> >> static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf) >> -- >> 2.1.4
Re: [PATCH v3 09/10] ntb_test: Add a selftest script for the NTB subsystem
On 15/06/16 03:49 PM, Allen Hubbe wrote: >> +function link_test() >> +{ >> +LOC=$1 >> +REM=$2 >> +EXP=0 >> + >> +echo "Running link tests on: $(basename $LOC) / $(basename $REM)" >> + >> +write_file "N" "$LOC/link" >> +write_file "N" "$LOC/link_event" > > If it fails to bring down the link, won't it just block waiting on link_event > and never make it to the next step of the test? > >> +if [[ $(read_file "$REM/link") != "N" ]]; then >> +echo "Expected remote link to be down in $REM/link" >&2 >> +exit -1 >> +fi >> + >> +write_file "Y" "$LOC/link" >> +write_file "Y" "$LOC/link_event" >> + >> +echo " Passed" >> +} Well, the test is really intended to ensure both sides of the link see changes to the link status. If the driver is somehow buggy and the link never goes down/up when requested there's little I can do here except block forever. Unless we want to add a timeout to the link_event file (which I'd rather not). You'd have the same issue if, when bringing the link up for the first time, the link does not come back. Logan
Re: [PATCH v3 09/10] ntb_test: Add a selftest script for the NTB subsystem
On 15/06/16 04:17 PM, Allen Hubbe wrote: > This test should fail on Intel RP/TB topology (two cpu sharing one ntb). The > link state is the link state of the secondary side pcie bus connected to the > secondary side cpu. The link must be up in order for the secondary side cpu > to discover the ntb device, so the driver does not allow the link to be > disabled in such topology. Ok, I wasn't aware of this. But looking closer I think I have a better solution: ntb_link_disable should return -EINVAL if the hardware can't support bringing the link down. If I add the error check to my tool_link_write (which I neglected to do) then writing an "N" to $LOC/link will fail and the test could be skipped. I'll make a v4 spin. Logan > A simple thing to do here might be: > > write_file "N" "$LOC/link" > sleep 1 > read_file "$REM/link" > > You already have my Ack. This minor issue can be fixed later if anyone > cares. I don't think it is a big deal, just worth pointing out that the > script will hang here instead of report a failure. If it is worth fixing > later, at that point we might also want to change this script to continue > with other tests instead of exit on the first failure. >
Re: [PATCH v3 10/10] ntb_perf: clear link_is_up flag when the link goes down.
Hey, On 15/06/16 04:24 PM, Jiang, Dave wrote: > On Wed, 2016-06-15 at 16:20 -0600, Logan Gunthorpe wrote: >> Hey, >> >> Actually, I have to retract this patch. After some more thorough >> testing >> I'm finding an issue: >> >> When you remove and re-install the ntb_perf module very quickly, >> ntb_perf will occasionally miss the link up event. This is because >> the >> link_cleanup work gets delayed long enough that it gets scheduled >> after >> the link up event gets sent. It then cancels the link work that >> should >> have occurred. Without this patch, it never happens because >> link_is_up >> never returns to false. >> >> I think the correct solution is to just remove the link_cleanup work >> and >> do those actions immediately on receipt of the event. If there's >> agreement on this I can re-spin it again. > > I'm ok with that. This is not an issue with ntb_transport? Looks like I can get something similar to happen in ntb_transport. However, it's much rarer and takes significantly more tries to get it to occur. It does appear to correctly set its link_is_up to false when the link goes down. I'm not sure I'm quite clear on the flow in ntb_transport and don't have time right now to study it so I'll have to let that be someone else's (fairly minor) issue. Logan
Re: [PATCH v5 00/10] NTB Selftest Script
Great! Thanks Jon. Logan On 26/06/16 05:29 PM, Jon Mason wrote: > On Mon, Jun 20, 2016 at 01:15:03PM -0600, Logan Gunthorpe wrote: >> Sorry, I thought this was done but I found one more minor issue with >> these patches so I'm resubmitting them one last time. Besides this isuse, >> I think I have acks for all the patches and everything is working as I'd >> like. > > Applying patch 05/10 to my ntb branch (as it is a bug fix and should > go into 4.7) > > Applying the rest to my ntb-next branch (which should go into 4.8) > > Note: I'm including patch 09/10 in my tree, even though it is for the > selftest subsystem. Given that Shuah Khan acked it, I assume this is > the desired outcome. > > Thanks, > Jon > >> >> Changes since v4: >> >> 1) In patch 0006, when setting a translation fails in tool_setup_mw, >> we now properly clean up the peer dma memory. Before that patch, it >> wasn't required because if tool_setup_mw failed the module would clean >> up all mws. After this patch, the clean up never happened, so it would >> return an error to the user but the DMA memory would still be allocated >> and the peer_trans file would report that it's ready but the debugfs >> file would not have been created. In v5, after an error, no DMA memory >> is allocated and it will still report that it's not ready to the user when >> reading peer_trans. >> >> --- >> >> Changes since v3: >> >> 1) Check the error returned when setting the link in ntb_tool and pass >> it back to the user. >> >> 2) If an error occurs when setting the link down during a link test in >> ntb_test, just print unsupported and continue on. (For hardware that >> does not support setting the link down.) >> >> 3) Fix a race condition problem introduced by clearing the link is >> up flag in ntb_perf. We do this by getting rid of the link cleanup >> work and doing the few actions in the link event handler. >> Dave Jiang has already ok'd this approach. >> >> --- >> >> Changes since v2: >> >> 1) As per Allan's suggestion, I've added a patch to postpone the >> peer memory window setup until the user requests it with a 'peer_trans*' >> file. This pushes the ntb_tool API closer to the kernel API and allows >> the link status file to return the raw status of the NTB link. >> >> 2) Change the link status file to return the value of ntb_link_is_up >> instead of the local 'memory window ready' value. >> >> 3) Change the link_event file to just block on write. As it was in v2, >> if multiple users attempted to use the link_event file they could >> corrupt the state another user had set. Reads to this file are no >> longer permitted. >> >> 4) Updates to the ntb_test script to accommodate the above changes. >> >> 5) Added some link tests to the ntb_test script. It will bring the link >> down and check that the other side agrees. >> >> 6) Added a minor bug fix (Patch #10) to ntb_perf. During discussions >> with Allen it was noticed that the link_is_up flag is never cleared if >> the link goes away. >> >> --- >> >> Changes since v1: >> >> 1) Add a comment to explain the *15 in the buf size calculation, >> as per Allen's feedback. >> >> 2) Clean up the changes to the pingpong client as there were some >> sloppy copying mistakes. >> >> 3) Rework the 'link' file in ntb_tool as per Allen's suggestions. >> I've added a 'link_event' file the works essentially how he's asked. >> Though, I found no need to use a completion as suggested and the flow >> is maybe slightly simpler than he's suggested. Just write a boolean >> to the event file then read to wait for the link to be either up or >> down. There's still some discussion on the best interface and it's >> not much work to make additional minor functional changes. >> >> 4) Update the selftest script to use the new 'link_event' file. >> >> 5) Minor change to the way the selftest script lists devices thanks to >> Allen's observation. >> >> --- >> >> I've written a ntb_test.sh script that would probably be useful if it >> were included in the kernel. This series ends with that script and >> includes some useful interface improvements and fixes to the existing ntb >> test modules. Please see each individual commit for more information. >> They are mostly independent. >> >> The series is based off of v4.6 plus the patches I've submitted that >> have been accepted into ntb-next. They've been run through checkpatch >> with --strict t
Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
Hey, This version is working for me as well. Thanks. Logan On 27/06/16 08:24 AM, Rafael J. Wysocki wrote: On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <log...@deltatee.com> wrote: Hey Rafael, This patch appears to be working on my laptop. Thanks. Same for me: resume still works with KASLR in my tests too. Unfortunately, Boris still sees post-resume memory corruption with the patch you have tested, but that turns out to be a result of some super-weird corruption of a pointer on the stack which leads to a store at a wrong address (and there's no way it can be related to the rest of the patch). We have verified that it can be avoided by rearranging set_up_temporary_text_mapping() to use fewer local variables and the appended patch contains this modification. I also went on and changed relocate_restore_code() slightly in a similar fashion, but all of those changes should not affect the behavior (unless there's something insane going on behind the curtains, that is). Kees, Logan, Boris, please try this one and let me know if it works for you. Thanks, Rafael --- From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the image kernel's memory contents) that will map those addresses to each other as appropriate. This makes the concern about the possible corruption of the original boot kernel text mapping go away and if the the minimum kernel text mapping used for the final jump marks the image kernel's entry point memory as executable, the jump to it is guaraneed to succeed. Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) Link: http://marc.info/?l=linux-pm=146372852823760=2 Reported-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com> --- arch/x86/power/hibernate_64.c | 90 -- arch/x86/power/hibernate_asm_64.S | 55 ++- 2 files changed, 102 insertions(+), 43 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_ * kernel's te
Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
Hey, So, I've done a couple hundred hibernate resume cycles with v5 of the patch and so far it looks good. I'll let you know if I hit any bugs in more typical usage over the coming week. Thanks, Logan On 30/06/16 10:11 AM, Rafael J. Wysocki wrote: From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the image kernel's memory contents) that will map those addresses to each other as appropriate. This makes the concern about the possible corruption of the original boot kernel text mapping go away and if the the minimum kernel text mapping used for the final jump marks the image kernel's entry point memory as executable, the jump to it is guaraneed to succeed. Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) Link: http://marc.info/?l=linux-pm=146372852823760=2 Reported-by: Logan Gunthorpe <log...@deltatee.com> Tested-by: Kees Cook <keesc...@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com> --- v4 -> v5: Replace flush_tlb_all() in relocate_restore_code() with __flush_tlb_all() (which still should be sufficient) to avoid complaints from smp_call_function_many() about disabled interrupts. v3 -> v4: The new relocate_restore_code() didn't cover the case when the page containing the relocated code was mapped via a huge (1G) page and it didn't clear the NX bit in that case, which led to the page fault reported by Logan at http://marc.info/?l=linux-kernel=146725158621626=4. Fix that by making relocate_restore_code() handle the huge page mapping case too. I've retained the Tested-by: from Kees as these changes have nothing to do with whether or not KASLR works with hibernation after this patch. Logan, please test this one thoroughly. I'll give it at least one week in linux-next going forward. Boris, please test it on the machine where we saw memory corruption with the previous versions if poss. --- arch/x86/power/hibernate_64.c | 97 +- arch/x86/power/hibernate_asm_64.S | 55 + 2 files changed, 109 insertions(+), 43 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -19,6 +19,7 @@ #include #include #include +#include /* Defined in
Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
Hey Raf, Sorry to report that although the patch works the majority of the time, I just got a suspicious kernel panic during resume. It said: "kernel tried to execute NX protected page - exploit attempt? (uid: 0)" You can find a photo of the panic here: http://staff.deltatee.com/~logang/panic.jpg Logan On 29/06/16 08:48 AM, Kees Cook wrote: On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <log...@deltatee.com> wrote: Hey, This version is working for me as well. Thanks. Logan On 27/06/16 08:24 AM, Rafael J. Wysocki wrote: On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <log...@deltatee.com> wrote: Hey Rafael, This patch appears to be working on my laptop. Thanks. Same for me: resume still works with KASLR in my tests too. Unfortunately, Boris still sees post-resume memory corruption with the patch you have tested, but that turns out to be a result of some super-weird corruption of a pointer on the stack which leads to a store at a wrong address (and there's no way it can be related to the rest of the patch). We have verified that it can be avoided by rearranging set_up_temporary_text_mapping() to use fewer local variables and the appended patch contains this modification. I also went on and changed relocate_restore_code() slightly in a similar fashion, but all of those changes should not affect the behavior (unless there's something insane going on behind the curtains, that is). Kees, Logan, Boris, please try this one and let me know if it works for you. Tested-by: Kees Cook <keesc...@chromium.org> I've done a few KASLR boots, and everything continues to look good to me. Thanks! -Kees Thanks, Rafael --- From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the image kernel's memory contents) that will map those addresses to each other as appropriate. This makes the concern about the possible corruption of the original boot kernel text mapping go away and if the the minimum kernel text mapping used for the final jump marks the image kernel's entry point memory as executable, the jump to it is guaraneed to succeed. Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) Link: http://marc.info/?l=linux-pm=146372852823760=2 Reported-by: Lo
Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
On 29/06/16 08:55 PM, Rafael J. Wysocki wrote: The only thing that comes to mind at this point is that TLBs should be flushed after page tables changes, so please apply the appended and let me know if you see this panic any more with it. Ok, I'll build a new kernel tomorrow. But keep in mind the panic is pretty rare as I've only seen it once so far after a couple dozen or so hibernates. So it may be hard to get a concrete yes or no on whether it fixes the issue. I've got a script to run a bunch of hibernates in a row. I usually only run it for a handful of iterations, but I'll try running it for much longer with this patch and let you know in a couple days. Logan Thanks, Rafael --- arch/x86/power/hibernate_64.c | 92 +- arch/x86/power/hibernate_asm_64.S | 55 +- 2 files changed, 104 insertions(+), 43 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -19,6 +19,7 @@ #include #include #include +#include /* Defined in hibernate_asm_64.S */ extern asmlinkage __visible int restore_image(void); @@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_ * kernel's text (this value is passed in the image header). */ unsigned long restore_jump_address __visible; +unsigned long jump_address_phys; /* * Value of the cr3 register from before the hibernation (this value is passed @@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible; pgd_t *temp_level4_pgt __visible; -void *relocated_restore_code __visible; +unsigned long relocated_restore_code __visible; + +static int set_up_temporary_text_mapping(void) +{ + pmd_t *pmd; + pud_t *pud; + + /* +* The new mapping only has to cover the page containing the image +* kernel's entry point (jump_address_phys), because the switch over to +* it is carried out by relocated code running from a page allocated +* specifically for this purpose and covered by the identity mapping, so +* the temporary kernel text mapping is only needed for the final jump. +* Moreover, in that mapping the virtual address of the image kernel's +* entry point must be the same as its virtual address in the image +* kernel (restore_jump_address), so the image kernel's +* restore_registers() code doesn't find itself in a different area of +* the virtual address space after switching over to the original page +* tables used by the image kernel. +*/ + pud = (pud_t *)get_safe_page(GFP_ATOMIC); + if (!pud) + return -ENOMEM; + + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC); + if (!pmd) + return -ENOMEM; + + set_pmd(pmd + pmd_index(restore_jump_address), + __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC)); + set_pud(pud + pud_index(restore_jump_address), + __pud(__pa(pmd) | _KERNPG_TABLE)); + set_pgd(temp_level4_pgt + pgd_index(restore_jump_address), + __pgd(__pa(pud) | _KERNPG_TABLE)); + + return 0; +} static void *alloc_pgt_page(void *context) { @@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi if (!temp_level4_pgt) return -ENOMEM; - /* It is safe to reuse the original kernel mapping */ - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map), - init_level4_pgt[pgd_index(__START_KERNEL_map)]); + /* Prepare a temporary mapping for the kernel text */ + result = set_up_temporary_text_mapping(); + if (result) + return result; /* Set up the direct mapping from scratch */ for (i = 0; i < nr_pfn_mapped; i++) { @@ -78,19 +117,45 @@ static int set_up_temporary_mappings(voi return 0; } +static int relocate_restore_code(void) +{ + pgd_t *pgd; + pmd_t *pmd; + + relocated_restore_code = get_safe_page(GFP_ATOMIC); + if (!relocated_restore_code) + return -ENOMEM; + + memcpy((void *)relocated_restore_code, _restore_code, PAGE_SIZE); + + /* Make the page containing the relocated code executable */ + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code); + pmd = pmd_offset(pud_offset(pgd, relocated_restore_code), +relocated_restore_code); + if (pmd_large(*pmd)) { + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX)); + } else { + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code); + + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX)); + } + flush_tlb_all(); + + return 0; +} + int swsusp_arch_resume(void) { int error; /* We have got enough memory and from now on we cannot recover */ - if ((error =
Re: [PATCH v2 0/4] Style fixes: open code obfuscating macros
Cool, Thanks. Let me know if you need anything from me. Logan On 01/02/17 01:08 PM, Jon Mason wrote: > On Thu, Jan 26, 2017 at 3:00 PM, Logan Gunthorpe <log...@deltatee.com> wrote: >> Hi, >> >> It's been a couple weeks... Any thoughts on this? > > My apologies for not responding sooner. A large rework of the core > NTB code was done prior to your patches (for IDT NTB support). > Unfortunately, after those changes this series does not apply cleanly. > I understand the changes being made, and will fix-up the patches to > make it apply cleanly on the updated code base. > > Thanks, > Jon > >> >> Thanks, >> >> Logan >> >> On 10/01/17 05:33 PM, Logan Gunthorpe wrote: >>> Hi, >>> >>> Here's an updated version of the style fixes patchset. The differences >>> from v1 are: >>> >>> 1) Rebased onto Jon Mason's current NTB branch >>> >>> 2) Added two more patches to convert the print lines to use the ntb >>> device and not the pci one. This seems more sane, shortens a bunch >>> of lines and removes the need for temporaries. >>> >>> Note with (2): I've tried my best to ensure that statements that print >>> before the ntb device is registered still use the PCI device. However, >>> someone should probably review and test this as I don't have access >>> to all types of hardware to do that. >>> >>> Thanks, >>> >>> Logan >>> >>> >>> >>> Logan Gunthorpe (4): >>> ntb_hw_amd: Style fixes: open code macros that just obfuscate code >>> ntb_hw_intel: Style fixes: open code macros that just obfuscate code >>> ntb_hw_amd: Print kernel messages with the ntb device not the pci one >>> ntb_hw_intel: Print kernel messages with the ntb device not the pci >>> one >>> >>> drivers/ntb/hw/amd/ntb_hw_amd.c | 60 +-- >>> drivers/ntb/hw/amd/ntb_hw_amd.h | 3 - >>> drivers/ntb/hw/intel/ntb_hw_intel.c | 192 >>> +--- >>> drivers/ntb/hw/intel/ntb_hw_intel.h | 3 - >>> 4 files changed, 124 insertions(+), 134 deletions(-) >>> >>> -- >>> 2.1.4 >>> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "linux-ntb" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to linux-ntb+unsubscr...@googlegroups.com. >> To post to this group, send email to linux-...@googlegroups.com. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/linux-ntb/5ed3be6c-c94e-1ade-e4cd-a8e8293296cd%40deltatee.com. >> For more options, visit https://groups.google.com/d/optout.
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
On 31/01/17 10:26 AM, Greg Kroah-Hartman wrote: > That's one big patch to review, would you want to do that? Sorry, will do. > Can you break it up into smaller parts? At least put the documentation > separately, right? Ha, funny. Last time I sent a patch someone asked for the documentation to be in the same patch. But I can easily split this up. > And don't dump a .txt file into Documentation/ anymore, people are > working to move to the newer format. Fair. I wasn't sure where a good place to put it was. Any suggestions? > Also, please rebase against Linus's tree at the least, we can't go back > in time and apply this to the 4.9 kernel tree. Will do. > Why a .h file for a single .c file? I wanted to keep the hardware defining structs and macros in a separate file for future expansion. This hardware is also capable of some NTB functions which may find it's way into the kernel in the future. > Also, why a whole new directory? We didn't feel it fit in the pci director which was for standard pci stuff. We're more than open to other suggestions as to where this code belongs. Thanks, Logan
[PATCH 1/1] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already well supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint with a separate PCI function address and class code. This endpoint enables some additional functionality which includes: * Packet and Byte Counters * Switch Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides PCI driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. A couple of special IOCTLs are provided to: * Inform userspace of firmware partition locations * Pass event counts and allow userspace to wait on events A short text file is provided which documents the switchtec driver, outlines the semantics of using the char device and describes the IOCTLs. The device also exposes a few read-only sysfs attributes which provide some device information component names and versions which is provided by the hardware. These are documented in Documentation/ABI/testing/sysfs-class-switchtec A userspace tool and library which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. The tool is largely complete at this time but additional features may be added in the future. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1320 +++ drivers/pci/switch/switchtec.h | 266 + include/uapi/linux/switchtec_ioctl.h| 129 +++ 11 files changed, 1919 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 drivers/pci/switch/switchtec.h create mode 100644 include/uapi/linux/switchtec_ioctl.h diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec new file mode 100644 index 000..4fbc417 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -0,0 +1,96 @@ +switchtec - Microsemi Switchtec PCI Switch Management Endpoint + +For details on this subsystem look at Documentation/switchtec.txt. + +What: /sys/class/switchtec +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: The switchtec class subsystem folder. + Each registered switchtec driver is represented by a switchtecX + subfolder (X being an integer >= 0). + + +What: /sys/class/switchtec/switchtec[0-9]+/component_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component identifier as stored in the hardware (eg. PM4385) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component revision stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/device_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Device version as stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/fw_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Currently running firmware version (read only) +Values:integer (in hexadecimal). + + +What: /sys/class/sw
[PATCH 0/1] DRAFT: New Microsemi PCI Switch Management Driver
Hi, This is a continuation of the RFC we posted lasted month [1] which proposes a management driver for Microsemi's Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform To make this entirely clear: the Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations. This is a driver for that function. See the patch for more information. Since the RFC, we've made the changes requested by Greg Kroah-Hartman and Keith Busch, and we've also fleshed out a number of features. We've added a couple of IOCTLs and sysfs attributes which are documented in the patch. Significant work has also been done on the userspace tool which is available under a GPL license at [2]. We've also had testing done by some of the interested parties. We hope to see this work included in either 4.11 or 4.12 assuming a smooth review process. The patch is based off of the v4.9 release. Thanks for your review, Logan [1] https://www.spinics.net/lists/linux-pci/msg56897.html [2] https://github.com/sbates130272/switchtec-user -- Changes since RFC: * Fixed incorrect use of the drive model as pointed out by Greg Kroah-Hartman * Used devm functions as suggested by Keith Busch * Added a handful of sysfs attributes to the switchtec class * Added a handful of IOCTLs to the switchtec device * A number of miscelaneous bug fixes -- Logan Gunthorpe (1): MicroSemi Switchtec management interface driver Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1320 +++ drivers/pci/switch/switchtec.h | 266 + include/uapi/linux/switchtec_ioctl.h| 129 +++ 11 files changed, 1919 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 drivers/pci/switch/switchtec.h create mode 100644 include/uapi/linux/switchtec_ioctl.h -- 2.1.4
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
On 31/01/17 11:57 AM, Greg Kroah-Hartman wrote: > Sorry, it was probably me :) Nope, it was Christoph Hellwig. I don't mind changing it. It's just hard to know what's expected all the time. > Why do you need this? Wherever you put it, it should be built as part > of the online kernel documentation. Who is the audience for this > documentation? Well usually documenting how to use the device is a good thing. Is it not? The audience would be people wanting to make use of the userspace interface. Though in fairness, the vast majority of people should use our library. I also thought it would be useful to reviewers to more quickly understand the interface we are creating. If maintainers want us to leave it out, we can, but that seems like an odd request. > Do future stuff in the future, no need for that now, right? > > Simple is best. Ok, I can push it all into the C file for v2. > I'll leave that up to the PCI maintainer, but just a single .c file in a > subdir seems odd to me. Thanks, Logan
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
On 31/01/17 10:49 AM, Jonathan Corbet wrote: > The good news is that your switchtec.txt file is already 99% in the RST > format, so there is little or nothing to do there. > > The bad news is that we don't quite have a place for it yet. This is > really user-space developer documentation, and we don't have a sub-book > for that. I expect that to change pretty soon, and I might even toss > together a bare beginning for 4.11, but that hasn't happened yet. > > If you would like to learn the ropes and make one, that would be more than > great :) Alternatively, I'd suggest just leaving the document as-is, and > I'll put it toward the top of the list of things to move into place once I > get that book started. Thanks Jon. I took a look at the documentation and it doesn't look too difficult. But I don't really feel qualified to make some of the decisions necessary to create the new book. (ie names, locations, etc.) And I don't really want that churn in this patchset. So we'll leave it where it is for now. I have cleaned up some of the RST format for v2 though. Logan
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
On 01/02/17 05:10 AM, Emil Velikov wrote: > You can keep it roughly as-is if you're ~reasonably certain one won't > change it in the future. I've made the change anyway. I think it's better now. > Some teams frown upon adding new IOCTL(s) where existing ones can be > made backward/forward compatible. > I'm not fully aware of the general direction/consensus on the topic, > so it might be a minority. Sure, I just don't know what might be needed in the future so it's hard to add a version or flags ioctl now. > On the other hand, reading through sysfs for module version in order > to use IOCTL A or B sounds quite hacky. Do you have an example where > this is used or pointed out as good approach ? I don't know of anything doing it that way now. But it sure would be easy and make a bit of sense. (We'd actually use the module version for something useful.) Either way, it would really depend on if and how things change in the future. The point is there are options to expand if needed. > Afaict the idea is to not ship/bundle/release userspace until kernel > parts are in. > The "do not commit the changes" is implied as [very rarely] distros > package from "random" git checkouts. Leading to all sorts of fun when > it is mismatched wrt the kernel parts. Likelihood of doing that here > is virtually none here, so this is a JFYI inspired by some past > experiences. Understood. > Glad to hear. Then again you already had most of the things nicely done, imho. Great, thanks. Logan
[PATCH v2 4/4] switchtec: Add IOCTLs to the Switchtec driver
This patch introduces a couple of special IOCTLs which are provided to: * Inform userspace of firmware partition locations * Pass event counts and allow userspace to wait on events * Translate between PFF numbers used by the switch to port numbers. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/switchtec.txt | 27 ++ MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 467 +++ include/uapi/linux/switchtec_ioctl.h | 132 ++ 5 files changed, 628 insertions(+) create mode 100644 include/uapi/linux/switchtec_ioctl.h diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 81c7f2b..032b33c 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -191,6 +191,7 @@ Code Seq#(hex) Include FileComments 'W'00-1F linux/watchdog.hconflict! 'W'00-1F linux/wanrouter.h conflict! (pre 3.9) 'W'00-3F sound/asound.h conflict! +'W'40-5F drivers/pci/switch/switchtec.c 'X'all fs/xfs/xfs_fs.h conflict! and fs/xfs/linux-2.6/xfs_ioctl32.h and include/linux/falloc.h diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt index 4bced4c..a0a9c7b 100644 --- a/Documentation/switchtec.txt +++ b/Documentation/switchtec.txt @@ -51,3 +51,30 @@ The char device has the following semantics: * The poll call will also be supported for userspace applications that need to do other things while waiting for the command to complete. + +The following IOCTLs are also supported by the device: + +* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number + of partitions in the device. + +* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for + any specified partition in flash. + +* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps + indicating all uncleared events. + +* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags + for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct + with the event_id, index and flags set (index being the partition or PFF + number for non-global events). It returns whether the event has + occurred, the number of times and any event specific data. The flags + can be used to clear the count or enable and disable actions to + happen when the event occurs. + By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, + you can set an event to trigger a poll command to return with + POLLPRI. In this way, userspace can wait for events to occur. + +* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert + between PCI Function Framework number (used by the event system) + and Switchtec Logic Port ID and Partition number (which is more + user friendly). diff --git a/MAINTAINERS b/MAINTAINERS index d39b7fd..ea461d1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9516,6 +9516,7 @@ S:Maintained F: Documentation/switchtec.txt F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* +F: include/uapi/linux/switchtec_ioctl.h PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index 354ddfd..82bfd18 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -13,6 +13,8 @@ * */ +#include + #include #include #include @@ -755,6 +757,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, poll_table *wait) return ret; } +static int ioctl_flash_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_info __user *uinfo) +{ + struct switchtec_ioctl_flash_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + + info.flash_length = ioread32(>flash_length); + info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS; + + if (copy_to_user(uinfo, , sizeof(info))) + return -EFAULT; + + return 0; +} + +static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info, +struct partition_info __iomem *pi) +{ + info->address = ioread32(>address); + info->length = ioread32(>length); +} + +static int ioctl_flash_part_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_part_info __user *uinfo) +{ + struct switchtec_ioctl_flash_part_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + u32 active_addr = -1; + + if (copy_from_user(, uinfo, sizeof(info))) + return -EFAULT; + + switch (info.flash_partition) { + case SWITCHTEC_IOCTL
[PATCH v2 2/4] switchtec: Add user interface documentation
This adds standard documentation for the sysfs switchtec attributes and a RST formatted text file which documents the char device interface. Jonathan Corbet has indicated he will move this to a new user-space developer documentation book once it's created. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- Documentation/switchtec.txt | 53 + MAINTAINERS | 1 + 2 files changed, 54 insertions(+) create mode 100644 Documentation/switchtec.txt diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt new file mode 100644 index 000..4bced4c --- /dev/null +++ b/Documentation/switchtec.txt @@ -0,0 +1,53 @@ + +Linux Switchtec Support + + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + +* Packet and Byte Counters +* Firmware Upgrades +* Event and Error logs +* Querying port link status +* Custom user firmware commands + +The switchtec kernel module implements this functionality. + + +Interface += + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4 bytes return code and up to 1KB of command specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + +* A write must consist of at least 4 bytes and no more than 1028 bytes. + The first four bytes will be interpreted as the command to run and + the remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + +* Each write must be followed by exactly one read. Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + +* A read will block until the firmware completes the command and return + the four bytes of status plus up to 1024 bytes of output data. (The + length will be specified by the size parameter of the read call -- + reading less than 4 bytes will produce an error. + +* The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. diff --git a/MAINTAINERS b/MAINTAINERS index 9c35198..0ab858d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9513,6 +9513,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com> M: Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained +F: Documentation/switchtec.txt F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA -- 2.1.4
[PATCH v2 3/4] switchtec: Add sysfs attributes to the Switchtec driver
This patch adds a few read-only sysfs attributes which provide some device information that is exposed from the devices. Primarily component and device names and versions. These are documented in Documentation/ABI/testing/sysfs-class-switchtec. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- Documentation/ABI/testing/sysfs-class-switchtec | 96 MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 113 3 files changed, 210 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec new file mode 100644 index 000..48cb4c1 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -0,0 +1,96 @@ +switchtec - Microsemi Switchtec PCI Switch Management Endpoint + +For details on this subsystem look at Documentation/switchtec.txt. + +What: /sys/class/switchtec +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: The switchtec class subsystem folder. + Each registered switchtec driver is represented by a switchtecX + subfolder (X being an integer >= 0). + + +What: /sys/class/switchtec/switchtec[0-9]+/component_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component identifier as stored in the hardware (eg. PM8543) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component revision stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/device_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Device version as stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/fw_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Currently running firmware version (read only) +Values:integer (in hexadecimal). + + +What: /sys/class/switchtec/switchtec[0-9]+/partition +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Partition number for this device in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/partition_count +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Total number of partitions in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product identifier as stored in the hardware (eg. PSX 48XG3) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product revision stored in the hardware (eg. RevB) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. diff --git a/MAINTAINERS b/MAINTAINERS index 0ab858d..d39b7fd 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9514,6 +9514,7 @@ M:Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained F: Documentation/switchtec.txt +F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index e4a4294..354ddfd 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -485,6 +485,118 @@ static void mrpc
[PATCH v2 1/4] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already well supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint with a separate PCI function address and class code. This endpoint enables some additional functionality which includes: * Packet and Byte Counters * Switch Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides PCI driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. A userspace tool and library which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. The tool is largely complete at this time but additional features may be added in the future. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- MAINTAINERS|8 + drivers/pci/Kconfig|1 + drivers/pci/Makefile |1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile|1 + drivers/pci/switch/switchtec.c | 1028 6 files changed, 1052 insertions(+) create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c diff --git a/MAINTAINERS b/MAINTAINERS index 5f10c28..9c35198 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9507,6 +9507,14 @@ S: Maintained F: Documentation/devicetree/bindings/pci/aardvark-pci.txt F: drivers/pci/host/pci-aardvark.c +PCI DRIVER FOR MICROSEMI SWITCHTEC +M: Kurt Schwemmer <kurt.schwem...@microsemi.com> +M: Stephen Bates <stephen.ba...@microsemi.com> +M: Logan Gunthorpe <log...@deltatee.com> +L: linux-...@vger.kernel.org +S: Maintained +F: drivers/pci/switch/switchtec* + PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> L: linux-te...@vger.kernel.org diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 6555eb7..f72e8c5 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -133,3 +133,4 @@ config PCI_HYPERV source "drivers/pci/hotplug/Kconfig" source "drivers/pci/host/Kconfig" +source "drivers/pci/switch/Kconfig" diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 8db5079..15b46dd 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG # PCI host controller drivers obj-y += host/ +obj-y += switch/ diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig new file mode 100644 index 000..4c49648 --- /dev/null +++ b/drivers/pci/switch/Kconfig @@ -0,0 +1,13 @@ +menu "PCI switch controller drivers" + depends on PCI + +config PCI_SW_SWITCHTEC + tristate "MicroSemi Switchtec PCIe Switch Management Driver" + help +Enables support for the management interface for the MicroSemi +Switchtec series of PCIe switches. Supports userspace access +to submit MRPC commands to the switch via /dev/switchtecX +devices. See for more +information. + +endmenu diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile new file mode 100644 index 000..37d8cfb --- /dev/null +++ b/drivers/pci/switch/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c new file mode 100644 index 000..e4a4294 --- /dev/null +++ b/drivers/pci/switch/switchtec.c @@ -0,0 +1,1028 @@ +/* + * Microsemi Switchtec(tm) PCIe Management Driver + * Copyright (c) 2017, Microsemi Corporation + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver"); +MODULE_VERSION("0.1"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Microsemi Corporation"); + +static int max_devices = 16; +module_param(max_devices, int, 0644); +MODULE_PARM_DESC(max_devices, "max number of switchtec device instances"); + +static dev_t switchtec_devt; +static stru
[PATCH v2 0/4] New Microsemi PCI Switch Management Driver
Changes since v1: * Rebased onto 4.10-rc6 (cleanly) * Split the patch into a few more easily digestible patches (as suggested by Greg Kroah-Hartman) * Folded switchtec.c into switchtec.h (per Greg) * Fixed a bunch of 32bit build warnings caught by the kbuild test robot * Fixed some issues in the documentation so it has a proper reStructredText format (as noted by Jonathan Corbet) * Fixed padding and sizes in the IOCTL structures as noticed by Emil Velikov and used pahole to verify their consistency across 32 and 64 bit builds * Reworked one of the IOCTL interfaces to be more future proof (per Emil). Changes since RFC: * Fixed incorrect use of the drive model as pointed out by Greg Kroah-Hartman * Used devm functions as suggested by Keith Busch * Added a handful of sysfs attributes to the switchtec class * Added a handful of IOCTLs to the switchtec device * A number of miscellaneous bug fixes -- Hi, This is a continuation of the RFC we posted lasted month [1] which proposes a management driver for Microsemi's Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform To make this entirely clear: the Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations. This is a driver for that function. See the patch for more information. Since the RFC, we've made the changes requested by Greg Kroah-Hartman and Keith Busch, and we've also fleshed out a number of features. We've added a couple of IOCTLs and sysfs attributes which are documented in the patch. Significant work has also been done on the userspace tool which is available under a GPL license at [2]. We've also had testing done by some of the interested parties. We hope to see this work included in either 4.11 or 4.12 assuming a smooth review process. The patch is based off of the v4.10-rc6 release. Thanks for your review, Logan [1] https://www.spinics.net/lists/linux-pci/msg56897.html [2] https://github.com/sbates130272/switchtec-user -- Logan Gunthorpe (4): MicroSemi Switchtec management interface driver switchtec: Add user interface documentation switchtec: Add sysfs attributes to the Switchtec driver switchtec: Add IOCTLs to the Switchtec driver Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1608 +++ include/uapi/linux/switchtec_ioctl.h| 132 ++ 10 files changed, 1944 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 include/uapi/linux/switchtec_ioctl.h -- 2.1.4
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
Hi Emil, Thanks for the feedback. On 31/01/17 01:48 PM, Emil Velikov wrote: >> +struct switchtec_ioctl_fw_info { >> + __u32 flash_length; >> + >> + struct { >> + __u32 address; >> + __u32 length; >> + __u32 active; > Something to keep in mind, not sure how likely it is here: > If you embed structs (partition in this case), you will not be able to > extend it [the embedded one] it in the future without the risk of ABI > breakage. Based on this feedback, I'll rework the fw_info IOCTL so the user only requests one partition at a time. This will get rid of the embedded struct and any concerns about size changes. I was originally trying to make the ioctl RO to make it simpler but that maybe isn't the case. >> + } partition[SWITCHTEC_IOCTL_NUM_PARTITIONS]; >> +}; >> + > Afaict this and most/all other structs [in this patch] are in > violation of Prerequisites#2 and/or #3. > Personally I find pahole quite useful - reference all the UABI structs > in a dummy userspace app, compile with gcc -g -O0 and observe the > output across 32 and 64bit builds. Members _must_ be at the same > offset, otherwise things will be broken with 64bit kernel on a 32bit > userspace. Something which people will eventually end up trying/using, > even if you don't plan to support it. I've made some changes to the padding and sizes to accommodate this. I'll also do some testing with pahole before v2. > Please check that Basics (#2 and #5 in particular) are also covered. > then you need a driver feature flag or revision number somewhere. We don't have this specifically. A new version ioctl could be added later when and if changes occur; or the userspace code could easily read the module version through sysfs and we could increment that with ioctl changes. > A couple more generic suggestions, which I may have noticed while > skimming through: > - please ensure that all the input is properly sanitised, before, > going into the actual implementation > Afaict having the separate step/separation helps clarity and reduces > chances of you/others getting it wrong down the line. The inputs are indeed properly checked in the code. Most of the ioctls that take inputs are so simple it's hard to separate the two steps. ie. Verifying that the input is right pretty much gives you the answer you need to send back to userspace. > - user provided storage must not be changed when the ioctl fails. This should already be true. > Additionally you want to provide a reference to open-source userspace > [alongside acknowledgement of the maintainers/co-developers about the > design] which makes use of said IOCTL(s). But please, do _not_ merge > userspace code before the kernel work has landed. I'm having trouble understanding what you're asking here. We provided a reference to the userspace code in the commit message. Currently the userspace code is completely useless without the kernel module and it is entirely independent of other projects. I'm also the only committer so far on the userspace code. So I assume this only applies if we were merging changes to other existing code? > Hope that provided you with sufficient good points wrt IOCTL design. Thanks, this was very helpful. Logan
Re: [PATCH v2 0/4] Style fixes: open code obfuscating macros
Hi, It's been a couple weeks... Any thoughts on this? Thanks, Logan On 10/01/17 05:33 PM, Logan Gunthorpe wrote: > Hi, > > Here's an updated version of the style fixes patchset. The differences > from v1 are: > > 1) Rebased onto Jon Mason's current NTB branch > > 2) Added two more patches to convert the print lines to use the ntb > device and not the pci one. This seems more sane, shortens a bunch > of lines and removes the need for temporaries. > > Note with (2): I've tried my best to ensure that statements that print > before the ntb device is registered still use the PCI device. However, > someone should probably review and test this as I don't have access > to all types of hardware to do that. > > Thanks, > > Logan > > > > Logan Gunthorpe (4): > ntb_hw_amd: Style fixes: open code macros that just obfuscate code > ntb_hw_intel: Style fixes: open code macros that just obfuscate code > ntb_hw_amd: Print kernel messages with the ntb device not the pci one > ntb_hw_intel: Print kernel messages with the ntb device not the pci > one > > drivers/ntb/hw/amd/ntb_hw_amd.c | 60 +-- > drivers/ntb/hw/amd/ntb_hw_amd.h | 3 - > drivers/ntb/hw/intel/ntb_hw_intel.c | 192 > +--- > drivers/ntb/hw/intel/ntb_hw_intel.h | 3 - > 4 files changed, 124 insertions(+), 134 deletions(-) > > -- > 2.1.4 >
Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver
Hi Bjorn, Can you give us an idea of when you might be able to comment on our patchset? We've addressed all the outstanding issues and have a couple of reviewed and tested tags. So we'd like to see this move forward as soon as possible. I can do a respin with the tags collected or address any concerns you may have, just please let us know. Thanks, Logan On 02/02/17 11:05 AM, Logan Gunthorpe wrote: > Changes since v1: > > * Rebased onto 4.10-rc6 (cleanly) > * Split the patch into a few more easily digestible patches (as > suggested by Greg Kroah-Hartman) > * Folded switchtec.c into switchtec.h (per Greg) > * Fixed a bunch of 32bit build warnings caught by the kbuild test robot > * Fixed some issues in the documentation so it has a proper > reStructredText format (as noted by Jonathan Corbet) > * Fixed padding and sizes in the IOCTL structures as noticed by Emil > Velikov and used pahole to verify their consistency across 32 and 64 > bit builds > * Reworked one of the IOCTL interfaces to be more future proof (per > Emil). > > Changes since RFC: > > * Fixed incorrect use of the drive model as pointed out by Greg > Kroah-Hartman > * Used devm functions as suggested by Keith Busch > * Added a handful of sysfs attributes to the switchtec class > * Added a handful of IOCTLs to the switchtec device > * A number of miscellaneous bug fixes > > -- > > Hi, > > This is a continuation of the RFC we posted lasted month [1] which > proposes a management driver for Microsemi's Switchtec line of PCI > switches. This hardware is still looking to be used in the Open > Compute Platform > > To make this entirely clear: the Switchtec products are compliant > with the PCI specifications and are supported today with the standard > in-kernel driver. However, these devices also expose a management endpoint > on a separate PCI function address which can be used to perform some > advanced operations. This is a driver for that function. See the patch > for more information. > > Since the RFC, we've made the changes requested by Greg Kroah-Hartman > and Keith Busch, and we've also fleshed out a number of features. We've > added a couple of IOCTLs and sysfs attributes which are documented in > the patch. Significant work has also been done on the userspace tool > which is available under a GPL license at [2]. We've also had testing > done by some of the interested parties. > > We hope to see this work included in either 4.11 or 4.12 assuming a > smooth review process. > > The patch is based off of the v4.10-rc6 release. > > Thanks for your review, > > Logan > > [1] https://www.spinics.net/lists/linux-pci/msg56897.html > [2] https://github.com/sbates130272/switchtec-user > > -- > > Logan Gunthorpe (4): > MicroSemi Switchtec management interface driver > switchtec: Add user interface documentation > switchtec: Add sysfs attributes to the Switchtec driver > switchtec: Add IOCTLs to the Switchtec driver > > Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ > Documentation/ioctl/ioctl-number.txt|1 + > Documentation/switchtec.txt | 80 ++ > MAINTAINERS | 11 + > drivers/pci/Kconfig |1 + > drivers/pci/Makefile|1 + > drivers/pci/switch/Kconfig | 13 + > drivers/pci/switch/Makefile |1 + > drivers/pci/switch/switchtec.c | 1608 > +++ > include/uapi/linux/switchtec_ioctl.h| 132 ++ > 10 files changed, 1944 insertions(+) > create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec > create mode 100644 Documentation/switchtec.txt > create mode 100644 drivers/pci/switch/Kconfig > create mode 100644 drivers/pci/switch/Makefile > create mode 100644 drivers/pci/switch/switchtec.c > create mode 100644 include/uapi/linux/switchtec_ioctl.h > > -- > 2.1.4 >
Re: [PATCH v2 3/4] switchtec: Add sysfs attributes to the Switchtec driver
On 23/02/17 03:43 PM, Bjorn Helgaas wrote: > This path seems a little generic. I don't see other cases where a > product brand name ("Switchtec") appears at the top level of > /sys/class/... Ok, well we are certainly open to suggestions, but there isn't really a generic version of this device available so I'm not sure how we would change that. Per device-type classes aren't that uncommon though, a quick grep shows things like: platform/chrome/cros_ec_dev.c:40:static struct class cros_class s390/char/raw3270.h:94:extern struct class *class3270; net/ethernet/hisilicon/hns/hnae.c:19:static struct class *hnae_class; mfd/ucb1x00-core.c:490:static struct class ucb1x00_class > My question is based on "ls Documentation/ABI/testing/sysfs-class*", > not on any great knowledge of sysfs, and I see Greg has already given > a Reviewed-by for this, so maybe this is the right approach. > > It does seem like the path could include a clue that this is related > to PCI. I mean, we could change it to pci-switchtec or something if you think that would be better..?? But I'm not sure how else to accommodate this. > Is there a link to the switch PCI device itself, e.g., to > /sys/devices/pci*? Should these attributes simply be put in a > subdirectory there, e.g., in > > /sys/devices/pci:00/:00:00.0/stats/... Well our device shows up here in the tree: /sys/devices/pci:00/:00:03.0/:03:00.1/switchtec/switchtec0 (Which userspace can get to by following the link at /sys/class/switchtec/switchtec0) The switch is then always: /sys/devices/pci:00/:00:03.0 Thanks, Logan
Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver
Thanks Bjorn! I understand your busy and we appreciate your time in this matter. I'll prepare a v3 with a collected set of tags shortly. We're more than happy to clean this up to make your job as easy as possible. We were just looking for direction in how to move this forward. Logan On 23/02/17 03:14 PM, Bjorn Helgaas wrote: > On Thu, Feb 23, 2017 at 01:36:51PM -0700, Logan Gunthorpe wrote: >> Hello, >> >> We're still waiting on any kind of response from Bjorn. (If you're >> listening please say something!) >> >> Does anyone have any suggestions for dealing with an unresponsive >> maintainer? Or a way for us to move forward with this quickly and get it >> merged? > > I try to deal with regressions first and other bug fixes second. > After that, I look at things that add new functionality. I try to > look at the new stuff in roughly chronological order, as you would see > here: > > https://patchwork.ozlabs.org/project/linux-pci/list/?order=date=1 > > If other folks have feedback, as they did on your 12/17, 1/31, and > even the 2/2 posting, I generally let that get sorted out before I > look at it. I apologize that I haven't responded to your queries > about posting a v3 vs updating v2. > > To answer that question, it's much simpler for me to deal with a > fresh, clean new series than it is to tweak things in an > already-posted series, partly because a series with discussion other > than simple acks and reviewed-bys looks more like work-in-progress. > > Bjorn >
Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver
On 23/02/17 01:51 PM, Sinan Kaya wrote: > You'll want to be careful during the merge window (these days) as the > maintainer is usually busy with code delivery. You can't rush your code in at > the last minute. Thanks for the advice, we will continue to wait. However, I would say we've been very patient. It's been three weeks since we posted the latest revision, a month since the first version and almost 3 months since our RFC. I don't think it's too much to expect at least a response saying that it's in the works or something. That long with dead silence from the maintainer is a bit much. Logan
Re: [PATCH v2 0/4] New Microsemi PCI Switch Management Driver
Hello, We're still waiting on any kind of response from Bjorn. (If you're listening please say something!) Does anyone have any suggestions for dealing with an unresponsive maintainer? Or a way for us to move forward with this quickly and get it merged? ie. Can anyone else pick this up through another route? In the end, it's just a fairly basic driver and doesn't touch any core PCI functionality and we've had a fair amount of review from other kernel contributors, all of which we've addressed. Thanks, Logan On 17/02/17 01:36 PM, Logan Gunthorpe wrote: > Hi Bjorn, > > Can you give us an idea of when you might be able to comment on our > patchset? We've addressed all the outstanding issues and have a couple > of reviewed and tested tags. So we'd like to see this move forward as > soon as possible. > > I can do a respin with the tags collected or address any concerns you > may have, just please let us know. > > Thanks, > > Logan > > On 02/02/17 11:05 AM, Logan Gunthorpe wrote: >> Changes since v1: >> >> * Rebased onto 4.10-rc6 (cleanly) >> * Split the patch into a few more easily digestible patches (as >> suggested by Greg Kroah-Hartman) >> * Folded switchtec.c into switchtec.h (per Greg) >> * Fixed a bunch of 32bit build warnings caught by the kbuild test robot >> * Fixed some issues in the documentation so it has a proper >> reStructredText format (as noted by Jonathan Corbet) >> * Fixed padding and sizes in the IOCTL structures as noticed by Emil >> Velikov and used pahole to verify their consistency across 32 and 64 >> bit builds >> * Reworked one of the IOCTL interfaces to be more future proof (per >> Emil). >> >> Changes since RFC: >> >> * Fixed incorrect use of the drive model as pointed out by Greg >> Kroah-Hartman >> * Used devm functions as suggested by Keith Busch >> * Added a handful of sysfs attributes to the switchtec class >> * Added a handful of IOCTLs to the switchtec device >> * A number of miscellaneous bug fixes >> >> -- >> >> Hi, >> >> This is a continuation of the RFC we posted lasted month [1] which >> proposes a management driver for Microsemi's Switchtec line of PCI >> switches. This hardware is still looking to be used in the Open >> Compute Platform >> >> To make this entirely clear: the Switchtec products are compliant >> with the PCI specifications and are supported today with the standard >> in-kernel driver. However, these devices also expose a management endpoint >> on a separate PCI function address which can be used to perform some >> advanced operations. This is a driver for that function. See the patch >> for more information. >> >> Since the RFC, we've made the changes requested by Greg Kroah-Hartman >> and Keith Busch, and we've also fleshed out a number of features. We've >> added a couple of IOCTLs and sysfs attributes which are documented in >> the patch. Significant work has also been done on the userspace tool >> which is available under a GPL license at [2]. We've also had testing >> done by some of the interested parties. >> >> We hope to see this work included in either 4.11 or 4.12 assuming a >> smooth review process. >> >> The patch is based off of the v4.10-rc6 release. >> >> Thanks for your review, >> >> Logan >> >> [1] https://www.spinics.net/lists/linux-pci/msg56897.html >> [2] https://github.com/sbates130272/switchtec-user >> >> -- >> >> Logan Gunthorpe (4): >> MicroSemi Switchtec management interface driver >> switchtec: Add user interface documentation >> switchtec: Add sysfs attributes to the Switchtec driver >> switchtec: Add IOCTLs to the Switchtec driver >> >> Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ >> Documentation/ioctl/ioctl-number.txt|1 + >> Documentation/switchtec.txt | 80 ++ >> MAINTAINERS | 11 + >> drivers/pci/Kconfig |1 + >> drivers/pci/Makefile|1 + >> drivers/pci/switch/Kconfig | 13 + >> drivers/pci/switch/Makefile |1 + >> drivers/pci/switch/switchtec.c | 1608 >> +++ >> include/uapi/linux/switchtec_ioctl.h| 132 ++ >> 10 files changed, 1944 insertions(+) >> create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec >> create mode 100644 Documentation/switchtec.txt >> create mode 100644 drivers/pci/switch/Kconfig >> create mode 100644 drivers/pci/switch/Makefile >> create mode 100644 drivers/pci/switch/switchtec.c >> create mode 100644 include/uapi/linux/switchtec_ioctl.h >> >> -- >> 2.1.4 >>
[PATCH v3 4/4] switchtec: Add IOCTLs to the Switchtec driver
This patch introduces a couple of special IOCTLs which are provided to: * Inform userspace of firmware partition locations * Pass event counts and allow userspace to wait on events * Translate between PFF numbers used by the switch to port numbers. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/switchtec.txt | 27 ++ MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 467 +++ include/uapi/linux/switchtec_ioctl.h | 132 ++ 5 files changed, 628 insertions(+) create mode 100644 include/uapi/linux/switchtec_ioctl.h diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 81c7f2b..032b33c 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -191,6 +191,7 @@ Code Seq#(hex) Include FileComments 'W'00-1F linux/watchdog.hconflict! 'W'00-1F linux/wanrouter.h conflict! (pre 3.9) 'W'00-3F sound/asound.h conflict! +'W'40-5F drivers/pci/switch/switchtec.c 'X'all fs/xfs/xfs_fs.h conflict! and fs/xfs/linux-2.6/xfs_ioctl32.h and include/linux/falloc.h diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt index 4bced4c..a0a9c7b 100644 --- a/Documentation/switchtec.txt +++ b/Documentation/switchtec.txt @@ -51,3 +51,30 @@ The char device has the following semantics: * The poll call will also be supported for userspace applications that need to do other things while waiting for the command to complete. + +The following IOCTLs are also supported by the device: + +* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number + of partitions in the device. + +* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for + any specified partition in flash. + +* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps + indicating all uncleared events. + +* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags + for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct + with the event_id, index and flags set (index being the partition or PFF + number for non-global events). It returns whether the event has + occurred, the number of times and any event specific data. The flags + can be used to clear the count or enable and disable actions to + happen when the event occurs. + By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, + you can set an event to trigger a poll command to return with + POLLPRI. In this way, userspace can wait for events to occur. + +* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert + between PCI Function Framework number (used by the event system) + and Switchtec Logic Port ID and Partition number (which is more + user friendly). diff --git a/MAINTAINERS b/MAINTAINERS index 6fed938..c1a9a30 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9515,6 +9515,7 @@ S:Maintained F: Documentation/switchtec.txt F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* +F: include/uapi/linux/switchtec_ioctl.h PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index c09d335..04d35f4 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -13,6 +13,8 @@ * */ +#include + #include #include #include @@ -755,6 +757,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, poll_table *wait) return ret; } +static int ioctl_flash_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_info __user *uinfo) +{ + struct switchtec_ioctl_flash_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + + info.flash_length = ioread32(>flash_length); + info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS; + + if (copy_to_user(uinfo, , sizeof(info))) + return -EFAULT; + + return 0; +} + +static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info, +struct partition_info __iomem *pi) +{ + info->address = ioread32(>address); + info->length = ioread32(>length); +} + +static int ioctl_flash_part_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_part_info __user *uinfo) +{ + struct switchtec_ioctl_flash_part_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + u32 active_addr = -1; + + if (copy
[PATCH v3 0/4] New Microsemi PCI Switch Management Driver
Changes since v2: * Collected reviewed and tested tags * Very minor fix to the error path in the create function Changes since v1: * Rebased onto 4.10-rc6 (cleanly) * Split the patch into a few more easily digestible patches (as suggested by Greg Kroah-Hartman) * Folded switchtec.c into switchtec.h (per Greg) * Fixed a bunch of 32bit build warnings caught by the kbuild test robot * Fixed some issues in the documentation so it has a proper reStructredText format (as noted by Jonathan Corbet) * Fixed padding and sizes in the IOCTL structures as noticed by Emil Velikov and used pahole to verify their consistency across 32 and 64 bit builds * Reworked one of the IOCTL interfaces to be more future proof (per Emil). Changes since RFC: * Fixed incorrect use of the drive model as pointed out by Greg Kroah-Hartman * Used devm functions as suggested by Keith Busch * Added a handful of sysfs attributes to the switchtec class * Added a handful of IOCTLs to the switchtec device * A number of miscellaneous bug fixes -- Hi, This is a continuation of the RFC we posted lasted month [1] which proposes a management driver for Microsemi's Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform To make this entirely clear: the Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations. This is a driver for that function. See the patch for more information. Since the RFC, we've made the changes requested by Greg Kroah-Hartman and Keith Busch, and we've also fleshed out a number of features. We've added a couple of IOCTLs and sysfs attributes which are documented in the patch. Significant work has also been done on the userspace tool which is available under a GPL license at [2]. We've also had testing done by some of the interested parties. We hope to see this work included in either 4.11 or 4.12 assuming a smooth review process. The patch is based off of the v4.10-rc6 release. Thanks for your review, Logan [1] https://www.spinics.net/lists/linux-pci/msg56897.html [2] https://github.com/sbates130272/switchtec-user -- Logan Gunthorpe (4): MicroSemi Switchtec management interface driver switchtec: Add user interface documentation switchtec: Add sysfs attributes to the Switchtec driver switchtec: Add IOCTLs to the Switchtec driver Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1611 +++ include/uapi/linux/switchtec_ioctl.h| 132 ++ 10 files changed, 1947 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 include/uapi/linux/switchtec_ioctl.h -- 2.1.4
[PATCH v3 3/4] switchtec: Add sysfs attributes to the Switchtec driver
This patch adds a few read-only sysfs attributes which provide some device information that is exposed from the devices. Primarily component and device names and versions. These are documented in Documentation/ABI/testing/sysfs-class-switchtec. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- Documentation/ABI/testing/sysfs-class-switchtec | 96 MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 113 3 files changed, 210 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec new file mode 100644 index 000..48cb4c1 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -0,0 +1,96 @@ +switchtec - Microsemi Switchtec PCI Switch Management Endpoint + +For details on this subsystem look at Documentation/switchtec.txt. + +What: /sys/class/switchtec +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: The switchtec class subsystem folder. + Each registered switchtec driver is represented by a switchtecX + subfolder (X being an integer >= 0). + + +What: /sys/class/switchtec/switchtec[0-9]+/component_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component identifier as stored in the hardware (eg. PM8543) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component revision stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/device_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Device version as stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/fw_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Currently running firmware version (read only) +Values:integer (in hexadecimal). + + +What: /sys/class/switchtec/switchtec[0-9]+/partition +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Partition number for this device in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/partition_count +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Total number of partitions in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product identifier as stored in the hardware (eg. PSX 48XG3) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product revision stored in the hardware (eg. RevB) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. diff --git a/MAINTAINERS b/MAINTAINERS index aa702b0..6fed938 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained F: Documentation/switchtec.txt +F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA diff --gi
[PATCH v3 1/4] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already well supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint with a separate PCI function address and class code. This endpoint enables some additional functionality which includes: * Packet and Byte Counters * Switch Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides PCI driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. A userspace tool and library which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. The tool is largely complete at this time but additional features may be added in the future. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- MAINTAINERS|8 + drivers/pci/Kconfig|1 + drivers/pci/Makefile |1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile|1 + drivers/pci/switch/switchtec.c | 1031 6 files changed, 1055 insertions(+) create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c diff --git a/MAINTAINERS b/MAINTAINERS index 527d137..a57686f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9506,6 +9506,14 @@ S: Maintained F: Documentation/devicetree/bindings/pci/aardvark-pci.txt F: drivers/pci/host/pci-aardvark.c +PCI DRIVER FOR MICROSEMI SWITCHTEC +M: Kurt Schwemmer <kurt.schwem...@microsemi.com> +M: Stephen Bates <stephen.ba...@microsemi.com> +M: Logan Gunthorpe <log...@deltatee.com> +L: linux-...@vger.kernel.org +S: Maintained +F: drivers/pci/switch/switchtec* + PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> L: linux-te...@vger.kernel.org diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 6555eb7..f72e8c5 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -133,3 +133,4 @@ config PCI_HYPERV source "drivers/pci/hotplug/Kconfig" source "drivers/pci/host/Kconfig" +source "drivers/pci/switch/Kconfig" diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 8db5079..15b46dd 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG # PCI host controller drivers obj-y += host/ +obj-y += switch/ diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig new file mode 100644 index 000..4c49648 --- /dev/null +++ b/drivers/pci/switch/Kconfig @@ -0,0 +1,13 @@ +menu "PCI switch controller drivers" + depends on PCI + +config PCI_SW_SWITCHTEC + tristate "MicroSemi Switchtec PCIe Switch Management Driver" + help +Enables support for the management interface for the MicroSemi +Switchtec series of PCIe switches. Supports userspace access +to submit MRPC commands to the switch via /dev/switchtecX +devices. See for more +information. + +endmenu diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile new file mode 100644 index 000..37d8cfb --- /dev/null +++ b/drivers/pci/switch/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c new file mode 100644 index 000..f6bcff1 --- /dev/null +++ b/drivers/pci/switch/switchtec.c @@ -0,0 +1,1031 @@ +/* + * Microsemi Switchtec(tm) PCIe Management Driver + * Copyright (c) 2017, Microsemi Corporation + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver"); +MODULE_VERSION("0.1"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Microsemi Cor
[PATCH v3 2/4] switchtec: Add user interface documentation
This adds standard documentation for the sysfs switchtec attributes and a RST formatted text file which documents the char device interface. Jonathan Corbet has indicated he will move this to a new user-space developer documentation book once it's created. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/switchtec.txt | 53 + MAINTAINERS | 1 + 2 files changed, 54 insertions(+) create mode 100644 Documentation/switchtec.txt diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt new file mode 100644 index 000..4bced4c --- /dev/null +++ b/Documentation/switchtec.txt @@ -0,0 +1,53 @@ + +Linux Switchtec Support + + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + +* Packet and Byte Counters +* Firmware Upgrades +* Event and Error logs +* Querying port link status +* Custom user firmware commands + +The switchtec kernel module implements this functionality. + + +Interface += + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4 bytes return code and up to 1KB of command specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + +* A write must consist of at least 4 bytes and no more than 1028 bytes. + The first four bytes will be interpreted as the command to run and + the remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + +* Each write must be followed by exactly one read. Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + +* A read will block until the firmware completes the command and return + the four bytes of status plus up to 1024 bytes of output data. (The + length will be specified by the size parameter of the read call -- + reading less than 4 bytes will produce an error. + +* The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. diff --git a/MAINTAINERS b/MAINTAINERS index a57686f..aa702b0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com> M: Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained +F: Documentation/switchtec.txt F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA -- 2.1.4
Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver
Hey Bjorn, Thanks for the thorough review. It definitely helped a lot to make the code as good as it can be. I've made all of the changes you suggested. I'm just going to do a bit more testing and then post a v4. See my responses to all of your feedback bellow. Logan On 23/02/17 05:35 PM, Bjorn Helgaas wrote: > Would it make any sense to integrate this with the perf tool? It > looks like switchtec-user measures bandwidth and latency, which sounds > sort of perf-ish. That would be cool but lets file that under potential future work. This driver also enables more than bandwidth and latency so it's still required for us. >> +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCI-E Management Driver"); > > Nit: s/PCI-E/PCIe/ > Fixed. >> +MODULE_VERSION("0.1"); >> +MODULE_LICENSE("GPL"); >> +MODULE_AUTHOR("Microsemi Corporation"); >> + >> +static int max_devices = 16; > > This static initialization is slightly misleading because > switchtec_init() immediately sets max_devices to at least 256. Oops copied that from dax without thinking. I've just removed the max() call in the init function. >> +atomic_t event_cnt; > > Why is this atomic_t? You only do an atomic_set() (in stdev_create()) > and atomic_reads() -- no writes other than the initial one. So I'm > not sure what value atomic_t brings. If you looked at a following patch in the series you'd see that there's an atomic_inc in the ISR. The splitting of the series wasn't as precise as it maybe should have been and thus this became a bit confusing. >> +memcpy_fromio(stuser->data, >mmio_mrpc->output_data, >> + sizeof(stuser->data)); > > Apparently you always get 1K of data back, no matter what? Yes, sort of unfortunately. Because these transactions can occur before the user actually does the read, we don't know how much data the user wants. This may be a performance improvement in the future (ie. if the user reads before the MRPC transaction comes back, then only read the requested amount). But we will leave it this way for now. >> +if (!stdev_is_alive(stdev)) >> +return -ENXIO; > > What exactly does this protect against? Is it OK if stdev_is_alive() > becomes false immediately after we check? Yup, you're right: that's racy. Turns out cleanup is hard and I've learned a lot even in the short time since I wrote this code. I've gotten rid of stdev_is_alive and moved the pci cleanup code into the device's release function so they won't occur until the last user closes the cdev. I think that should work better but please let me know if you see an issue with this. >> + >> +if (size < sizeof(stuser->cmd) || >> +size > sizeof(stuser->cmd) + SWITCHTEC_MRPC_PAYLOAD_SIZE) > > I think I would use "sizeof(stuser->data)" instead of > SWITCHTEC_MRPC_PAYLOAD_SIZE so it matches what mrpc_complete_cmd() > does. Same below in switchtec_dev_read(). Ok. > mrpc_mutex doesn't protect the stuser things, does it? If not, you > could do this without the gotos. And I think it's a little easier to > be confident there are no buffer overruns: I was using the mutex to protect stuser->state as well. But I've made your changes and just adjusted stuser->state to be atomic and I think this should be just as good. >> +static int switchtec_init_msix_isr(struct switchtec_dev *stdev) >> +{ >> +struct pci_dev *pdev = stdev->pdev; >> +int rc, i, msix_count, node; >> + >> +node = dev_to_node(>dev); > > Unused? Yup, I've removed it. >> +for (i = 0; i < ARRAY_SIZE(stdev->msix); ++i) >> +stdev->msix[i].entry = i; >> + >> +msix_count = pci_enable_msix_range(pdev, stdev->msix, 1, >> + ARRAY_SIZE(stdev->msix)); >> +if (msix_count < 0) >> +return msix_count; > > Maybe this could be simplified by using pci_alloc_irq_vectors()? Yup! I wasn't aware of that function. It's a _huge_ improvement. Thanks. Still I'd really appreciate a quick re-review after I post v4 just to make sure I got it correct. >> +stdev->event_irq = ioread32(>mmio_part_cfg->vep_vector_number); >> +if (stdev->event_irq < 0 || stdev->event_irq >= msix_count) { >> +rc = -EFAULT; >> +goto err_msix_request; >> +} > > Not sure what this is doing, but you immediately overwrite > stdev->event_irq below. If you're using it as just a temporary above > for doing some validation, I would just use a local variable to avoid > the connection with stdev->event_irq. Yes, I made this temporary. >> +stdev->event_irq = stdev->msix[stdev->event_irq].vector; > > Oh, I guess you allocate several MSI-X vectors, but you only actually > use one of them? Why is that? I was confused about why you allocate > several vectors, but there's only one devm_request_irq() call below. The event_irq is configurable in hardware and won't necessarily be the first irq available. (It should always be in the first four.) As I understand it, we need to allocate all of
[PATCH v4 4/4] switchtec: Add IOCTLs to the Switchtec driver
This patch introduces a couple of special IOCTLs which are provided to: * Inform userspace of firmware partition locations * Pass event counts and allow userspace to wait on events * Translate between PFF numbers used by the switch to port numbers. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/switchtec.txt | 27 ++ MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 467 +++ include/uapi/linux/switchtec_ioctl.h | 132 ++ 5 files changed, 628 insertions(+) create mode 100644 include/uapi/linux/switchtec_ioctl.h diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 81c7f2b..032b33c 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -191,6 +191,7 @@ Code Seq#(hex) Include FileComments 'W'00-1F linux/watchdog.hconflict! 'W'00-1F linux/wanrouter.h conflict! (pre 3.9) 'W'00-3F sound/asound.h conflict! +'W'40-5F drivers/pci/switch/switchtec.c 'X'all fs/xfs/xfs_fs.h conflict! and fs/xfs/linux-2.6/xfs_ioctl32.h and include/linux/falloc.h diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt index 4bced4c..a0a9c7b 100644 --- a/Documentation/switchtec.txt +++ b/Documentation/switchtec.txt @@ -51,3 +51,30 @@ The char device has the following semantics: * The poll call will also be supported for userspace applications that need to do other things while waiting for the command to complete. + +The following IOCTLs are also supported by the device: + +* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number + of partitions in the device. + +* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for + any specified partition in flash. + +* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps + indicating all uncleared events. + +* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags + for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct + with the event_id, index and flags set (index being the partition or PFF + number for non-global events). It returns whether the event has + occurred, the number of times and any event specific data. The flags + can be used to clear the count or enable and disable actions to + happen when the event occurs. + By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, + you can set an event to trigger a poll command to return with + POLLPRI. In this way, userspace can wait for events to occur. + +* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert + between PCI Function Framework number (used by the event system) + and Switchtec Logic Port ID and Partition number (which is more + user friendly). diff --git a/MAINTAINERS b/MAINTAINERS index 6fed938..c1a9a30 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9515,6 +9515,7 @@ S:Maintained F: Documentation/switchtec.txt F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* +F: include/uapi/linux/switchtec_ioctl.h PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index 2e3a45b..18286d3 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -13,6 +13,8 @@ * */ +#include + #include #include #include @@ -717,6 +719,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, poll_table *wait) return ret; } +static int ioctl_flash_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_info __user *uinfo) +{ + struct switchtec_ioctl_flash_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + + info.flash_length = ioread32(>flash_length); + info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS; + + if (copy_to_user(uinfo, , sizeof(info))) + return -EFAULT; + + return 0; +} + +static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info, +struct partition_info __iomem *pi) +{ + info->address = ioread32(>address); + info->length = ioread32(>length); +} + +static int ioctl_flash_part_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_part_info __user *uinfo) +{ + struct switchtec_ioctl_flash_part_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + u32 active_addr = -1; + + if (copy
[PATCH v4 0/4] New Microsemi PCI Switch Management Driver
Changes since v3: * Removed stdev_is_alive and moved pci deinit functions into the device release function which only occurs after all cdev instances are closed. (per Bjorn) * Reduced locking in read/write functions (per Bjorn) * Use pci_alloc_irq_vectors to vastly improve ISR initialization (Bjorn) * A few other minor nits and cleanup as noticed by Bjorn Changes since v2: * Collected reviewed and tested tags * Very minor fix to the error path in the create function Changes since v1: * Rebased onto 4.10-rc6 (cleanly) * Split the patch into a few more easily digestible patches (as suggested by Greg Kroah-Hartman) * Folded switchtec.c into switchtec.h (per Greg) * Fixed a bunch of 32bit build warnings caught by the kbuild test robot * Fixed some issues in the documentation so it has a proper reStructredText format (as noted by Jonathan Corbet) * Fixed padding and sizes in the IOCTL structures as noticed by Emil Velikov and used pahole to verify their consistency across 32 and 64 bit builds * Reworked one of the IOCTL interfaces to be more future proof (per Emil). Changes since RFC: * Fixed incorrect use of the drive model as pointed out by Greg Kroah-Hartman * Used devm functions as suggested by Keith Busch * Added a handful of sysfs attributes to the switchtec class * Added a handful of IOCTLs to the switchtec device * A number of miscellaneous bug fixes -- Hi, This is a continuation of the RFC we posted lasted month [1] which proposes a management driver for Microsemi's Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform To make this entirely clear: the Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations. This is a driver for that function. See the patch for more information. Since the RFC, we've made the changes requested by Greg Kroah-Hartman and Keith Busch, and we've also fleshed out a number of features. We've added a couple of IOCTLs and sysfs attributes which are documented in the patch. Significant work has also been done on the userspace tool which is available under a GPL license at [2]. We've also had testing done by some of the interested parties. We hope to see this work included in either 4.11 or 4.12 assuming a smooth review process. The patch is based off of the v4.10-rc6 release. Thanks for your review, Logan [1] https://www.spinics.net/lists/linux-pci/msg56897.html [2] https://github.com/sbates130272/switchtec-user Logan Gunthorpe (4): MicroSemi Switchtec management interface driver switchtec: Add user interface documentation switchtec: Add sysfs attributes to the Switchtec driver switchtec: Add IOCTLs to the Switchtec driver Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1503 +++ include/uapi/linux/switchtec_ioctl.h| 132 ++ 10 files changed, 1839 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 include/uapi/linux/switchtec_ioctl.h -- 2.1.4
[PATCH v4 3/4] switchtec: Add sysfs attributes to the Switchtec driver
This patch adds a few read-only sysfs attributes which provide some device information that is exposed from the devices. Primarily component and device names and versions. These are documented in Documentation/ABI/testing/sysfs-class-switchtec. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- Documentation/ABI/testing/sysfs-class-switchtec | 96 MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 113 3 files changed, 210 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec new file mode 100644 index 000..48cb4c1 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -0,0 +1,96 @@ +switchtec - Microsemi Switchtec PCI Switch Management Endpoint + +For details on this subsystem look at Documentation/switchtec.txt. + +What: /sys/class/switchtec +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: The switchtec class subsystem folder. + Each registered switchtec driver is represented by a switchtecX + subfolder (X being an integer >= 0). + + +What: /sys/class/switchtec/switchtec[0-9]+/component_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component identifier as stored in the hardware (eg. PM8543) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component revision stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/device_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Device version as stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/fw_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Currently running firmware version (read only) +Values:integer (in hexadecimal). + + +What: /sys/class/switchtec/switchtec[0-9]+/partition +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Partition number for this device in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/partition_count +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Total number of partitions in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product identifier as stored in the hardware (eg. PSX 48XG3) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product revision stored in the hardware (eg. RevB) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. diff --git a/MAINTAINERS b/MAINTAINERS index aa702b0..6fed938 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained F: Documentation/switchtec.txt +F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA diff --gi
[PATCH v4 1/4] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already well supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint with a separate PCI function address and class code. This endpoint enables some additional functionality which includes: * Packet and Byte Counters * Switch Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides PCI driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. A userspace tool and library which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. The tool is largely complete at this time but additional features may be added in the future. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- MAINTAINERS| 8 + drivers/pci/Kconfig| 1 + drivers/pci/Makefile | 1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile| 1 + drivers/pci/switch/switchtec.c | 923 + 6 files changed, 947 insertions(+) create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c diff --git a/MAINTAINERS b/MAINTAINERS index 527d137..a57686f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9506,6 +9506,14 @@ S: Maintained F: Documentation/devicetree/bindings/pci/aardvark-pci.txt F: drivers/pci/host/pci-aardvark.c +PCI DRIVER FOR MICROSEMI SWITCHTEC +M: Kurt Schwemmer <kurt.schwem...@microsemi.com> +M: Stephen Bates <stephen.ba...@microsemi.com> +M: Logan Gunthorpe <log...@deltatee.com> +L: linux-...@vger.kernel.org +S: Maintained +F: drivers/pci/switch/switchtec* + PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> L: linux-te...@vger.kernel.org diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 6555eb7..f72e8c5 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -133,3 +133,4 @@ config PCI_HYPERV source "drivers/pci/hotplug/Kconfig" source "drivers/pci/host/Kconfig" +source "drivers/pci/switch/Kconfig" diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 8db5079..15b46dd 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG # PCI host controller drivers obj-y += host/ +obj-y += switch/ diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig new file mode 100644 index 000..4c49648 --- /dev/null +++ b/drivers/pci/switch/Kconfig @@ -0,0 +1,13 @@ +menu "PCI switch controller drivers" + depends on PCI + +config PCI_SW_SWITCHTEC + tristate "MicroSemi Switchtec PCIe Switch Management Driver" + help +Enables support for the management interface for the MicroSemi +Switchtec series of PCIe switches. Supports userspace access +to submit MRPC commands to the switch via /dev/switchtecX +devices. See for more +information. + +endmenu diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile new file mode 100644 index 000..37d8cfb --- /dev/null +++ b/drivers/pci/switch/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c new file mode 100644 index 000..f5813e0 --- /dev/null +++ b/drivers/pci/switch/switchtec.c @@ -0,0 +1,923 @@ +/* + * Microsemi Switchtec(tm) PCIe Management Driver + * Copyright (c) 2017, Microsemi Corporation + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCIe Management Driver"); +MODULE_VERSION("0.1"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Microsemi Corporation"
[PATCH v4 2/4] switchtec: Add user interface documentation
This adds standard documentation for the sysfs switchtec attributes and a RST formatted text file which documents the char device interface. Jonathan Corbet has indicated he will move this to a new user-space developer documentation book once it's created. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/switchtec.txt | 53 + MAINTAINERS | 1 + 2 files changed, 54 insertions(+) create mode 100644 Documentation/switchtec.txt diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt new file mode 100644 index 000..4bced4c --- /dev/null +++ b/Documentation/switchtec.txt @@ -0,0 +1,53 @@ + +Linux Switchtec Support + + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + +* Packet and Byte Counters +* Firmware Upgrades +* Event and Error logs +* Querying port link status +* Custom user firmware commands + +The switchtec kernel module implements this functionality. + + +Interface += + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4 bytes return code and up to 1KB of command specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + +* A write must consist of at least 4 bytes and no more than 1028 bytes. + The first four bytes will be interpreted as the command to run and + the remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + +* Each write must be followed by exactly one read. Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + +* A read will block until the firmware completes the command and return + the four bytes of status plus up to 1024 bytes of output data. (The + length will be specified by the size parameter of the read call -- + reading less than 4 bytes will produce an error. + +* The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. diff --git a/MAINTAINERS b/MAINTAINERS index a57686f..aa702b0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com> M: Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained +F: Documentation/switchtec.txt F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA -- 2.1.4
Re: [PATCH] switchtec: cleanup cdev init
Hi, Please don't apply this patch and instead apply the switchtec driver as we submitted in v2. As per the discussion in [1], not using the cdev's kobj parent results in incorrect reference counting and a possible use of the cdev after its containing structure is freed. I've also done a quick review around the kernel and found the pattern in question to be quite prevalent. It's used in, at least, these drivers: drivers/dax/dax.c:708 drivers/input/evdev.c:1419 drivers/input/joydev.c:908 drivers/input/mousedev.c:904 drivers/gpio/gpiolib.c:1039 drivers/char/tpm/tpm-chip.c:190 drivers/platform/chrome/cros_ec_dev.c:411 drivers/infiniband/hw/hfi1/device.c:72 drivers/infiniband/core/uverbs_main.c:1168 drivers/infiniband/core/user_mad.c:1186 drivers/infiniband/core/user_mad.c:1205 drivers/iio/industrialio-core.c:1721 drivers/media/cec/cec-core.c:140 drivers/media/media-devnode.c:258 Dan Williams has proposed a helper API in [2] that could make this code appear significantly less suspect which I think would be a good idea. However, for now, I feel the switchtec code should also follow this pattern (ie. the way it was submitted in v2) and can be changed or cleaned up when/if the above numerous examples are fixed up. Thanks, Logan [1] https://lkml.org/lkml/2017/2/10/589 [2] https://lkml.org/lkml/2017/2/13/700 On 10/02/17 10:57 AM, Logan Gunthorpe wrote: > Per a suggestion from Greg Kroah-Hartman [1], don't set the cdev's > kobject parent. This allows us to use device_register instead of init > and add. > > [1] https://lkml.org/lkml/2017/2/10/370 > --- > drivers/pci/switch/switchtec.c | 15 +++ > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c > index 82bfd18..014eaec 100644 > --- a/drivers/pci/switch/switchtec.c > +++ b/drivers/pci/switch/switchtec.c > @@ -1222,24 +1222,23 @@ static struct switchtec_dev *stdev_create(struct > pci_dev *pdev) > return ERR_PTR(minor); > > dev = >dev; > - device_initialize(dev); > dev->devt = MKDEV(MAJOR(switchtec_devt), minor); > - dev->class = switchtec_class; > - dev->parent = >dev; > - dev->groups = switchtec_device_groups; > - dev->release = stdev_release; > - dev_set_name(dev, "switchtec%d", minor); > > cdev = >cdev; > cdev_init(cdev, _fops); > cdev->owner = THIS_MODULE; > - cdev->kobj.parent = >kobj; > > rc = cdev_add(>cdev, dev->devt, 1); > if (rc) > goto err_cdev; > > - rc = device_add(dev); > + dev->class = switchtec_class; > + dev->parent = >dev; > + dev->groups = switchtec_device_groups; > + dev->release = stdev_release; > + dev_set_name(dev, "switchtec%d", minor); > + > + rc = device_register(dev); > if (rc) { > cdev_del(>cdev); > put_device(dev); >
Re: [PATCH] switchtec: cleanup cdev init
On 19/02/17 02:43 PM, Dan Williams wrote: > Is this race present for all file operations? I've only seen it with > mmap() and late faults. So if these other drivers do not support mmap > it's not clear they can trigger the failure. I don't see how it's related to mmap only. I think there's a number of variations on this but the race I see happens if you try to take a reference to the device in the open/close handlers of a char device file. If a driver puts the last remaining reference in the release handler, then the device structure will be freed, and on returning from the release handler, the char_dev code will try to put the last reference to the cdev structure which may have already been free'd. There needs to be a way for the cdev structure to take a reference on the device structure so it doesn't get freed and the kobj.parent trick seems to accomplish this fairly well. Really, in any situation where there's a cdev and a device in the same structure, the life cycles of the two become linked but their reference counts are not and that is the problem here. I feel like a number of developers have made the same realization independently so this would probably be a good thing to clean up. See these commits for examples spanning around 5 years: 4a215aa Input: fix use-after-free introduced with dynamic minor changes 0d5b7da iio: Prevent race between IIO chardev opening and IIO device ba0ef85 tpm: Fix initialization of the cdev Also, seems both my brother and I have independently looked into this exact issue: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html So, I think it would be a good idea to clean it up with Dan's proposed API cleanup. If I have some time tomorrow, I may throw together a patch set with that patch and the changes to the instances around the kernel. Logan
Re: [PATCH] device-dax: don't set kobj parent during cdev init
On 11/02/17 11:27 AM, Dan Williams wrote: > Why, when the lifetime of the cdev is already correct? Well, it's only correct if you use the kobj parent trick which Greg is arguing against. As someone reviewing/copying code that trick is unclear, undocumented and it looks rather odd messing with internal kobjects. Taking the explicit reference would be very clear, very standard and only net one additional line. > See commit ba09c01d2fa8 "dax: convert to the cdev api". I used to take > explicit references like you suggest, but cdev made it cleaner. I agree that, on the whole, that patch makes things a good deal cleaner. I'm not so sure that this one small aspect is an improvement. In any case, it's up to you. If you'd like I can certainly submit a v2 patch that adds the get/put. Thanks, Logan
Re: [PATCH] device-dax: don't set kobj parent during cdev init
Hey, I like the interface. It's just Greg that needs to comment on whether using the kobj.parent for this purpose is actually sane. That was his argument from the beginning. Logan On 13/02/17 01:47 PM, Dan Williams wrote: > On Sat, Feb 11, 2017 at 9:42 PM, Logan Gunthorpe <log...@deltatee.com> wrote: >> On 11/02/17 11:58 AM, Dan Williams wrote: >>> Also when using an embedded cdev how would you recommend avoiding this >>> problem? >> >> I don't know. Hopefully, Greg has a good idea. But it sounds like a >> general problem that a lot of cdev's actually suffer from without >> realizing. Perhaps we need a more general solution. Some way for the >> cdev to reference its containing structure in a way that it's designed >> for such that anyone writing a driver will do the right thing without >> needing to dive into the kobjects. >> > > How about something like the below? I.e. hide the details with a new > helper api so that all driver writers need to worry about is the > parent device and cdev_del(). This is similar to the device_add_disk() > and del_gendisk() pairing that we have for block-device drivers. > > diff --git a/fs/char_dev.c b/fs/char_dev.c > index 3bc97002c86f..4c5d51cdd353 100644 > --- a/fs/char_dev.c > +++ b/fs/char_dev.c > @@ -471,6 +471,12 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count) > return 0; > } > > +int device_add_cdev(struct device *dev, struct cdev *p) > +{ > + p->kobj.parent = >kobj; > + return cdev_add(p, dev->devt, 1); > +} > + > static void cdev_unmap(dev_t dev, unsigned count) > { > kobj_unmap(cdev_map, dev, count); > diff --git a/include/linux/cdev.h b/include/linux/cdev.h > index f8763615a5f2..043168df1d8f 100644 > --- a/include/linux/cdev.h > +++ b/include/linux/cdev.h > @@ -25,6 +25,7 @@ struct cdev *cdev_alloc(void); > void cdev_put(struct cdev *p); > > int cdev_add(struct cdev *, dev_t, unsigned); > +int device_add_cdev(struct device *dev, struct cdev *); > > void cdev_del(struct cdev *); >
Re: [PATCH] device-dax: don't set kobj parent during cdev init
On 11/02/17 11:58 AM, Dan Williams wrote: > Also when using an embedded cdev how would you recommend avoiding this > problem? I don't know. Hopefully, Greg has a good idea. But it sounds like a general problem that a lot of cdev's actually suffer from without realizing. Perhaps we need a more general solution. Some way for the cdev to reference its containing structure in a way that it's designed for such that anyone writing a driver will do the right thing without needing to dive into the kobjects. Logan
Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver
Hey Greg, Thanks so much for the review. On 10/02/17 07:51 AM, Greg Kroah-Hartman wrote: > On Thu, Feb 02, 2017 at 11:06:00AM -0700, Logan Gunthorpe wrote: >> +cdev = >cdev; >> +cdev_init(cdev, _fops); >> +cdev->owner = THIS_MODULE; >> +cdev->kobj.parent = >kobj; > > Minor nit, the kobject in a cdev is unlike any other kobject you have > ever seen, don't mess with it, it's not doing anything like you think it > is doing. So no need to set the parent field. Ok, that makes sense. I'll do a v3 shortly. I copied this from drivers/dax/dax.c so when I have a spare moment I'll submit a patch to remove it from there as well. Just to make sure I get this right without extra churn: does this look correct? cdev = >cdev; cdev_init(cdev, _fops); cdev->owner = THIS_MODULE; rc = cdev_add(>cdev, dev->devt, 1); if (rc) goto err_cdev; dev = >dev; dev->devt = MKDEV(MAJOR(switchtec_devt), minor); dev->class = switchtec_class; dev->parent = >dev; dev->groups = switchtec_device_groups; dev->release = stdev_release; dev_set_name(dev, "switchtec%d", minor); rc = device_register(dev); if (rc) { cdev_del(>cdev); put_device(dev); return ERR_PTR(rc); } Thanks, Logan
[PATCH] device-dax: don't set kobj parent during cdev init
I copied this code and per feedback from Greg Kroah-Hartman [1] the cdev's kobject's parent should not be set to the related device. This should have minor consequences but isn't doing what anyone expects it to. This patch then fixes device-dax so it doesn't make the same mistake. [1] https://lkml.org/lkml/2017/2/10/370 Signed-off-by: Logan Gunthorpe <log...@deltatee.com> --- drivers/dax/dax.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c index ed758b7..24e53b7 100644 --- a/drivers/dax/dax.c +++ b/drivers/dax/dax.c @@ -699,13 +699,9 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region, goto err_inode; } - /* device_initialize() so cdev can reference kobj parent */ - device_initialize(dev); - cdev = _dev->cdev; cdev_init(cdev, _fops); cdev->owner = parent->driver->owner; - cdev->kobj.parent = >kobj; rc = cdev_add(_dev->cdev, dev_t, 1); if (rc) goto err_cdev; @@ -722,7 +718,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region, dev->groups = dax_attribute_groups; dev->release = dax_dev_release; dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id); - rc = device_add(dev); + rc = device_register(dev); if (rc) { put_device(dev); return ERR_PTR(rc); -- 2.1.4
Re: [PATCH] device-dax: don't set kobj parent during cdev init
Hey, Also on the subject of very minor fixes: I noticed drivers/dax is not in the maintainers file. I just assumed the nvdimm list should have been included with those from get_maintainers. Thanks, Logan On 10/02/17 12:19 PM, Logan Gunthorpe wrote: > I copied this code and per feedback from Greg Kroah-Hartman [1] the > cdev's kobject's parent should not be set to the related device. > This should have minor consequences but isn't doing what anyone > expects it to. > > This patch then fixes device-dax so it doesn't make the same mistake. > > [1] https://lkml.org/lkml/2017/2/10/370 > > Signed-off-by: Logan Gunthorpe <log...@deltatee.com> > --- > drivers/dax/dax.c | 6 +- > 1 file changed, 1 insertion(+), 5 deletions(-) > > diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c > index ed758b7..24e53b7 100644 > --- a/drivers/dax/dax.c > +++ b/drivers/dax/dax.c > @@ -699,13 +699,9 @@ struct dax_dev *devm_create_dax_dev(struct dax_region > *dax_region, > goto err_inode; > } > > - /* device_initialize() so cdev can reference kobj parent */ > - device_initialize(dev); > - > cdev = _dev->cdev; > cdev_init(cdev, _fops); > cdev->owner = parent->driver->owner; > - cdev->kobj.parent = >kobj; > rc = cdev_add(_dev->cdev, dev_t, 1); > if (rc) > goto err_cdev; > @@ -722,7 +718,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region > *dax_region, > dev->groups = dax_attribute_groups; > dev->release = dax_dev_release; > dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id); > - rc = device_add(dev); > + rc = device_register(dev); > if (rc) { > put_device(dev); > return ERR_PTR(rc); >
[PATCH] switchtec: cleanup cdev init
Per a suggestion from Greg Kroah-Hartman [1], don't set the cdev's kobject parent. This allows us to use device_register instead of init and add. [1] https://lkml.org/lkml/2017/2/10/370 --- drivers/pci/switch/switchtec.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index 82bfd18..014eaec 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -1222,24 +1222,23 @@ static struct switchtec_dev *stdev_create(struct pci_dev *pdev) return ERR_PTR(minor); dev = >dev; - device_initialize(dev); dev->devt = MKDEV(MAJOR(switchtec_devt), minor); - dev->class = switchtec_class; - dev->parent = >dev; - dev->groups = switchtec_device_groups; - dev->release = stdev_release; - dev_set_name(dev, "switchtec%d", minor); cdev = >cdev; cdev_init(cdev, _fops); cdev->owner = THIS_MODULE; - cdev->kobj.parent = >kobj; rc = cdev_add(>cdev, dev->devt, 1); if (rc) goto err_cdev; - rc = device_add(dev); + dev->class = switchtec_class; + dev->parent = >dev; + dev->groups = switchtec_device_groups; + dev->release = stdev_release; + dev_set_name(dev, "switchtec%d", minor); + + rc = device_register(dev); if (rc) { cdev_del(>cdev); put_device(dev); -- 2.1.4
Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver
On 10/02/17 10:09 AM, Greg Kroah-Hartman wrote: > Sure, or just wait for these to be applied to the PCI tree and then send > a follow-on patch. It's up to Bjorn really, as to what he wants. Ok, I sent a working follow-on patch to this thread. @Bjorn: I'm happy to send the patches however you like. Just let me know. If at all possible, we'd really like to see this in for the 4.11 merge window. Thanks, Logan
Re: [PATCH v2 1/4] MicroSemi Switchtec management interface driver
On 10/02/17 09:55 AM, Greg Kroah-Hartman wrote: > Yes, but try it yourself to verify it really is correct :) Yes, of course... turns out it wasn't. I accidentally refereed to dev before I assigned it. I had mainly just wanted your feedback to ensure that switching to device_register made sense. > And it can just be an add-on patch, no need to respin a whole new > version for just that simple change, it doesn't hurt anything as-is, > it's just "not needed". Ok... How should I do that exactly? Attempt to reply to the thread with another patch? Thanks, Logan
Re: [PATCH] device-dax: don't set kobj parent during cdev init
On 11/02/17 01:56 AM, Dan Williams wrote: When the device is unregistered it invalidates all existing mappings, but the driver may continue to service vm fault requests until the final put of the cdev. Until that time the fault handler needs to be able to check dax_dev->alive. Since the final cdev put is handled by the vfs I use the cdev's kobject to keep the struct dax_dev instance alive. I'm just taking a wild stab at this, but would it not make sense to just take a reference to the dax_dev device in dax_open and put it back it in dax_release? (Or perhaps, in the open/close of the vm_ops.) That way the structure won't be free'd until there are no users and alive will always be accessible. It would also be a bit more clear as to what's going on because you are actually making a reference in filp->private_data. Logan
[PATCH v5 3/4] switchtec: Add sysfs attributes to the Switchtec driver
This patch adds a few read-only sysfs attributes which provide some device information that is exposed from the devices. Primarily component and device names and versions. These are documented in Documentation/ABI/testing/sysfs-class-switchtec. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- Documentation/ABI/testing/sysfs-class-switchtec | 96 MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 113 3 files changed, 210 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec new file mode 100644 index 000..48cb4c1 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -0,0 +1,96 @@ +switchtec - Microsemi Switchtec PCI Switch Management Endpoint + +For details on this subsystem look at Documentation/switchtec.txt. + +What: /sys/class/switchtec +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: The switchtec class subsystem folder. + Each registered switchtec driver is represented by a switchtecX + subfolder (X being an integer >= 0). + + +What: /sys/class/switchtec/switchtec[0-9]+/component_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component identifier as stored in the hardware (eg. PM8543) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component revision stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/component_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Component vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/device_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Device version as stored in the hardware (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/fw_version +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Currently running firmware version (read only) +Values:integer (in hexadecimal). + + +What: /sys/class/switchtec/switchtec[0-9]+/partition +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Partition number for this device in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/partition_count +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Total number of partitions in the switch (read only) +Values:integer. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_id +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product identifier as stored in the hardware (eg. PSX 48XG3) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_revision +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product revision stored in the hardware (eg. RevB) + (read only) +Values:arbitrary string. + + +What: /sys/class/switchtec/switchtec[0-9]+/product_vendor +Date: 05-Jan-2017 +KernelVersion: v4.11 +Contact: Logan Gunthorpe <log...@deltatee.com> +Description: Product vendor as stored in the hardware (eg. MICROSEM) + (read only) +Values:arbitrary string. diff --git a/MAINTAINERS b/MAINTAINERS index aa702b0..6fed938 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9513,6 +9513,7 @@ M:Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained F: Documentation/switchtec.txt +F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA diff --gi
[PATCH v5 0/4] New Microsemi PCI Switch Management Driver
Changes since v4: * Turns out pushing the pci release code into the device release function didn't work as I would have liked. If you try to unbind the device with an instance open, then you hit a kernel bug at drivers/pci/msi.c:371. (This didn't occur in v3.) To solve this, we've moved the pci release code back into the unregister function and reintroduced an alive flag. This time, however, the alive flag is protected by mrpc_mutex and we're very careful about what happens to devices still in use (they should all be released through the timeout path and an ENODEV error returned to userspace; while new commands are blocked with the same error). Changes since v3: * Removed stdev_is_alive and moved pci deinit functions into the device release function which only occurs after all cdev instances are closed. (per Bjorn) * Reduced locking in read/write functions (per Bjorn) * Use pci_alloc_irq_vectors to vastly improve ISR initialization (Bjorn) * A few other minor nits and cleanup as noticed by Bjorn Changes since v2: * Collected reviewed and tested tags * Very minor fix to the error path in the create function Changes since v1: * Rebased onto 4.10-rc6 (cleanly) * Split the patch into a few more easily digestible patches (as suggested by Greg Kroah-Hartman) * Folded switchtec.c into switchtec.h (per Greg) * Fixed a bunch of 32bit build warnings caught by the kbuild test robot * Fixed some issues in the documentation so it has a proper reStructredText format (as noted by Jonathan Corbet) * Fixed padding and sizes in the IOCTL structures as noticed by Emil Velikov and used pahole to verify their consistency across 32 and 64 bit builds * Reworked one of the IOCTL interfaces to be more future proof (per Emil). Changes since RFC: * Fixed incorrect use of the drive model as pointed out by Greg Kroah-Hartman * Used devm functions as suggested by Keith Busch * Added a handful of sysfs attributes to the switchtec class * Added a handful of IOCTLs to the switchtec device * A number of miscellaneous bug fixes -- Hi, This is a continuation of the RFC we posted lasted month [1] which proposes a management driver for Microsemi's Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform To make this entirely clear: the Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations. This is a driver for that function. See the patch for more information. Since the RFC, we've made the changes requested by Greg Kroah-Hartman and Keith Busch, and we've also fleshed out a number of features. We've added a couple of IOCTLs and sysfs attributes which are documented in the patch. Significant work has also been done on the userspace tool which is available under a GPL license at [2]. We've also had testing done by some of the interested parties. We hope to see this work included in either 4.11 or 4.12 assuming a smooth review process. The patch is based off of the v4.10-rc6 release. Thanks for your review, Logan [1] https://www.spinics.net/lists/linux-pci/msg56897.html [2] https://github.com/sbates130272/switchtec-user Logan Gunthorpe (4): MicroSemi Switchtec management interface driver switchtec: Add user interface documentation switchtec: Add sysfs attributes to the Switchtec driver switchtec: Add IOCTLs to the Switchtec driver Documentation/ABI/testing/sysfs-class-switchtec | 96 ++ Documentation/ioctl/ioctl-number.txt|1 + Documentation/switchtec.txt | 80 ++ MAINTAINERS | 11 + drivers/pci/Kconfig |1 + drivers/pci/Makefile|1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile |1 + drivers/pci/switch/switchtec.c | 1535 +++ include/uapi/linux/switchtec_ioctl.h| 132 ++ 10 files changed, 1871 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-switchtec create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 include/uapi/linux/switchtec_ioctl.h -- 2.1.4
[PATCH v5 1/4] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already well supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint with a separate PCI function address and class code. This endpoint enables some additional functionality which includes: * Packet and Byte Counters * Switch Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides PCI driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. A userspace tool and library which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. The tool is largely complete at this time but additional features may be added in the future. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- MAINTAINERS| 8 + drivers/pci/Kconfig| 1 + drivers/pci/Makefile | 1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile| 1 + drivers/pci/switch/switchtec.c | 955 + 6 files changed, 979 insertions(+) create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c diff --git a/MAINTAINERS b/MAINTAINERS index 527d137..a57686f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9506,6 +9506,14 @@ S: Maintained F: Documentation/devicetree/bindings/pci/aardvark-pci.txt F: drivers/pci/host/pci-aardvark.c +PCI DRIVER FOR MICROSEMI SWITCHTEC +M: Kurt Schwemmer <kurt.schwem...@microsemi.com> +M: Stephen Bates <stephen.ba...@microsemi.com> +M: Logan Gunthorpe <log...@deltatee.com> +L: linux-...@vger.kernel.org +S: Maintained +F: drivers/pci/switch/switchtec* + PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> L: linux-te...@vger.kernel.org diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 6555eb7..f72e8c5 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -133,3 +133,4 @@ config PCI_HYPERV source "drivers/pci/hotplug/Kconfig" source "drivers/pci/host/Kconfig" +source "drivers/pci/switch/Kconfig" diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 8db5079..15b46dd 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -68,3 +68,4 @@ ccflags-$(CONFIG_PCI_DEBUG) := -DDEBUG # PCI host controller drivers obj-y += host/ +obj-y += switch/ diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig new file mode 100644 index 000..4c49648 --- /dev/null +++ b/drivers/pci/switch/Kconfig @@ -0,0 +1,13 @@ +menu "PCI switch controller drivers" + depends on PCI + +config PCI_SW_SWITCHTEC + tristate "MicroSemi Switchtec PCIe Switch Management Driver" + help +Enables support for the management interface for the MicroSemi +Switchtec series of PCIe switches. Supports userspace access +to submit MRPC commands to the switch via /dev/switchtecX +devices. See for more +information. + +endmenu diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile new file mode 100644 index 000..37d8cfb --- /dev/null +++ b/drivers/pci/switch/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c new file mode 100644 index 000..4aaf89b --- /dev/null +++ b/drivers/pci/switch/switchtec.c @@ -0,0 +1,955 @@ +/* + * Microsemi Switchtec(tm) PCIe Management Driver + * Copyright (c) 2017, Microsemi Corporation + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_DESCRIPTION("Microsemi Switchtec(tm) PCIe Management Driver"); +MODULE_VERSION("0.1"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Microsemi Corporation"
[PATCH v5 4/4] switchtec: Add IOCTLs to the Switchtec driver
This patch introduces a couple of special IOCTLs which are provided to: * Inform userspace of firmware partition locations * Pass event counts and allow userspace to wait on events * Translate between PFF numbers used by the switch to port numbers. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/switchtec.txt | 27 ++ MAINTAINERS | 1 + drivers/pci/switch/switchtec.c | 467 +++ include/uapi/linux/switchtec_ioctl.h | 132 ++ 5 files changed, 628 insertions(+) create mode 100644 include/uapi/linux/switchtec_ioctl.h diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 81c7f2b..032b33c 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -191,6 +191,7 @@ Code Seq#(hex) Include FileComments 'W'00-1F linux/watchdog.hconflict! 'W'00-1F linux/wanrouter.h conflict! (pre 3.9) 'W'00-3F sound/asound.h conflict! +'W'40-5F drivers/pci/switch/switchtec.c 'X'all fs/xfs/xfs_fs.h conflict! and fs/xfs/linux-2.6/xfs_ioctl32.h and include/linux/falloc.h diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt index 4bced4c..a0a9c7b 100644 --- a/Documentation/switchtec.txt +++ b/Documentation/switchtec.txt @@ -51,3 +51,30 @@ The char device has the following semantics: * The poll call will also be supported for userspace applications that need to do other things while waiting for the command to complete. + +The following IOCTLs are also supported by the device: + +* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number + of partitions in the device. + +* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for + any specified partition in flash. + +* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps + indicating all uncleared events. + +* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags + for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct + with the event_id, index and flags set (index being the partition or PFF + number for non-global events). It returns whether the event has + occurred, the number of times and any event specific data. The flags + can be used to clear the count or enable and disable actions to + happen when the event occurs. + By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, + you can set an event to trigger a poll command to return with + POLLPRI. In this way, userspace can wait for events to occur. + +* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert + between PCI Function Framework number (used by the event system) + and Switchtec Logic Port ID and Partition number (which is more + user friendly). diff --git a/MAINTAINERS b/MAINTAINERS index 6fed938..c1a9a30 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9515,6 +9515,7 @@ S:Maintained F: Documentation/switchtec.txt F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* +F: include/uapi/linux/switchtec_ioctl.h PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index 2e47d79..94384e7 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -13,6 +13,8 @@ * */ +#include + #include #include #include @@ -740,6 +742,417 @@ static unsigned int switchtec_dev_poll(struct file *filp, poll_table *wait) return ret; } +static int ioctl_flash_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_info __user *uinfo) +{ + struct switchtec_ioctl_flash_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + + info.flash_length = ioread32(>flash_length); + info.num_partitions = SWITCHTEC_IOCTL_NUM_PARTITIONS; + + if (copy_to_user(uinfo, , sizeof(info))) + return -EFAULT; + + return 0; +} + +static void set_fw_info_part(struct switchtec_ioctl_flash_part_info *info, +struct partition_info __iomem *pi) +{ + info->address = ioread32(>address); + info->length = ioread32(>length); +} + +static int ioctl_flash_part_info(struct switchtec_dev *stdev, + struct switchtec_ioctl_flash_part_info __user *uinfo) +{ + struct switchtec_ioctl_flash_part_info info = {0}; + struct flash_info_regs __iomem *fi = stdev->mmio_flash_info; + u32 active_addr = -1; + + if (copy
[PATCH v5 2/4] switchtec: Add user interface documentation
This adds standard documentation for the sysfs switchtec attributes and a RST formatted text file which documents the char device interface. Jonathan Corbet has indicated he will move this to a new user-space developer documentation book once it's created. Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> Tested-by: Krishna Dhulipala <krish...@fb.com> Reviewed-by: Wei Zhang <wzh...@fb.com> Reviewed-by: Jens Axboe <ax...@fb.com> --- Documentation/switchtec.txt | 53 + MAINTAINERS | 1 + 2 files changed, 54 insertions(+) create mode 100644 Documentation/switchtec.txt diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt new file mode 100644 index 000..4bced4c --- /dev/null +++ b/Documentation/switchtec.txt @@ -0,0 +1,53 @@ + +Linux Switchtec Support + + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + +* Packet and Byte Counters +* Firmware Upgrades +* Event and Error logs +* Querying port link status +* Custom user firmware commands + +The switchtec kernel module implements this functionality. + + +Interface += + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4 bytes return code and up to 1KB of command specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + +* A write must consist of at least 4 bytes and no more than 1028 bytes. + The first four bytes will be interpreted as the command to run and + the remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + +* Each write must be followed by exactly one read. Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + +* A read will block until the firmware completes the command and return + the four bytes of status plus up to 1024 bytes of output data. (The + length will be specified by the size parameter of the read call -- + reading less than 4 bytes will produce an error. + +* The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. diff --git a/MAINTAINERS b/MAINTAINERS index a57686f..aa702b0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9512,6 +9512,7 @@ M:Stephen Bates <stephen.ba...@microsemi.com> M: Logan Gunthorpe <log...@deltatee.com> L: linux-...@vger.kernel.org S: Maintained +F: Documentation/switchtec.txt F: drivers/pci/switch/switchtec* PCI DRIVER FOR NVIDIA TEGRA -- 2.1.4
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
> This driver doesn't have anything to do with the PCI core, other than > using the pci_register_driver() interface (just like all other drivers > for PCI-connected devices), so drivers/pci doesn't really feel like > the right place for it. Putting it in drivers/pci leads to the sort > of confusion you mentioned above ("To make this entirely clear ..."). > > Would drivers/perf or drivers/misc/switchtec/ be possible places for > it? I made a similar argument when we made the decision of where to put the code. In the end, the device _is_ a PCI Switch and someone going through menuconfig or the source tree would probably look there first. As for drivers/perf, our device does a fair bit more than performance counters so it doesn't seem like it really fits in there. drivers/misc just seems like a dumping ground which we'd prefer not to contribute to. We also considered drivers/char (seeing it exposes a char device), but that also seems like a dumping ground with stuff that belongs and other stuff that just ended up stuck between the cracks. If you still feel strongly about this we can move it into misc, but I think from an organizational perspective pci/switch makes a bit more sense. In any case, I also wish we could have had this discussion 3 months ago when we posted the RFC and not when I have people pushing to get this merged. Thanks, Logan
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 05:32 PM, Bjorn Helgaas wrote: > It's not a perfect fit in drivers/pci because it's not bus > infrastructure and I don't want to be the default maintainer of it, > but I agree there's not really an alternative that's clearly better, > so let's leave it where it is for now. Sounds good, thanks. It would be nice if it could stay where it is for organization but go through other maintainers trees (though I'm not sure who). I understand why you wouldn't want to take on the maintenance of it. Hopefully, myself and the other Microsemi developers will be able to do the bulk of the maintenance work for the driver. Thanks, Logan
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 04:58 PM, Jason Gunthorpe wrote: > On Wed, Mar 01, 2017 at 03:49:04PM -0700, Logan Gunthorpe wrote: > >> Seems to me like an elegant solution would be to implement a 'cdev_kill' >> function which could kill all the processes using a cdev. Thus, during >> an unbind, a driver could call it and be sure that there are no users >> left and it can safely allow the devres unwind to continue. Then no >> difficult and racy 'alive' flags would be necessary and it would be much >> easier on drivers. > > That could help, but this would mean cdev would have to insert a shim > to grab locks around the various file ops. Hmm, I was hoping something more along the lines of actually killing the processes instead of just shimming away fops. > AFAIK TPM is correct and has been robustly tested now. We have a 'vtpm' > driver that agressively uses hot-unplug. Ah, thanks for the explanation of how that works. I didn't notice the semaphore usage. Switchtec is a bit more tricky because a) there's no upper level driver to handle things and b) userspace may be inside a wait_for_completion (via read or poll) that needs to be completed. If a so called 'cdev_kill' could actually just kill these processes it would be a bit easier. Currently, in the Switchtec code, there's a timeout if the interrupt doesn't fire (which occurs if the pci device has been torn down) and the code will check an alive bit (under mutex protection) and error out if it's not alive. Because of how poll works, I don't see how I can just hold a semaphore inside every fops call like the tpm code does. Logan
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 05:23 PM, Logan Gunthorpe wrote: > Because of how poll works, I don't see how I can just hold a semaphore > inside every fops call like the tpm code does. On 01/03/17 04:58 PM, Jason Gunthorpe wrote: > Both infiniband and TPM have the 'detach' flavour of unplug where the > user is not forced to close their open cdevs. For simpler cases you > can just wait for all cdevs to close with a simple rwsem, much like > sysfs does already during device_del. Though, after reading your email again, a hard fence on close sounds like a good idea. Tomorrow I'll post a v6 that uses that approach. Thanks, Logan
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 03:26 PM, Keith Busch wrote: > I think this is from using the managed device resource API to request the > irq actions. The scope of the resource used to be tied to the pci_dev's > dev, but now it's the new switchec class dev, which has a different > lifetime while open references exist, so it's not releasing the irq's. The scope of the IRQ was originally tied to the pci_dev. Then in v4 I tied it to the switchtec device in order to try and keep using the pci device after unbind. This didn't work, so I switched it back to using the pci_dev. (This seems to be the way most drivers work anyway.) > One thing about the BUG_ON that is confusing me is how it's getting > to free_msi_irq's BUG in v4 or v5. I don't see any part releasing the > allocated ones. Maybe the devres API is harder to use than having the > driver manage all the resources... free_msi_irqs seems to be called via pci_disable_device in pcim_release which devres will call during release of the PCI device and before all the references to the pci_dev are freed (I tried adding an extra get_device which gets put in the child devices release -- this didn't work): [ 1079.845616] Call Trace: [ 1079.845652] ? pcim_release+0x35/0x96 [ 1079.845691] ? release_nodes+0x15b/0x17c [ 1079.845730] ? device_release_driver_internal+0x12d/0x1cb [ 1079.845771] ? unbind_store+0x59/0x89 [ 1079.845809] ? kernfs_fop_write+0xe7/0x129 [ 1079.845847] ? __vfs_write+0x1c/0xa2 [ 1079.845885] ? kmem_cache_alloc+0xc5/0x131 [ 1079.845923] ? fput+0xd/0x7d [ 1079.845958] ? filp_close+0x5a/0x61 [ 1079.845993] ? vfs_write+0xa2/0xe4 [ 1079.846028] ? SyS_write+0x48/0x73 [ 1079.846066] ? entry_SYSCALL_64_fastpath+0x13/0x94 v5 is correct because it registers the irqs against the pci_dev (with devm_request_irq) and thus they get freed in time as part of the devres unwind. Logan
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 02:41 PM, Bjorn Helgaas wrote: > I don't think this is indicating a bug in the PCI core (although I do > think a BUG_ON() here is an excessive response). I think it's an > indication that the driver didn't disconnect its ISR. Without more > details of the failure it's hard to tell if the BUG_ON is a symptom of > a problem in the driver or what. Yes, my assumption was that when you force an unbind on the PCI core, it's designed to stop using the PCI device right away even if there are users using it. Thus it becomes the drivers responsibility to handle this situation. > An "alive" flag feels racy, and I can't tell if it's really the best > way to deal with this, or if it's just avoiding the issue. There must > be other drivers with the same cleanup issue -- do they handle it the > same way? I haven't done a comprehensive search, but it's very common for people to use (and this is what I've adopted again in v5): devm_request_irq(>dev, ...) In this way, the IRQs are released with the pci_dev (or often platform) and thus the BUG_ON never hits. However, it means any user space program waiting on an IRQ (like via a cdev call) will hang unless handled with other means. Exactly what those means are seems driver specific and not always obvious. I wouldn't be surprised if a lot of drivers get this aspect wrong. A couple examples I've looked at: 1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or anything. So I don't know if it's racy or perhaps correct for other reasons. 2) drivers/char/hw_random has a drop_current_rng that looks like it could easily be racy with the get_current_rng in the userspace flow. 3) A couple of drivers drivers/char/tpm doesn't seem to have any protection at all and appears like they would continue to use io operations even after the they may get unmapped because the char device persists. So I'm not sure where you'd find a driver that does it correctly and in a simpler way.. Another thing: based on comments in [1], a lot of people don't seem to realize that cdev instances can persist long after cdev_del so it's probably very common for drivers to get this wrong. Logan [1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html >> To solve this, we've moved the pci release code back into the >> unregister function and reintroduced an alive flag. This time, >> however, the alive flag is protected by mrpc_mutex and we're very >> careful about what happens to devices still in use (they should >> all be released through the timeout path and an ENODEV error >> returned to userspace; while new commands are blocked with the >> same error).
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
Hey, Seems to me like an elegant solution would be to implement a 'cdev_kill' function which could kill all the processes using a cdev. Thus, during an unbind, a driver could call it and be sure that there are no users left and it can safely allow the devres unwind to continue. Then no difficult and racy 'alive' flags would be necessary and it would be much easier on drivers. However, I don't think any such thing exists at the moment and it's not likely to be done in the near term. I'm reasonably confident in the correctness of v5 of my driver (especially when compared to other drivers) and unless someone can describe how it's wrong or a better solution I'd rather see it merged as is. If and when a better approach arrives I'd happily patch it to improve the situation. Logan On 01/03/17 03:24 PM, Logan Gunthorpe wrote: > > > On 01/03/17 02:41 PM, Bjorn Helgaas wrote: >> I don't think this is indicating a bug in the PCI core (although I do >> think a BUG_ON() here is an excessive response). I think it's an >> indication that the driver didn't disconnect its ISR. Without more >> details of the failure it's hard to tell if the BUG_ON is a symptom of >> a problem in the driver or what. > > Yes, my assumption was that when you force an unbind on the PCI core, > it's designed to stop using the PCI device right away even if there are > users using it. Thus it becomes the drivers responsibility to handle > this situation. > >> An "alive" flag feels racy, and I can't tell if it's really the best >> way to deal with this, or if it's just avoiding the issue. There must >> be other drivers with the same cleanup issue -- do they handle it the >> same way? > > I haven't done a comprehensive search, but it's very common for people > to use (and this is what I've adopted again in v5): > > devm_request_irq(>dev, ...) > > In this way, the IRQs are released with the pci_dev (or often platform) > and thus the BUG_ON never hits. However, it means any user space program > waiting on an IRQ (like via a cdev call) will hang unless handled with > other means. Exactly what those means are seems driver specific and not > always obvious. I wouldn't be surprised if a lot of drivers get this > aspect wrong. > > A couple examples I've looked at: > > 1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or > anything. So I don't know if it's racy or perhaps correct for other reasons. > > 2) drivers/char/hw_random has a drop_current_rng that looks like it > could easily be racy with the get_current_rng in the userspace flow. > > 3) A couple of drivers drivers/char/tpm doesn't seem to have any > protection at all and appears like they would continue to use io > operations even after the they may get unmapped because the char device > persists. > > So I'm not sure where you'd find a driver that does it correctly and in > a simpler way.. > > Another thing: based on comments in [1], a lot of people don't seem to > realize that cdev instances can persist long after cdev_del so it's > probably very common for drivers to get this wrong. > > Logan > > [1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html
Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver
On 01/03/17 03:59 PM, Keith Busch wrote: > Okay, I see. Was mistakenliy looking at v4. The v5 looks right. Great! Thanks for reviewing that for me. Logan
Re: [PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.
Hi Christoph, Thanks so much for the detailed review of the code! Even though by the sounds of things we will be moving to device dax and most of this is moot. Still, it's great to get some feedback and learn a few things. I've given some responses below. On 28/10/16 12:45 AM, Christoph Hellwig wrote: >> + * This driver is heavily based on drivers/block/pmem.c. >> + * Copyright (c) 2014, Intel Corporation. >> + * Copyright (C) 2007 Nick Piggin >> + * Copyright (C) 2007 Novell Inc. > > Is there anything left of it actually? I didn't spot anything > obvious. Nevermind that we don't have a file with that name anymore :) Yes, actually there's still a lot of similarities with the current pmem.c. Though, yes, the path was on oversight. Some of this code is getting pretty old (it started from an out-of-tree version of pmem.c) and we've tried our best to track as many of the changes to the pmem.c as possible. This proved to be difficult. Note: this is now the nvdimm pmem and not the dax pmem (drivers/nvdimm/pmem.c) >> + /* >> + * We can only access the iopmem device with full 32-bit word >> + * accesses which cannot be gaurantee'd by the regular memcpy >> + */ > > Odd comment formatting. Oops. I'm surprised check_patch didn't pick up on that. > >> +static void memcpy_from_iopmem(void *dst, const void *src, size_t sz) >> +{ >> +u64 *wdst = dst; >> +const u64 *wsrc = src; >> +u64 tmp; >> + >> +while (sz >= sizeof(*wdst)) { >> +*wdst++ = *wsrc++; >> +sz -= sizeof(*wdst); >> +} >> + >> +if (!sz) >> +return; >> + >> +tmp = *wsrc; >> +memcpy(wdst, , sz); >> +} > > And then we dod a memcpy here anyway. And no volatile whatsover, so > the compiler could do anything to it. I defintively feel a bit uneasy > about having this in the driver as well. Can we define the exact > semantics for this and define it by the system, possibly in an arch > specific way? Yeah, you're right. We should have reviewed this function a bit more. Anyway, I'd be interested in learning a better approach to forcing a copy from a mapped BAR with larger widths. >> +static void iopmem_do_bvec(struct iopmem_device *iopmem, struct page *page, >> + unsigned int len, unsigned int off, bool is_write, >> + sector_t sector) >> +{ >> +phys_addr_t iopmem_off = sector * 512; >> +void *iopmem_addr = iopmem->virt_addr + iopmem_off; >> + >> +if (!is_write) { >> +read_iopmem(page, off, iopmem_addr, len); >> +flush_dcache_page(page); >> +} else { >> +flush_dcache_page(page); >> +write_iopmem(iopmem_addr, page, off, len); >> +} > > How about moving the address and offset calculation as well as the > cache flushing into read_iopmem/write_iopmem and removing this function? Could do. This was copied from the existing pmem.c and once the bad_pmem stuff was stripped out this function became relatively simple. > >> +static blk_qc_t iopmem_make_request(struct request_queue *q, struct bio >> *bio) >> +{ >> +struct iopmem_device *iopmem = q->queuedata; >> +struct bio_vec bvec; >> +struct bvec_iter iter; >> + >> +bio_for_each_segment(bvec, bio, iter) { >> +iopmem_do_bvec(iopmem, bvec.bv_page, bvec.bv_len, >> +bvec.bv_offset, op_is_write(bio_op(bio)), >> +iter.bi_sector); > > op_is_write just checks the data direction. I'd feel much more > comfortable with a switch on the op, e.g. That makes sense. This was also copied from pmem.c, so this same change may make sense there too. >> +static long iopmem_direct_access(struct block_device *bdev, sector_t sector, >> + void **kaddr, pfn_t *pfn, long size) >> +{ >> +struct iopmem_device *iopmem = bdev->bd_queue->queuedata; >> +resource_size_t offset = sector * 512; >> + >> +if (!iopmem) >> +return -ENODEV; > > I don't think this can ever happen, can it? Yes, I think now that's the case. This is probably a holdover from a previous version. > Just use ida_simple_get/ida_simple_remove instead to take care > of the locking and preloading, and get rid of these two functions. Thanks, noted. That would be much better. I never found a simple example of that when I was looking, though I expected there should have been. > >> +static int iopmem_attach_disk(struct iopmem_device *iopmem) >> +{ >> +struct gendisk *disk; >> +int nid = dev_to_node(iopmem->dev); >> +struct request_queue *q = iopmem->queue; >> + >> +blk_queue_write_cache(q, true, true); > > You don't handle flush commands or the fua bit in make_request, so > this setting seems wrong. Yup, ok. I'm afraid this is a case of copying without complete comprehension. > >> +int err = 0; >> +int nid = dev_to_node(>dev); >> + >> +if (pci_enable_device_mem(pdev) < 0) { > > propagate the actual error code, please. Hmm,
Re: Enabling peer to peer device transactions for PCIe devices
On 25/11/16 06:06 AM, Christian König wrote: > Well Serguei send me a couple of documents about QPI when we started to > discuss this internally as well and that's exactly one of the cases I > had in mind when writing this. > > If I understood it correctly for such systems P2P is technical possible, > but not necessary a good idea. Usually it is faster to just use a > bouncing buffer when the peers are a bit "father" apart. > > That this problem is solved on newer hardware is good, but doesn't helps > us at all if we at want to support at least systems from the last five > years or so. Well we have been testing with Sandy Bridge, I think the problem was supposed to be fixed in Ivy but we never tested it so I can't say what the performance turned out to be. Ivy is nearly 5 years old. I expect this is something that will be improved more and more with subsequent generations. A white list may end up being rather complicated if it has to cover different CPU generations and system architectures. I feel this is a decision user space could easily make. Logan
Re: Enabling peer to peer device transactions for PCIe devices
On 23/11/16 02:55 PM, Jason Gunthorpe wrote: >>> Only ODP hardware allows changing the DMA address on the fly, and it >>> works at the page table level. We do not need special handling for >>> RDMA. >> >> I am aware of ODP but, noted by others, it doesn't provide a general >> solution to the points above. > > How do you mean? I was only saying it wasn't general in that it wouldn't work for IB hardware that doesn't support ODP or other hardware that doesn't do similar things (like an NVMe drive). It makes sense for hardware that supports ODP to allow MRs to not pin the underlying memory and provide for migrations that the hardware can follow. But most DMA engines will require the memory to be pinned and any complex allocators (GPU or otherwise) should respect that. And that seems like it should be the default way most of this works -- and I think it wouldn't actually take too much effort to make it all work now as is. (Our iopmem work is actually quite small and simple.) >> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are >> really the same option. iopmem is really just one way to get BAR >> addresses to user-space while inside the kernel it's ZONE_DEVICE. > > Seems fine for RDMA? Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE memory working for some time. I'd say it's a good fit. The main question we've had is how to expose PCIe bars to userspace to be used as MRs and such. Logan
Re: Enabling peer to peer device transactions for PCIe devices
Hey, On 22/11/16 11:59 AM, Serguei Sagalovitch wrote: > - How well we will be able to handle case when we need to "move"/"evict" >memory/data to the new location so CPU pointer should point to the > new physical location/address > (and may be not in PCI device memory at all)? IMO any memory that has been registered for a P2P transaction should be locked from being evicted. So if there's a get_user_pages call it needs to be pinned until the put_page. The main issue being with the RDMA case: handling an eviction when a chunk of memory has been registered as an MR would be very tricky. The MR may be relied upon by another host and the kernel would have to inform user-space the MR was invalid then user-space would have to tell the remote application. This seems like a lot of burden to place on applications and may be subject to timing issues. Either that or all RDMA applications need to be written with the assumption that their target memory could go away at any time. More generally, if you tell one PCI device to do a DMA transfer to another PCI device's BAR space, and the target memory gets evicted then DMA transaction needs to be aborted which means every driver doing the transfer would need special support for this. If the memory can be relied on to not be evicted than existing drivers should work unmodified (ie O_DIRECT to/from an NVMe card would just work). I feel the better approach is to pin memory subject to P2P transactions as is typically done with DMA transfers to main memory. Logan
Re: Enabling peer to peer device transactions for PCIe devices
On 23/11/16 01:33 PM, Jason Gunthorpe wrote: > On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote: > >>We do not want to have "highly" dynamic translation due to >>performance cost. We need to support "overcommit" but would >>like to minimize impact. To support RDMA MRs for GPU/VRAM/PCIe >>device memory (which is must) we need either globally force >>pinning for the scope of "get_user_pages() / "put_pages" or have >>special handling for RDMA MRs and similar cases. > > As I said, there is no possible special handling. Standard IB hardware > does not support changing the DMA address once a MR is created. Forget > about doing that. Yeah, that's essentially the point I was trying to make. Not to mention all the other unrelated hardware that can't DMA to an address that might disappear mid-transfer. > Only ODP hardware allows changing the DMA address on the fly, and it > works at the page table level. We do not need special handling for > RDMA. I am aware of ODP but, noted by others, it doesn't provide a general solution to the points above. > Like I said, this is the direction the industry seems to be moving in, > so any solution here should focus on VMAs/page tables as the way to link > the peer-peer devices. Yes, this was the appeal to us of using ZONE_DEVICE. > To me this means at least items #1 and #3 should be removed from > Alexander's list. It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are really the same option. iopmem is really just one way to get BAR addresses to user-space while inside the kernel it's ZONE_DEVICE. Logan
Re: Enabling peer to peer device transactions for PCIe devices
On 28/11/16 09:57 AM, Jason Gunthorpe wrote: >> On PeerDirect, we have some kind of a middle-ground solution for pinning >> GPU memory. We create a non-ODP MR pointing to VRAM but rely on >> user-space and the GPU not to migrate it. If they do, the MR gets >> destroyed immediately. > > That sounds horrible. How can that possibly work? What if the MR is > being used when the GPU decides to migrate? I would not support that > upstream without a lot more explanation.. Yup, this was our experience when playing around with PeerDirect. There was nothing we could do if the GPU decided to invalidate the P2P mapping. It just meant the application would fail or need complicated logic to detect this and redo just about everything. And given that it was a reasonably rare occurrence during development it probably means not a lot of applications will be developed to handle it and most would end up being randomly broken in environments with memory pressure. Logan
Re: Enabling peer to peer device transactions for PCIe devices
On 28/11/16 12:35 PM, Serguei Sagalovitch wrote: > As soon as PeerDirect mapping is called then GPU must not "move" the > such memory. It is by PeerDirect design. It is similar how it is works > with system memory and RDMA MR: when "get_user_pages" is called then the > memory is pinned. We haven't touch this in a long time and perhaps it changed, but there definitely was a call back in the PeerDirect API to allow the GPU to invalidate the mapping. That's what we don't want. Logan
Re: Enabling peer to peer device transactions for PCIe devices
Hey, On 24/11/16 02:45 AM, Christian König wrote: > E.g. it can happen that PCI device A exports it's BAR using ZONE_DEVICE. > Not PCI device B (a SATA device) can directly read/write to it because > it is on the same bus segment, but PCI device C (a network card for > example) can't because it is on a different bus segment and the bridge > can't handle P2P transactions. Yeah, that could be an issue but in our experience we have yet to see it. We've tested with two separate PCI buses on different CPUs connected through QPI links and it works fine. (It is rather slow but I understand Intel has improved the bottleneck in newer CPUs than the ones we tested.) It may just be older hardware that has this issue. I expect that as long as a failed transfer can be handled gracefully by the initiator I don't see a need to predetermine whether a device can see another devices memory. Logan
Re: Enabling peer to peer device transactions for PCIe devices
On 24/11/16 09:42 AM, Jason Gunthorpe wrote: > There are three cases to worry about: > - Coherent long lived page table mirroring (RDMA ODP MR) > - Non-coherent long lived page table mirroring (RDMA MR) > - Short lived DMA mapping (everything else) > > Like you say below we have to handle short lived in the usual way, and > that covers basically every device except IB MRs, including the > command queue on a NVMe drive. Yes, this makes sense to me. Though I thought regular IB MRs with regular memory currently pinned the pages (despite being long lived) that's why we can run up against the "max locked memory" limit. It doesn't seem so terrible if GPU memory had a similar restriction until ODP like solutions get implemented. >> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE >> memory working for some time. I'd say it's a good fit. The main question >> we've had is how to expose PCIe bars to userspace to be used as MRs and >> such. > Is there any progress on that? Well, I guess there's some consensus building to do. The existing options are: * Device DAX: which could work but the problem I see with it is that it only allows one application to do these transfers. Or there would have to be some user-space coordination to figure which application gets what memeroy. * Regular DAX in the FS doesn't work at this time because the FS can move the file you think your transfer to out from under you. Though I understand there's been some work with XFS to solve that issue. Though, we've been considering that the backed memory would be non-volatile which adds some of this complexity. If the memory were volatile the kernel would just need to do some relatively straight forward allocation to user-space when asked. For example, with NVMe, the kernel could give chunks of the CMB buffer to userspace via an mmap call to /dev/nvmeX. Though I think there's been some push back against things like that as well. > I still don't quite get what iopmem was about.. I thought the > objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX > over iopmem and still ending up with uncacheable mmaps still seems > like a non-starter to me... The latest incarnation of iopmem simply created a block device backed by ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and user-space could mmap the files and send them to other devices to do P2P transfers. I don't think there was a hard objection to uncachable ZONE_DEVICE and DAX. We did try our experimental hardware with cached ZONE_DEVICE and it did work but the performance was beyond unusable (which may be a hardware issue). In the end I feel the driver would have to decide the most appropriate caching for the hardware and I don't understand why WC or UC wouldn't work with ZONE_DEVICE. Logan
Re: Enabling peer to peer device transactions for PCIe devices
Hey, On 06/12/16 09:38 AM, Jason Gunthorpe wrote: >>> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the >>> device-dax instance under the nvme device, or if you already have the nvme >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory. >> >> Personally I think mapping the dax resource in the sysfs tree is a nice >> way to do this and a bit more intuitive than mapping a /dev/nvmeX. > > It is still not at all clear to me what userpsace is supposed to do > with this on nvme.. How is the CMB usable from userspace? The flow is pretty simple. For example to write to NVMe from an RDMA device: 1) Obtain a chunk of the CMB to use as a buffer(either by mmaping /dev/nvmx, the device dax char device or through a block layer interface (which sounds like a good suggestion from Christoph, but I'm not really sure how it would look). 2) Create an MR with the buffer and use an RDMA function to fill it with data from a remote host. This will cause the RDMA hardware to write directly to the memory in the NVMe card. 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem. When the address reaches hardware the NVMe will recognize it as local memory and copy it directly there. Thus we are able to transfer data to any file on an NVMe device without going through system memory. This has benefits on systems with lots of activity in system memory but step 3 is likely to be slowish due to the need to pin/unpin the memory for every transaction. Logan
[RFC 0/1] New PCI Switch Management Driver
Hi, [Appologies: this is a resend for some people. Due to a configuration error the original email was rejected by the mailing lists. I hope this one makes it!] We're looking to get some initial feedback on a new driver for a line of PCIe switches produced and produced and sold by Microsemi. The goal is to get the process moving to get this code included in upstream hopefully for 4.11. Facebook is currently gearing up to use this hardware in its Open Compute Platform and is pushing to have this driver in the upstream kernel. The following patch briefly describes the hardware and provides the first draft of driver code. Currently, the driver works and has been tested but is not feature complete. Thus, we are not looking to get it merged immediately. However we would like some early review, specifically on the interfaces and core concepts so that we don't do a lot of work down a path the community would reject. Barring any objections to this RFC, we will flesh out all the features and provide a completed patch for inclusion in the coming weeks. Work on a userspace tool, that utilizes this driver, is also being done at [1]. The tool is currently also a bit of a skeleton and will be fleshed out assuming there are no serious objections to our userspace interface. In the end, the tool will be released with a GPL license. The patch is based off of the v4.9 release. Thanks for your review, Logan [1] https://github.com/sbates130272/switchtec-user Logan Gunthorpe (1): MicroSemi Switchtec management interface driver Documentation/switchtec.txt| 54 +++ MAINTAINERS| 9 + drivers/pci/Kconfig| 1 + drivers/pci/Makefile | 1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile| 1 + drivers/pci/switch/switchtec.c | 824 + drivers/pci/switch/switchtec.h | 119 ++ 8 files changed, 1022 insertions(+) create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 drivers/pci/switch/switchtec.h -- 2.1.4
[RFC 1/1] MicroSemi Switchtec management interface driver
Microsemi's "Switchtec" line of PCI switch devices is already supported by the kernel with standard PCI switch drivers. However, the Switchtec device advertises a special management endpoint which enables some additional functionality. This includes: * Packet and Byte Counters * Firmware Upgrades * Event and Error logs * Querying port link status * Custom user firmware commands This patch introduces the switchtec kernel module which provides pci driver that exposes a char device. The char device provides userspace access to this interface through read, write and (optionally) poll calls. Currently no ioctls have been implemented but a couple may be added in a later revision. A short text file is provided which documents the switchtec driver and outlines the semantics of using the char device. A WIP userspace tool which utilizes this interface is available at [1]. This tool takes inspiration (and borrows some code) from nvme-cli [2]. [1] https://github.com/sbates130272/switchtec-user [2] https://github.com/linux-nvme/nvme-cli Signed-off-by: Logan Gunthorpe <log...@deltatee.com> Signed-off-by: Stephen Bates <stephen.ba...@microsemi.com> --- Documentation/switchtec.txt| 54 +++ MAINTAINERS| 9 + drivers/pci/Kconfig| 1 + drivers/pci/Makefile | 1 + drivers/pci/switch/Kconfig | 13 + drivers/pci/switch/Makefile| 1 + drivers/pci/switch/switchtec.c | 824 + drivers/pci/switch/switchtec.h | 119 ++ 8 files changed, 1022 insertions(+) create mode 100644 Documentation/switchtec.txt create mode 100644 drivers/pci/switch/Kconfig create mode 100644 drivers/pci/switch/Makefile create mode 100644 drivers/pci/switch/switchtec.c create mode 100644 drivers/pci/switch/switchtec.h diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt new file mode 100644 index 000..04657ce --- /dev/null +++ b/Documentation/switchtec.txt @@ -0,0 +1,54 @@ + +Linux Switchtec Support + + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + + * Packet and Byte Counters + * Firmware Upgrades + * Event and Error logs + * Querying port link status + * Custom user firmware commands + +The switchtec kernel module implements this functionality. + + + +Interface += + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4 bytes return code and up to 1KB of command specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + + * A write must consist of at least 4 bytes and no more than 1028 bytes. + The first four bytes will be interpreted as the command to run and + the remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + + * Each write must be followed by exactly one read. Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + + * A read will block until the firmware completes the command and return + the four bytes of status plus up to 1024 bytes of output data. (The + length will be specified by the size parameter of the read call -- + reading less than 4 bytes will produce an error. + + * The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. diff --git a/MAINTAINERS b/MAINTAINERS index 63cefa6..1e21505 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9288,6 +9288,15 @@ S: Maintained F: Documentation/devicetree/bindings/pci/aardvark-pci.txt F: drivers/pci/host/pci-aardvark.c +PCI DRIVER FOR MICROSEMI SWITCHTEC +M: Kurt Schwemmer <kurt.schwem...@microsemi.com> +M: Stephen Bates <stephen.ba...@microsemi.com> +M: Logan Gunthorpe <log...@deltatee.com> +L: linux-...@vger.kernel.org +S: Maintained +F: Documentation/switchtec.txt +F: drivers/pci/switch/switchtec* + PCI DRIVER FOR NVIDIA TEGRA M: Thierry Reding <thierry.red...@gmail.com> L: linux-te...@vger.kernel.org diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 6555eb7..f72e8c5 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -133,3 +133,4 @@ config PCI_HYPERV source "drivers/pci/hotplug
[PATCH 2/2] ntb_hw_intel: Style fixes: open code macros that just obfuscate code
As per a comments in [1] by Greg Kroah-Hartman, the ndev_* macros should be cleaned up. This makes it more clear what's actually going on when reading the code. [1] http://www.spinics.net/lists/linux-pci/msg56904.html Signed-off-by: Logan Gunthorpe <log...@deltatee.com> --- drivers/ntb/hw/intel/ntb_hw_intel.c | 145 ++-- drivers/ntb/hw/intel/ntb_hw_intel.h | 3 - 2 files changed, 73 insertions(+), 75 deletions(-) diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c index 7310a26..3c8ef1d 100644 --- a/drivers/ntb/hw/intel/ntb_hw_intel.c +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c @@ -254,12 +254,12 @@ static inline int ndev_db_addr(struct intel_ntb_dev *ndev, if (db_addr) { *db_addr = reg_addr + reg; - dev_dbg(ndev_dev(ndev), "Peer db addr %llx\n", *db_addr); + dev_dbg(>ntb.pdev->dev, "Peer db addr %llx\n", *db_addr); } if (db_size) { *db_size = ndev->reg->db_size; - dev_dbg(ndev_dev(ndev), "Peer db size %llx\n", *db_size); + dev_dbg(>ntb.pdev->dev, "Peer db size %llx\n", *db_size); } return 0; @@ -352,7 +352,8 @@ static inline int ndev_spad_addr(struct intel_ntb_dev *ndev, int idx, if (spad_addr) { *spad_addr = reg_addr + reg + (idx << 2); - dev_dbg(ndev_dev(ndev), "Peer spad addr %llx\n", *spad_addr); + dev_dbg(>ntb.pdev->dev, "Peer spad addr %llx\n", + *spad_addr); } return 0; @@ -390,7 +391,7 @@ static irqreturn_t ndev_interrupt(struct intel_ntb_dev *ndev, int vec) vec_mask = ndev_vec_mask(ndev, vec); - dev_dbg(ndev_dev(ndev), "vec %d vec_mask %llx\n", vec, vec_mask); + dev_dbg(>ntb.pdev->dev, "vec %d vec_mask %llx\n", vec, vec_mask); ndev->last_ts = jiffies; @@ -416,7 +417,7 @@ static irqreturn_t ndev_irq_isr(int irq, void *dev) { struct intel_ntb_dev *ndev = dev; - return ndev_interrupt(ndev, irq - ndev_pdev(ndev)->irq); + return ndev_interrupt(ndev, irq - ndev->ntb.pdev->irq); } static int ndev_init_isr(struct intel_ntb_dev *ndev, @@ -426,7 +427,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev, struct pci_dev *pdev; int rc, i, msix_count, node; - pdev = ndev_pdev(ndev); + pdev = ndev->ntb.pdev; node = dev_to_node(>dev); @@ -465,7 +466,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev, goto err_msix_request; } - dev_dbg(ndev_dev(ndev), "Using msix interrupts\n"); + dev_dbg(>dev, "Using msix interrupts\n"); ndev->db_vec_count = msix_count; ndev->db_vec_shift = msix_shift; return 0; @@ -493,7 +494,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev, if (rc) goto err_msi_request; - dev_dbg(ndev_dev(ndev), "Using msi interrupts\n"); + dev_dbg(>dev, "Using msi interrupts\n"); ndev->db_vec_count = 1; ndev->db_vec_shift = total_shift; return 0; @@ -511,7 +512,7 @@ static int ndev_init_isr(struct intel_ntb_dev *ndev, if (rc) goto err_intx_request; - dev_dbg(ndev_dev(ndev), "Using intx interrupts\n"); + dev_dbg(>dev, "Using intx interrupts\n"); ndev->db_vec_count = 1; ndev->db_vec_shift = total_shift; return 0; @@ -525,7 +526,7 @@ static void ndev_deinit_isr(struct intel_ntb_dev *ndev) struct pci_dev *pdev; int i; - pdev = ndev_pdev(ndev); + pdev = ndev->ntb.pdev; /* Mask all doorbell interrupts */ ndev->db_mask = ndev->db_valid_mask; @@ -559,7 +560,7 @@ static ssize_t ndev_debugfs_read(struct file *filp, char __user *ubuf, union { u64 v64; u32 v32; u16 v16; u8 v8; } u; ndev = filp->private_data; - pdev = ndev_pdev(ndev); + pdev = ndev->ntb.pdev; mmio = ndev->self_mmio; buf_size = min(count, 0x800ul); @@ -820,7 +821,8 @@ static void ndev_init_debugfs(struct intel_ntb_dev *ndev) ndev->debugfs_info = NULL; } else { ndev->debugfs_dir = - debugfs_create_dir(ndev_name(ndev), debugfs_dir); + debugfs_create_dir(pci_name(ndev->ntb.pdev), + debugfs_dir); if (!ndev->debugfs_dir) ndev->debugfs_info = NULL; else @@ -1007,13 +1009,13 @@ static int intel_ntb_link_enable(struct ntb_dev *ntb, if (ndev->ntb.topo == NTB_TOPO_SEC) return -EINVAL; -
[PATCH 1/2] ntb_hw_amd: Style fixes: open code macros that just obfuscate code
As per a comments in [1] by Greg Kroah-Hartman, the ndev_* macros should be cleaned up. This makes it more clear what's actually going on when reading the code. [1] http://www.spinics.net/lists/linux-pci/msg56904.html Signed-off-by: Logan Gunthorpe <log...@deltatee.com> --- drivers/ntb/hw/amd/ntb_hw_amd.c | 59 ++--- drivers/ntb/hw/amd/ntb_hw_amd.h | 3 --- 2 files changed, 31 insertions(+), 31 deletions(-) diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c index 6ccba0d..85a9a4f 100644 --- a/drivers/ntb/hw/amd/ntb_hw_amd.c +++ b/drivers/ntb/hw/amd/ntb_hw_amd.c @@ -98,10 +98,10 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx, return bar; if (base) - *base = pci_resource_start(ndev->ntb.pdev, bar); + *base = pci_resource_start(ntb->pdev, bar); if (size) - *size = pci_resource_len(ndev->ntb.pdev, bar); + *size = pci_resource_len(ntb->pdev, bar); if (align) *align = SZ_4K; @@ -126,7 +126,7 @@ static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx, if (bar < 0) return bar; - mw_size = pci_resource_len(ndev->ntb.pdev, bar); + mw_size = pci_resource_len(ntb->pdev, bar); /* make sure the range fits in the usable mw size */ if (size > mw_size) @@ -135,7 +135,7 @@ static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx, mmio = ndev->self_mmio; peer_mmio = ndev->peer_mmio; - base_addr = pci_resource_start(ndev->ntb.pdev, bar); + base_addr = pci_resource_start(ntb->pdev, bar); if (bar != 1) { xlat_reg = AMD_BAR23XLAT_OFFSET + ((bar - 2) << 3); @@ -226,7 +226,7 @@ static int amd_ntb_link_is_up(struct ntb_dev *ntb, if (width) *width = NTB_LNK_STA_WIDTH(ndev->lnk_sta); - dev_dbg(ndev_dev(ndev), "link is up.\n"); + dev_dbg(>pdev->dev, "link is up.\n"); ret = 1; } else { @@ -235,7 +235,7 @@ static int amd_ntb_link_is_up(struct ntb_dev *ntb, if (width) *width = NTB_WIDTH_NONE; - dev_dbg(ndev_dev(ndev), "link is down.\n"); + dev_dbg(>pdev->dev, "link is down.\n"); } return ret; @@ -255,7 +255,7 @@ static int amd_ntb_link_enable(struct ntb_dev *ntb, if (ndev->ntb.topo == NTB_TOPO_SEC) return -EINVAL; - dev_dbg(ndev_dev(ndev), "Enabling Link.\n"); + dev_dbg(>pdev->dev, "Enabling Link.\n"); ntb_ctl = readl(mmio + AMD_CNTL_OFFSET); ntb_ctl |= (PMM_REG_CTL | SMM_REG_CTL); @@ -276,7 +276,7 @@ static int amd_ntb_link_disable(struct ntb_dev *ntb) if (ndev->ntb.topo == NTB_TOPO_SEC) return -EINVAL; - dev_dbg(ndev_dev(ndev), "Enabling Link.\n"); + dev_dbg(>pdev->dev, "Enabling Link.\n"); ntb_ctl = readl(mmio + AMD_CNTL_OFFSET); ntb_ctl &= ~(PMM_REG_CTL | SMM_REG_CTL); @@ -467,18 +467,19 @@ static void amd_ack_smu(struct amd_ntb_dev *ndev, u32 bit) static void amd_handle_event(struct amd_ntb_dev *ndev, int vec) { void __iomem *mmio = ndev->self_mmio; + struct device *dev = >ntb.pdev->dev; u32 status; status = readl(mmio + AMD_INTSTAT_OFFSET); if (!(status & AMD_EVENT_INTMASK)) return; - dev_dbg(ndev_dev(ndev), "status = 0x%x and vec = %d\n", status, vec); + dev_dbg(dev, "status = 0x%x and vec = %d\n", status, vec); status &= AMD_EVENT_INTMASK; switch (status) { case AMD_PEER_FLUSH_EVENT: - dev_info(ndev_dev(ndev), "Flush is done.\n"); + dev_info(dev, "Flush is done.\n"); break; case AMD_PEER_RESET_EVENT: amd_ack_smu(ndev, AMD_PEER_RESET_EVENT); @@ -502,7 +503,7 @@ static void amd_handle_event(struct amd_ntb_dev *ndev, int vec) status = readl(mmio + AMD_PMESTAT_OFFSET); /* check if this is WAKEUP event */ if (status & 0x1) - dev_info(ndev_dev(ndev), "Wakeup is done.\n"); + dev_info(dev, "Wakeup is done.\n"); amd_ack_smu(ndev, AMD_PEER_D0_EVENT); @@ -511,14 +512,14 @@ static void amd_handle_event(struct amd_ntb_dev *ndev, int vec) AMD_LINK_HB_TIMEOUT); break; default: - dev_info(ndev_dev(ndev), "event status = 0x%x.\n", status); + dev_info(dev, "event status = 0x%x.\n", status);