date:20180607

[PATCH kernel 0/6] powerpc/powernv/iommu: Optimize memory use

2018-06-07 Thread Alexey Kardashevskiy

This patchset aims to reduce actual memory use for guests with
sparse memory. The pseries guest uses dynamic DMA windows to map
the entire guest RAM but it only actually maps onlined memory,
which may not be contiguous. I hit this when I tried passing
through NVLink2-connected GPU RAM of an NVIDIA V100: trying to
map this RAM at the same offset as in the real hardware
forced me to rework how I handle these windows.

This moves the userspace-to-host-physical translation table
(iommu_table::it_userspace) from the VFIO TCE IOMMU subdriver to
the platform code and reuses the already existing multilevel
TCE table code which we have for the hardware tables.
Finally, in 6/6 I switch to on-demand allocation so we do not
allocate huge chunks of the table if we do not have to;
there is some math in 6/6.


Please comment. Thanks.



Alexey Kardashevskiy (6):
  powerpc/powernv: Remove useless wrapper
  powerpc/powernv: Move TCE manupulation code to its own file
  KVM: PPC: Make iommu_table::it_userspace big endian
  powerpc/powernv: Add indirect levels to it_userspace
  powerpc/powernv: Rework TCE level allocation
  powerpc/powernv/ioda: Allocate indirect TCE levels on demand

 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/include/asm/iommu.h  |  11 +-
 arch/powerpc/platforms/powernv/pci.h  |  44 ++-
 arch/powerpc/kvm/book3s_64_vio.c  |  11 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   |  18 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 395 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 192 ++---
 arch/powerpc/platforms/powernv/pci.c  | 158 ---
 drivers/vfio/vfio_iommu_spapr_tce.c   |  65 +
 9 files changed, 482 insertions(+), 414 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c

-- 
2.11.0

[PATCH kernel 5/6] powerpc/powernv: Rework TCE level allocation

2018-06-07 Thread Alexey Kardashevskiy

This moves the actual page allocation to a separate function which is going
to be reused later in on-demand TCE allocation.

While we are at it, remove the unnecessary level size round up as the caller
does this already.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 30 +--
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index f14b282..36c2eb0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -31,6 +31,23 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 	tbl->it_type = TCE_PCI;
 }
 
+static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
+{
+	struct page *tce_mem = NULL;
+	__be64 *addr;
+
+	tce_mem = alloc_pages_node(nid, GFP_KERNEL, shift - PAGE_SHIFT);
+	if (!tce_mem) {
+		pr_err("Failed to allocate a TCE memory, level shift=%d\n",
+				shift);
+		return NULL;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, 1UL << shift);
+
+	return addr;
+}
+
 static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
 {
 	__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
@@ -165,21 +182,12 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
 		unsigned int levels, unsigned long limit,
 		unsigned long *current_offset, unsigned long *total_allocated)
 {
-	struct page *tce_mem = NULL;
 	__be64 *addr, *tmp;
-	unsigned int order = max_t(unsigned int, shift, PAGE_SHIFT) -
-			PAGE_SHIFT;
-	unsigned long allocated = 1UL << (order + PAGE_SHIFT);
+	unsigned long allocated = 1UL << shift;
 	unsigned int entries = 1UL << (shift - 3);
 	long i;
 
-	tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
-	if (!tce_mem) {
-		pr_err("Failed to allocate a TCE memory, order=%d\n", order);
-		return NULL;
-	}
-	addr = page_address(tce_mem);
-	memset(addr, 0, allocated);
+	addr = pnv_alloc_tce_level(nid, shift);
 	*total_allocated += allocated;
 
 	--levels;
-- 
2.11.0

[PATCH kernel 4/6] powerpc/powernv: Add indirect levels to it_userspace

2018-06-07 Thread Alexey Kardashevskiy

We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window is big enough to require 2 or more
indirect levels, and the DMA window is used to map all RAM (which is
the default case for a 64bit window), we can actually save some memory by
not allocating TCEs for regions which we are not going to map anyway.

The hardware tables already support indirect levels but we also keep
a host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.

This converts it_userspace from a vmalloc'ed array to a multilevel table.

As the format becomes platform dependent, this replaces the direct access
to it_userspace with an iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; a future extension will return
NULL if the level was not allocated.

This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h  |  6 +--
 arch/powerpc/platforms/powernv/pci.h  |  3 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   |  8 
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 65 +--
 arch/powerpc/platforms/powernv/pci-ioda.c | 31 ++---
 drivers/vfio/vfio_iommu_spapr_tce.c   | 46 ---
 6 files changed, 81 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 803ac70..4bdcf22 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -69,6 +69,8 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+
+   __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
 #endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -123,9 +125,7 @@ struct iommu_table {
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
-   ((tbl)->it_userspace ? \
-   &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
-   NULL)
+   ((tbl)->it_ops->useraddrptr((tbl), (entry)))
 
 /* Pure 2^n version of get_order */
 static inline __attribute_const__
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index f507baf..5e02408 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -268,11 +268,12 @@ extern int pnv_tce_build(struct iommu_table *tbl, long 
index, long npages,
 extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
 extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction);
+extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
 extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
 
 extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
__u32 page_shift, __u64 window_size, __u32 levels,
-   struct iommu_table *tbl);
+   bool alloc_userspace_copy, struct iommu_table *tbl);
 extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
 
 extern long pnv_pci_link_table_and_group(int node, int num,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 18109f3..db0490c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -206,10 +206,6 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
/* it_userspace allocation might be delayed */
return H_TOO_HARD;
 
-   pua = (void *) vmalloc_to_phys(pua);
-   if (WARN_ON_ONCE_RM(!pua))
-   return H_HARDWARE;
-
mem = mm_iommu_lookup_rm(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
@@ -282,10 +278,6 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, 
struct iommu_table *tbl,
if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
return H_HARDWARE;
 
-   pua = (void *) vmalloc_to_phys(pua);
-   if (WARN_ON_ONCE_RM(!pua))
-   return H_HARDWARE;
-
if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
return H_CLOSED;
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 700ceb1..f14b282 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -31,9 +31,9 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
 }
 
-static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
 {
-   __be64 *tmp =

[PATCH kernel 3/6] KVM: PPC: Make iommu_table::it_userspace big endian

2018-06-07 Thread Alexey Kardashevskiy

We are going to reuse multilevel TCE code for the userspace copy of
the TCE table and since it is big endian, let's make the copy big endian
too.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h|  2 +-
 arch/powerpc/kvm/book3s_64_vio.c| 11 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 19 +--
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 20febe0..803ac70 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -117,7 +117,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
struct list_head it_group_list;/* List of iommu_table_group_link */
-   unsigned long *it_userspace; /* userspace view of the table */
+   __be64 *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
struct kref it_kref;
 };
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 80ead38..1dbca4b 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -378,19 +378,19 @@ static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
 {
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
-   unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
 
if (!pua)
/* it_userspace allocation might be delayed */
return H_TOO_HARD;
 
-   mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+   mem = mm_iommu_lookup(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
 
mm_iommu_mapped_dec(mem);
 
-   *pua = 0;
+   *pua = cpu_to_be64(0);
 
return H_SUCCESS;
 }
@@ -437,7 +437,8 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct 
iommu_table *tbl,
enum dma_data_direction dir)
 {
long ret;
-   unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   unsigned long hpa;
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
 
if (!pua)
@@ -464,7 +465,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct 
iommu_table *tbl,
if (dir != DMA_NONE)
kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
 
-   *pua = ua;
+   *pua = cpu_to_be64(ua);
 
return 0;
 }
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 635f3ca..18109f3 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
 {
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
-   unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
 
if (!pua)
/* it_userspace allocation might be delayed */
@@ -210,13 +210,13 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm 
*kvm,
if (WARN_ON_ONCE_RM(!pua))
return H_HARDWARE;
 
-   mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+   mem = mm_iommu_lookup_rm(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
 
mm_iommu_mapped_dec(mem);
 
-   *pua = 0;
+   *pua = cpu_to_be64(0);
 
return H_SUCCESS;
 }
@@ -268,7 +268,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, 
struct iommu_table *tbl,
 {
long ret;
unsigned long hpa = 0;
-   unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
 
if (!pua)
@@ -302,7 +302,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, 
struct iommu_table *tbl,
if (dir != DMA_NONE)
kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
 
-   *pua = ua;
+   *pua = cpu_to_be64(ua);
 
return 0;
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 47071f3..81f48114 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -231,7 +231,7 @@ static long tce_iommu_userspace_view_alloc(struct 
iommu_table *tbl,
decrement_locked_vm(mm, cb >> PAGE_SHIFT);
return -ENOMEM;
}
-   tbl->it_userspace = uas;
+   tbl->it_userspace = (__be64 *) uas;
 
return 0;
 }
@@ -494,20 +494,20 @@ static void tce_iommu_unuse_page_v2(struct tce_container 
*container,

[PATCH kernel 2/6] powerpc/powernv: Move TCE manupulation code to its own file

2018-06-07 Thread Alexey Kardashevskiy

Right now we have allocation code in pci-ioda.c and traversing code in
pci.c, let's keep them together. However both files are big enough
already so let's move this business to a new file.

While we are at it, move the code which links IOMMU table groups to
IOMMU tables as it is not specific to any PNV PHB model.

This puts exported symbols from the new file together.

This fixes several warnings from checkpatch.pl like this:
"WARNING: Prefer 'unsigned int' to bare use of 'unsigned'".

As this is almost cut-n-paste, there should be no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/pci.h  |  41 ++--
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 313 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 146 
 arch/powerpc/platforms/powernv/pci.c  | 158 -
 5 files changed, 340 insertions(+), 320 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c

diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 703a350..b540ce8e 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,7 +6,7 @@ obj-y   += opal-msglog.o opal-hmi.o 
opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o 
opal-sensor-groups.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
-obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
+obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
 obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 1408247..f507baf 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -202,13 +202,6 @@ struct pnv_phb {
 };
 
 extern struct pci_ops pnv_pci_ops;
-extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
-   unsigned long uaddr, enum dma_data_direction direction,
-   unsigned long attrs);
-extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
-extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
-   unsigned long *hpa, enum dma_data_direction *direction);
-extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
@@ -218,14 +211,6 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
  int where, int size, u32 val);
 extern struct iommu_table *pnv_pci_table_alloc(int nid);
 
-extern long pnv_pci_link_table_and_group(int node, int num,
-   struct iommu_table *tbl,
-   struct iommu_table_group *table_group);
-extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
-   struct iommu_table_group *table_group);
-extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
- void *tce_mem, u64 tce_size,
- u64 dma_offset, unsigned page_shift);
 extern void pnv_pci_init_ioda_hub(struct device_node *np);
 extern void pnv_pci_init_ioda2_phb(struct device_node *np);
 extern void pnv_pci_init_npu_phb(struct device_node *np);
@@ -273,4 +258,30 @@ extern void pnv_cxl_cx4_teardown_msi_irqs(struct pci_dev 
*pdev);
 /* phb ops (cxl switches these when enabling the kernel api on the phb) */
 extern const struct pci_controller_ops pnv_cxl_cx4_ioda_controller_ops;
 
+/* pci-ioda-tce.c */
+#define POWERNV_IOMMU_DEFAULT_LEVELS   1
+#define POWERNV_IOMMU_MAX_LEVELS   5
+
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+   unsigned long uaddr, enum dma_data_direction direction,
+   unsigned long attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum dma_data_direction *direction);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+
+extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+   __u32 page_shift, __u64 window_size, __u32 levels,
+   struct iommu_table *tbl);
+extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
+
+extern long pnv_pci_link_table_and_group(int node, int num,
+   struct iommu_table *tbl,
+   struct iommu_table_group *table_group);
+extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
+   struct iommu_table_group *table_group);
+extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
+   void *tce_mem, u64 tce_size,
+   u64 dma_offset, unsigned int page_shift);
+

[PATCH kernel 1/6] powerpc/powernv: Remove useless wrapper

2018-06-07 Thread Alexey Kardashevskiy

This gets rid of a useless wrapper around
pnv_pci_ioda2_table_free_pages().

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 29f798c..d4c60b6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2206,11 +2206,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 	pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
 }
 
-static void pnv_ioda2_table_free(struct iommu_table *tbl)
-{
-	pnv_pci_ioda2_table_free_pages(tbl);
-}
-
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
@@ -2219,7 +2214,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-	.free = pnv_ioda2_table_free,
+	.free = pnv_pci_ioda2_table_free_pages,
 };
 
 static int pnv_pci_ioda_dev_dma_weight(struct pci_dev *dev, void *data)
-- 
2.11.0

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt  wrote:
> >>> 
>  On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> >
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
>  Not entire. On POWER chips, we also have an nvlink between the device
>  and the CPU which is running significantly faster than PCIe.
> 
>  But yes, there are cross-links and those should probably be accounted
>  for in the grouping.
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.  
> > 
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either.  That's why we have mdev devices now to implement software
> > defined devices.  I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> 
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?

If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.

> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).  
> > 
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation?  Thanks,  
> 
> Well, we already believe that a proprietary firmware of a sriov-capable
> adapter like Mellanox ConnectX is not doing bad things, how is this
> different in principle?

It seems like the scope and hierarchy are different.  Here we're
talking about exposing big discrete devices, which are peers of one
another (and have history of being reverse engineered), to userspace
drivers.  Once handed to userspace, each of those devices needs to be
considered untrusted.  In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs.  We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF, but we also often have source code for Linux
drivers for these devices and sometimes even datasheets.  Here we have
neither of those and perhaps we won't know the extent of the lack of
isolation between these devices until nouveau (best case) or some
exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
of isolation between devices unless the hardware provides some
indication that isolation exists, for example ACS on PCIe.  If NVIDIA
wants to expose isolation on NVLink, perhaps they need to document
enough of it that the host kernel can manipulate and test for isolation,
perhaps even enabling virtualization of the

RE: [v3, 00/10] Support DPAA PTP clock and timestamping

2018-06-07 Thread Y.b. Lu

> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Friday, June 8, 2018 12:27 PM
> To: Y.b. Lu 
> Cc: net...@vger.kernel.org; Madalin-cristian Bucur
> ; Rob Herring ; Shawn Guo
> ; David S . Miller ;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org
> Subject: Re: [v3, 00/10] Support DPAA PTP clock and timestamping
> 
> On Thu, Jun 07, 2018 at 05:20:40PM +0800, Yangbo Lu wrote:
> > This patchset is to support DPAA FMAN PTP clock and HW timestamping.
> > It had been verified on both ARM platform and PPC platform.
> > - The patch #1 to patch #5 are to support DPAA FMAN 1588 timer in
> >   ptp_qoriq driver.
> > - The patch #6 to patch #10 are to add HW timestamping support in
> >   DPAA ethernet driver.
> 
> Right now, net-next is closed for new stuff.  You will have to post the series
> again after the merge window closes.  You can check the status here:
> 
> 
> http://vger.kernel.org/~davem/net-next.html
> 
> When you do re-post, you can add my:
> 
> Acked-by: Richard Cochran 

[Y.b. Lu] Got it. Thanks a lot.

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson

On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy  wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy  wrote:  
>  diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>  index 7bddf1e..38c9475 100644
>  --- a/drivers/vfio/pci/vfio_pci.c
>  +++ b/drivers/vfio/pci/vfio_pci.c
>  @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
>  *vdev)
>   }
>   }
>   
>  +if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>  +pdev->device == 0x1db1 &&
>  +IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs :03:00.0
> >> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >>Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >>Flags: fast devsel, IRQ 497
> >>Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >>Subsystem: NVIDIA Corporation Device 116b
> >>Flags: bus master, fast devsel, latency 0, IRQ 457
> >>Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
> >>Memory at 2600 (64-bit, prefetchable) [size=16G]
> >>Memory at 2604 (64-bit, prefetchable) [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>Kernel driver in use: nvidia
> >>Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >>Subsystem: NVIDIA Corporation Device 1212
> >>Flags: fast devsel, IRQ 82, NUMA node 8
> >>Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> >> 
> >>Capabilities: [900] #19
> >>Capabilities: [ac0] #23
> >>Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects.  Sounds rather promising.  Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.  
> 
> 
> This 23 is

Re: [v3, 00/10] Support DPAA PTP clock and timestamping

2018-06-07 Thread Richard Cochran

On Thu, Jun 07, 2018 at 05:20:40PM +0800, Yangbo Lu wrote:
> This patchset is to support DPAA FMAN PTP clock and HW timestamping.
> It had been verified on both ARM platform and PPC platform.
> - The patch #1 to patch #5 are to support DPAA FMAN 1588 timer in
>   ptp_qoriq driver.
> - The patch #6 to patch #10 are to add HW timestamping support in
>   DPAA ethernet driver.

Right now, net-next is closed for new stuff.  You will have to post
the series again after the merge window closes.  You can check the
status here:

http://vger.kernel.org/~davem/net-next.html

When you do re-post, you can add my:

Acked-by: Richard Cochran

Re: [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test

2018-06-07 Thread David Gibson

On Thu, Jun 07, 2018 at 06:44:16PM +1000, Alexey Kardashevskiy wrote:
> The test function takes a page struct pointer which is not used by
> either of two callers in any other way, make it simple and just pass
> a physical address there.
> 
> This should cause no behavioral change now but later we may start
> supporting host addresses for memory devices which are not backed
> with page structs.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 759a5bd..2c4a048 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct 
> iommu_table *tbl,
>   decrement_locked_vm(mm, cb >> PAGE_SHIFT);
>  }
>  
> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
>  {
> + struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
>   /*
>* Check that the TCE table granularity is not bigger than the size of
>* a page we just found. Otherwise the hardware can get access to
> @@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container 
> *container,
>   enum dma_data_direction direction)
>  {
>   long i, ret = 0;
> - struct page *page;
>   unsigned long hpa;
>   enum dma_data_direction dirtmp;
>  
> @@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container 
> *container,
>   if (ret)
>   break;
>  
> - page = pfn_to_page(hpa >> PAGE_SHIFT);
> - if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>   ret = -EPERM;
>   break;
>   }
> @@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container 
> *container,
>   enum dma_data_direction direction)
>  {
>   long i, ret = 0;
> - struct page *page;
>   unsigned long hpa;
>   enum dma_data_direction dirtmp;
>  
> @@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container 
> *container,
>   if (ret)
>   break;
>  
> - page = pfn_to_page(hpa >> PAGE_SHIFT);
> - if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>   ret = -EPERM;
>   break;
>   }
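
The containment test touched by this patch can be modelled in userspace roughly as follows. This is only a sketch: the body of tce_page_is_contained() is not shown in the quoted hunk, so the comparison below (base page shift plus the order of the backing page, against the TCE table page shift) is an assumption, and sketch_backing_order() is a hypothetical stand-in for whatever the kernel derives from the page struct.

```c
#include <stdbool.h>

/* Assumption: 64K base pages, as is common on POWER hosts. */
#define SKETCH_PAGE_SHIFT 16u

/*
 * Hypothetical stand-in for looking up the order of the (possibly huge)
 * page backing a host physical address. Here everything at or above
 * 1 GiB is pretended to be backed by 16M pages (order 8 on 64K pages).
 */
static unsigned int sketch_backing_order(unsigned long hpa)
{
	return hpa >= (1ul << 30) ? 8u : 0u;
}

/*
 * True when one TCE entry of size (1 << tce_page_shift) cannot reach
 * beyond the memory chunk that actually backs hpa.
 */
static bool sketch_page_is_contained(unsigned long hpa,
				     unsigned int tce_page_shift)
{
	return SKETCH_PAGE_SHIFT + sketch_backing_order(hpa) >= tce_page_shift;
}
```

Taking the host physical address instead of a page struct pointer keeps the callers free of pfn_to_page(), which is the point of the patch; how memory without page structs would be handled is left open here, as it is in the commit message.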

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy

On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy  wrote:
> 
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt  wrote:
>>>   
>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>>>
>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>> be to provide a high performance link for p2p between devices?
>>>>
>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>> and the CPU which is running significantly faster than PCIe.
>>>>
>>>> But yes, there are cross-links and those should probably be accounted
>>>> for in the grouping.
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>> convince NVIDIA to support more than a single vGPU per VM though)  
>>
>> These are physical GPUs, not virtual sriov-alike things they are
>> implementing as well elsewhere.
> 
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either.  That's why we have mdev devices now to implement software
> defined devices.  I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

So assuming V100 can do vGPU, you are suggesting ditching this patchset and
using mediated vGPUs instead, correct?


>> My current understanding is that every P9 chip in that box has some NVLink2
>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>> as well.
>>
>> From small bits of information I have it seems that a GPU can perfectly
>> work alone and if the NVIDIA driver does not see these interconnects
>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>> which simply refuses to work until all 3 GPUs are passed so there is some
>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
> 
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation?  Thanks,

Well, we already believe that a proprietary firmware of a sriov-capable
adapter like Mellanox ConnectX is not doing bad things, how is this
different in principle?


ps. their obfuscation is funny indeed :)
-- 
Alexey

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alexey Kardashevskiy

On 8/6/18 1:35 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:09:13 +1000
> Alexey Kardashevskiy  wrote:
>> On 8/6/18 3:04 am, Alex Williamson wrote:
>>> On Thu,  7 Jun 2018 18:44:20 +1000
>>> Alexey Kardashevskiy  wrote:
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index 7bddf1e..38c9475 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
>>>> *vdev)
>>>>    }
>>>>    }
>>>>  
>>>> +  if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>>>> +  pdev->device == 0x1db1 &&
>>>> +  IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>>>
>>> Can't we do better than check this based on device ID?  Perhaps PCIe
>>> capability hints at this?  
>>
>> A normal PCI pluggable device looks like this:
>>
>> root@fstn3:~# sudo lspci -vs :03:00.0
>> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>>  Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>>  Flags: fast devsel, IRQ 497
>>  Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
>>  Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
>>  Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>
>>
>> This is a NVLink v1 machine:
>>
>> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
>> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>>  Subsystem: NVIDIA Corporation Device 116b
>>  Flags: bus master, fast devsel, latency 0, IRQ 457
>>  Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
>>  Memory at 2600 (64-bit, prefetchable) [size=16G]
>>  Memory at 2604 (64-bit, prefetchable) [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [250] Latency Tolerance Reporting
>>  Capabilities: [258] L1 PM Substates
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>  Kernel driver in use: nvidia
>>  Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
>>
>>
>> This is the one the patch is for:
>>
>> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
>> (rev a1)
>>  Subsystem: NVIDIA Corporation Device 1212
>>  Flags: fast devsel, IRQ 82, NUMA node 8
>>  Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
>>  Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
>>  Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
>>  Capabilities: [60] Power Management version 3
>>  Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>>  Capabilities: [78] Express Endpoint, MSI 00
>>  Capabilities: [100] Virtual Channel
>>  Capabilities: [250] Latency Tolerance Reporting
>>  Capabilities: [258] L1 PM Substates
>>  Capabilities: [128] Power Budgeting 
>>  Capabilities: [420] Advanced Error Reporting
>>  Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
>> 
>>  Capabilities: [900] #19
>>  Capabilities: [ac0] #23
>>  Kernel driver in use: vfio-pci
>>
>>
>> I can only see a new capability #23 which I have no idea about what it
>> actually does - my latest PCIe spec is
>> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
>> till #21, do you have any better spec? Does not seem promising anyway...
> 
> You could just look in include/uapi/linux/pci_regs.h and see that 23
> (0x17) is a TPH Requester capability and google for that...  It's a TLP
> processing hint related to cache processing for requests from system
> specific interconnects.  Sounds rather promising.  Of course there's
> also the vendor specific capability that might be probed if NVIDIA will
> tell you what to look for and the init function you've implemented
> looks for specific devicetree nodes, that I imagine you could test for
> in a probe as well.


This 23 is in hex:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
Subsystem: NVIDIA Corporation Device 1212
Flags: fast devsel, IRQ 82, NUMA

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Fri, 8 Jun 2018 13:08:54 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 8:15 am, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> >> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> >>>
> >>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>> connected devices makes sense?  AIUI we have a PCI view of these
> >>> devices and from that perspective they're isolated.  That's the view of
> >>> the device used to generate the grouping.  However, not visible to us,
> >>> these devices are interconnected via NVLink.  What isolation properties
> >>> does NVLink provide given that its entire purpose for existing seems to
> >>> be to provide a high performance link for p2p between devices?
> >>
> >> Not entire. On POWER chips, we also have an nvlink between the device
> >> and the CPU which is running significantly faster than PCIe.
> >>
> >> But yes, there are cross-links and those should probably be accounted
> >> for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)  
> 
> These are physical GPUs, not virtual sriov-alike things they are
> implementing as well elsewhere.

vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
either.  That's why we have mdev devices now to implement software
defined devices.  I don't have first hand experience with V-series, but
I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

> My current understanding is that every P9 chip in that box has some NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> as well.
> 
> From small bits of information I have it seems that a GPU can perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> which simply refuses to work until all 3 GPUs are passed so there is some
> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> 
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).

I'm not gaining much confidence that we can rely on isolation between
NVLink connected GPUs, it sounds like you're simply expecting that
proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
is going to play nice and nobody will figure out how to do bad things
because... obfuscation?  Thanks,

Alex

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson

On Fri, 8 Jun 2018 13:09:13 +1000
Alexey Kardashevskiy  wrote:
> On 8/6/18 3:04 am, Alex Williamson wrote:
> > On Thu,  7 Jun 2018 18:44:20 +1000
> > Alexey Kardashevskiy  wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7bddf1e..38c9475 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
> >> *vdev)
> >>}
> >>}
> >>  
> >> +  if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >> +  pdev->device == 0x1db1 &&
> >> +  IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
> > 
> > Can't we do better than check this based on device ID?  Perhaps PCIe
> > capability hints at this?  
> 
> A normal PCI pluggable device looks like this:
> 
> root@fstn3:~# sudo lspci -vs :03:00.0
> :03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>   Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>   Flags: fast devsel, IRQ 497
>   Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
> 
> 
> This is a NVLink v1 machine:
> 
> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>   Subsystem: NVIDIA Corporation Device 116b
>   Flags: bus master, fast devsel, latency 0, IRQ 457
>   Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
>   Memory at 2600 (64-bit, prefetchable) [size=16G]
>   Memory at 2604 (64-bit, prefetchable) [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
>   Kernel driver in use: nvidia
>   Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> 
> 
> This is the one the patch is for:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
>   Subsystem: NVIDIA Corporation Device 1212
>   Flags: fast devsel, IRQ 82, NUMA node 8
>   Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
> 
>   Capabilities: [900] #19
>   Capabilities: [ac0] #23
>   Kernel driver in use: vfio-pci
> 
> 
> I can only see a new capability #23 which I have no idea about what it
> actually does - my latest PCIe spec is
> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> till #21, do you have any better spec? Does not seem promising anyway...

You could just look in include/uapi/linux/pci_regs.h and see that 23
(0x17) is a TPH Requester capability and google for that...  It's a TLP
processing hint related to cache processing for requests from system
specific interconnects.  Sounds rather promising.  Of course there's
also the vendor specific capability that might be probed if NVIDIA will
tell you what to look for and the init function you've implemented
looks for specific devicetree nodes, that I imagine you could test for
in a probe as well.
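
For reference, probing for an extended capability by ID comes down to walking a linked list of 32-bit headers that starts at config space offset 0x100. The sketch below does this over an in-memory snapshot of config space rather than through the kernel's PCI accessors; all sketch_* names are hypothetical. Note that the ID to search for depends on how the "#23" in the lspci output is read: 0x17 if it is decimal (the TPH Requester ID named above), 0x23 if it is hexadecimal (which recent PCIe specifications assign to the Designated Vendor-Specific Extended Capability).

```c
#include <stdint.h>

#define SKETCH_CFG_SIZE      4096u	/* PCIe extended config space size */
#define SKETCH_EXT_CAP_START 0x100u	/* first extended capability header */

/* Read a little-endian 32-bit value from a config space snapshot. */
static uint32_t sketch_cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Store a little-endian 32-bit value; handy for building test snapshots. */
static void sketch_cfg_write32(uint8_t *cfg, unsigned int off, uint32_t val)
{
	cfg[off] = (uint8_t)val;
	cfg[off + 1] = (uint8_t)(val >> 8);
	cfg[off + 2] = (uint8_t)(val >> 16);
	cfg[off + 3] = (uint8_t)(val >> 24);
}

/*
 * Return the offset of the extended capability with the given ID,
 * or 0 if the snapshot does not contain it.
 * Header layout: bits 15:0 ID, bits 19:16 version, bits 31:20 next offset.
 */
static unsigned int sketch_find_ext_cap(const uint8_t *cfg, unsigned int id)
{
	unsigned int off = SKETCH_EXT_CAP_START;
	/* Each capability is at least 8 bytes; this bounds a looping list. */
	int ttl = (SKETCH_CFG_SIZE - SKETCH_EXT_CAP_START) / 8;

	while (off >= SKETCH_EXT_CAP_START &&
	       off <= SKETCH_CFG_SIZE - 4 && ttl-- > 0) {
		uint32_t hdr = sketch_cfg_read32(cfg, off);

		if (hdr == 0 || hdr == 0xffffffffu)
			break;			/* empty or unreadable space */
		if ((hdr & 0xffffu) == id)
			return off;
		off = (hdr >> 20) & 0xffcu;	/* next pointer, dword aligned */
	}
	return 0;
}
```

In the kernel itself the equivalent lookup is pci_find_ext_capability(); the snapshot-based version is used here only so the walk can be followed without a device.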

> > Is it worthwhile to continue with assigning the device in the !ENABLED
> > case?  For instance, maybe it would be better to provide a weak
> > definition of vfio_pci_nvlink2_init() that would cause us to fail here
> > if we don't have this device specific support enabled.  I realize
> > you're following the example

[PATCH] powerpc: Don't let userspace trigger a kernel WARN_ON

2018-06-07 Thread Benjamin Herrenschmidt

In commit 2865d08dd9ea876524652f3900b4b3b9c8b22e77
"powerpc/mm: Move the DSISR_PROTFAULT sanity check",
I completely missed the fact that an attempt at reading
kernel memory *will* trip the warning.

So this partially reverts it. We keep the test in a
helper to keep the code clean, but we move it back to
after the VMA has been found.

Signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c01d627e687a..20384445ca44 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -416,9 +416,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
return SIGBUS;
}
 
-   /* Additional sanity check(s) */
-   sanity_check_fault(is_write, error_code);
-
/*
 * The kernel should never take an execute fault nor should it
 * take a page fault to a kernel address.
@@ -511,6 +508,10 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
return bad_area(regs, address);
 
 good_area:
+   /* Additional sanity check(s) */
+   sanity_check_fault(is_write, error_code);
+
+   /* Check for VMA access permissions */
if (unlikely(access_error(is_write, is_exec, vma)))
return bad_access(regs, address);

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alexey Kardashevskiy

On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy  wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on NVIDIA GPU.
>> This memory is presented to a host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This exports this RAM to the userspace as a new region so
>> the NVIDIA driver in the guest can train these links and online GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  drivers/vfio/pci/Makefile   |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h   |   3 +
>>  drivers/vfio/pci/vfio_pci.c |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
>> 
>>  drivers/vfio/pci/Kconfig|   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
>> b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct 
>> vfio_pci_device *vdev)
>>  return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG  (2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG   (3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2  (4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)
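
To illustrate the point about per-vendor sub-type spaces, here is a small standalone sketch. The macro values are assumptions modelled on the PCI vendor region type encoding in include/uapi/linux/vfio.h (a flag bit plus the 16-bit PCI vendor ID); the struct and helper names are hypothetical, and the NVIDIA sub-type value 1 is just what a restarted numbering would give, not a value taken from a merged patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the uapi encoding: flag bit + PCI vendor ID. */
#define SKETCH_REGION_TYPE_PCI_VENDOR_TYPE (1u << 31)
#define SKETCH_REGION_TYPE_PCI_VENDOR_MASK 0xffffu

#define SKETCH_PCI_VENDOR_INTEL  0x8086u
#define SKETCH_PCI_VENDOR_NVIDIA 0x10deu

/* A region is identified by the (type, subtype) pair, not by subtype alone. */
struct sketch_region_id {
	uint32_t type;
	uint32_t subtype;
};

static struct sketch_region_id sketch_vendor_region(uint16_t vendor,
						    uint32_t subtype)
{
	struct sketch_region_id id;

	id.type = SKETCH_REGION_TYPE_PCI_VENDOR_TYPE |
		  (vendor & SKETCH_REGION_TYPE_PCI_VENDOR_MASK);
	id.subtype = subtype;
	return id;
}

static bool sketch_region_equal(struct sketch_region_id a,
				struct sketch_region_id b)
{
	return a.type == b.type && a.subtype == b.subtype;
}
```

Because the vendor ID is part of the type, an NVIDIA sub-type 1 never collides with an Intel sub-type 1, so the NVIDIA numbering does not have to continue from the Intel IGD values.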

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
>> mmapped
>>   * which allows direct access to non-MSIX registers which happened to be 
>> within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  }
>>  }
>>  
>> +if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +pdev->device == 0x1db1 &&
>> +IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs :03:00.0
:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Flags: fast devsel, IRQ 497
Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting 
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 

Capabilities: [900] #19


This is a NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
Subsystem: NVIDIA Corporation Device 116b
Flags: bus master, fast devsel, latency 0, IRQ 457
Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
Memory at 2600 (64-bit, prefetchable) [size=16G]
Memory at 2604 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy

On 8/6/18 8:15 am, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt  wrote:
> 
>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>
>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>> connected devices makes sense?  AIUI we have a PCI view of these
>>> devices and from that perspective they're isolated.  That's the view of
>>> the device used to generate the grouping.  However, not visible to us,
>>> these devices are interconnected via NVLink.  What isolation properties
>>> does NVLink provide given that its entire purpose for existing seems to
>>> be to provide a high performance link for p2p between devices?  
>>
>> Not entire. On POWER chips, we also have an nvlink between the device
>> and the CPU which is running significantly faster than PCIe.
>>
>> But yes, there are cross-links and those should probably be accounted
>> for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)

These are physical GPUs, not virtual sriov-alike things they are
implementing as well elsewhere.

My current understanding is that every P9 chip in that box has some NVLink2
logic on it so each P9 is directly connected to 3 GPUs via PCIe and
2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
as well.

From small bits of information I have it seems that a GPU can perfectly
work alone and if the NVIDIA driver does not see these interconnects
(because we do not pass the rest of the big 3xGPU group to this guest), it
continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
which simply refuses to work until all 3 GPUs are passed so there is some
distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
get a confirmation from NVIDIA that it is ok to pass just a single GPU.

So we will either have 6 groups (one per GPU) or 2 groups (one per
interconnected group).
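
The "6 groups or 2 groups" arithmetic can be sketched as a connected-components count over the GPU-to-GPU links. This is purely illustrative: the topology (GPUs 0-2 interconnected, GPUs 3-5 interconnected) is an assumption based on the description above, the sketch_* names are hypothetical, and nothing here reflects how the kernel actually builds IOMMU groups.

```c
#include <stddef.h>

#define SKETCH_MAX_GPUS 16

/* One GPU-to-GPU NVLink2 connection (hypothetical representation). */
struct sketch_link {
	int a;
	int b;
};

/* Union-find root lookup with path halving. */
static int sketch_find_root(int *parent, int x)
{
	while (parent[x] != x) {
		parent[x] = parent[parent[x]];
		x = parent[x];
	}
	return x;
}

/*
 * Count groups when every set of interconnected GPUs must share a group.
 * Passing nlinks == 0 models "ignore the interconnect": one group per GPU.
 * Returns -1 on out-of-range input.
 */
static int sketch_count_groups(int ngpus, const struct sketch_link *links,
			       size_t nlinks)
{
	int parent[SKETCH_MAX_GPUS];
	int groups, g;
	size_t i;

	if (ngpus < 0 || ngpus > SKETCH_MAX_GPUS)
		return -1;
	for (g = 0; g < ngpus; g++)
		parent[g] = g;
	groups = ngpus;

	for (i = 0; i < nlinks; i++) {
		int ra, rb;

		if (links[i].a < 0 || links[i].a >= ngpus ||
		    links[i].b < 0 || links[i].b >= ngpus)
			return -1;
		ra = sketch_find_root(parent, links[i].a);
		rb = sketch_find_root(parent, links[i].b);
		if (ra != rb) {
			parent[ra] = rb;	/* merge two groups */
			groups--;
		}
	}
	return groups;
}
```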


-- 
Alexey

Re: pkeys on POWER: Access rights not reset on execve

2018-06-07 Thread Ram Pai

> 
> So the remaining question at this point is whether the Intel
> behavior (default-deny instead of default-allow) is preferable.

Florian, remind me what behavior needs to be fixed?

-- 
Ram Pai

[PATCH v2 09/12] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver

2018-06-07 Thread Finn Thain

Now that the PowerMac via-pmu driver supports m68k PowerBooks,
switch over to that driver and remove the via-pmu68k driver.

Cc: Geert Uytterhoeven 
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 arch/m68k/configs/mac_defconfig   |   2 +-
 arch/m68k/configs/multi_defconfig |   2 +-
 arch/m68k/mac/config.c|   2 +-
 arch/m68k/mac/misc.c  |  48 +--
 drivers/macintosh/Kconfig |  13 +-
 drivers/macintosh/Makefile|   1 -
 drivers/macintosh/adb.c   |   2 +-
 drivers/macintosh/via-pmu68k.c| 846 --
 include/uapi/linux/pmu.h  |   1 -
 9 files changed, 13 insertions(+), 904 deletions(-)
 delete mode 100644 drivers/macintosh/via-pmu68k.c

diff --git a/arch/m68k/configs/mac_defconfig b/arch/m68k/configs/mac_defconfig
index 390d4a87441c..ee63f1242e9a 100644
--- a/arch/m68k/configs/mac_defconfig
+++ b/arch/m68k/configs/mac_defconfig
@@ -370,7 +370,7 @@ CONFIG_TCM_PSCSI=m
 CONFIG_ADB=y
 CONFIG_ADB_MACII=y
 CONFIG_ADB_IOP=y
-CONFIG_ADB_PMU68K=y
+CONFIG_ADB_PMU=y
 CONFIG_ADB_CUDA=y
 CONFIG_INPUT_ADBHID=y
 CONFIG_MAC_EMUMOUSEBTN=y
diff --git a/arch/m68k/configs/multi_defconfig 
b/arch/m68k/configs/multi_defconfig
index 77be97d82dc3..6421a3da616c 100644
--- a/arch/m68k/configs/multi_defconfig
+++ b/arch/m68k/configs/multi_defconfig
@@ -403,7 +403,7 @@ CONFIG_TCM_PSCSI=m
 CONFIG_ADB=y
 CONFIG_ADB_MACII=y
 CONFIG_ADB_IOP=y
-CONFIG_ADB_PMU68K=y
+CONFIG_ADB_PMU=y
 CONFIG_ADB_CUDA=y
 CONFIG_INPUT_ADBHID=y
 CONFIG_MAC_EMUMOUSEBTN=y
diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c
index e522307db47c..92e80cf0d8aa 100644
--- a/arch/m68k/mac/config.c
+++ b/arch/m68k/mac/config.c
@@ -891,7 +891,7 @@ static void __init mac_identify(void)
 #ifdef CONFIG_ADB_CUDA
find_via_cuda();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
find_via_pmu();
 #endif
 }
diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index 7ccb799eeb57..28090a44fa09 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -85,7 +85,7 @@ static void cuda_write_pram(int offset, __u8 data)
 }
 #endif /* CONFIG_ADB_CUDA */
 
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 static long pmu_read_time(void)
 {
struct adb_request req;
@@ -136,7 +136,7 @@ static void pmu_write_pram(int offset, __u8 data)
while (!req.complete)
pmu_poll();
 }
-#endif /* CONFIG_ADB_PMU68K */
+#endif /* CONFIG_ADB_PMU */
 
 /*
  * VIA PRAM/RTC access routines
@@ -367,38 +367,6 @@ static void cuda_shutdown(void)
 }
 #endif /* CONFIG_ADB_CUDA */
 
-#ifdef CONFIG_ADB_PMU68K
-
-void pmu_restart(void)
-{
-   struct adb_request req;
-   if (pmu_request(&req, NULL,
-   2, PMU_SET_INTR_MASK, PMU_INT_ADB|PMU_INT_TICK) < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-   if (pmu_request(&req, NULL, 1, PMU_RESET) < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-}
-
-void pmu_shutdown(void)
-{
-   struct adb_request req;
-   if (pmu_request(&req, NULL,
-   2, PMU_SET_INTR_MASK, PMU_INT_ADB|PMU_INT_TICK) < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-   if (pmu_request(&req, NULL, 5, PMU_SHUTDOWN, 'M', 'A', 'T', 'T') < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-}
-
-#endif
-
 /*
  *---
  * Below this point are the generic routines; they'll dispatch to the
@@ -423,7 +391,7 @@ void mac_pram_read(int offset, __u8 *buffer, int len)
func = cuda_read_pram;
break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
func = pmu_read_pram;
break;
@@ -453,7 +421,7 @@ void mac_pram_write(int offset, __u8 *buffer, int len)
func = cuda_write_pram;
break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
func = pmu_write_pram;
break;
@@ -477,7 +445,7 @@ void mac_poweroff(void)
   macintosh_config->adb_type == MAC_ADB_CUDA) {
cuda_shutdown();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
pmu_shutdown();
 #endif
@@ -518,7 +486,7 @@ void mac_reset(void)
   macintosh_config->adb_type == MAC_ADB_CUDA) {
cuda_restart();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
pmu_restart();
 #endif
@@ -670,7 +638,7 @@ int mac_hwclk(int op, struct rtc_time *t)
now = cuda_read_time();
break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
now =

[PATCH v2 12/12] macintosh/via-pmu: Disambiguate interrupt statistics

2018-06-07 Thread Finn Thain

Some of the event counters are overloaded which makes it very
difficult to interpret their values.

Counter 0 is supposed to report CB1 interrupts but it can also count
PMU_INT_WAITING_CHARGER events.

Counter 1 is supposed to report GPIO interrupts but it can also count
other events (depending upon the value of the PMU_INT_ADB bit).

Disambiguate these statistics with dedicated counters for GPIO and
CB1 interrupts.

Comments in the MkLinux source code say that the type 0 and type 1
interrupts are model-specific. Label them as "unknown".

This change to the contents of /proc/pmu/interrupts is by necessity
visible in userland. However, packages which interact with the PMU
(that is, pbbuttonsd, pmac-utils and pmud) don't open this file.
AFAIK, user software has no need to poll these counters.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
The file now looks like this,

  0:  0 (Unknown interrupt (type 0))
  1:  0 (Unknown interrupt (type 1))
  2:  0 (PC-Card eject button)
  3: 23 (Sound/Brightness button)
  4: 74 (ADB message)
  5:  0 (Battery state change)
  6:  0 (Environment interrupt)
  7:121 (Tick timer)
  8:  0 (Ghost interrupt (zero len))
  9:  1 (Empty interrupt (empty mask))
 10:  2 (Max irqs in a row)
 11:194 (Total CB1 triggered events)
 12:  0 (Total GPIO1 triggered events)

rather than this,

  0:194 (Total CB1 triggered events)
  1:  0 (Total GPIO1 triggered events)
  2:  0 (PC-Card eject button)
  3: 23 (Sound/Brightness button)
  4: 74 (ADB message)
  5:  0 (Battery state change)
  6:  0 (Environment interrupt)
  7:121 (Tick timer)
  8:  0 (Ghost interrupt (zero len))
  9:  1 (Empty interrupt (empty mask))
 10:  2 (Max irqs in a row)

If some parser exists for this file, and if this change is problematic,
we could increment the driver version number in /proc/pmu/info, to
correspond with the format change.
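
For anyone who does end up parsing this file, matching on the label text
rather than the row number would keep working across the format change.
A minimal userspace sketch, assuming the line layout produced by the
" %2u: %10u (%s)\n" format in pmu_irqstats_proc_show(). The helper names
parse_irq_line() and find_irq_count() are hypothetical and are not taken
from pbbuttonsd, pmac-utils or pmud:

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line in the " %2u: %10u (%s)\n" layout. On success, store
 * the counter value, copy the label (without the outer parentheses)
 * and return 0. Hypothetical helper, for illustration only.
 */
static int parse_irq_line(const char *line, unsigned int *count,
			  char *label, size_t label_size)
{
	unsigned int index;
	const char *open, *close;
	size_t len;

	if (sscanf(line, " %u: %u", &index, count) != 2)
		return -1;

	/* The label may itself contain parentheses, e.g. "(type 0)". */
	open = strchr(line, '(');
	close = strrchr(line, ')');
	if (!open || !close || close < open)
		return -1;

	len = (size_t)(close - open - 1);
	if (len >= label_size)
		return -1;
	memcpy(label, open + 1, len);
	label[len] = '\0';
	return 0;
}

/*
 * Look up a counter by label instead of by row number, so the result
 * does not depend on where the row sits in the file. Returns 0 and
 * stores the value if the label is found, -1 otherwise.
 */
static int find_irq_count(FILE *f, const char *wanted, unsigned int *count)
{
	char line[128], label[96];
	unsigned int value;

	while (fgets(line, sizeof(line), f)) {
		if (parse_irq_line(line, &value, label, sizeof(label)) == 0 &&
		    strcmp(label, wanted) == 0) {
			*count = value;
			return 0;
		}
	}
	return -1;
}
```

A caller would open /proc/pmu/interrupts with fopen() and ask for, say,
"Total CB1 triggered events", which is row 0 in the old layout and row 11
in the new one.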
---
 drivers/macintosh/via-pmu.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 730c10f7fbb7..44919b3b56e0 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -172,7 +172,9 @@ static int drop_interrupts;
 static int option_lid_wakeup = 1;
 #endif /* CONFIG_SUSPEND && CONFIG_PPC32 */
 static unsigned long async_req_locks;
-static unsigned int pmu_irq_stats[11];
+
+#define NUM_IRQ_STATS 13
+static unsigned int pmu_irq_stats[NUM_IRQ_STATS];
 
 static struct proc_dir_entry *proc_pmu_root;
 static struct proc_dir_entry *proc_pmu_info;
@@ -884,9 +886,9 @@ static const struct file_operations pmu_info_proc_fops = {
 static int pmu_irqstats_proc_show(struct seq_file *m, void *v)
 {
int i;
-   static const char *irq_names[] = {
-   "Total CB1 triggered events",
-   "Total GPIO1 triggered events",
+   static const char *irq_names[NUM_IRQ_STATS] = {
+   "Unknown interrupt (type 0)",
+   "Unknown interrupt (type 1)",
"PC-Card eject button",
"Sound/Brightness button",
"ADB message",
@@ -895,10 +897,12 @@ static int pmu_irqstats_proc_show(struct seq_file *m, void *v)
"Tick timer",
"Ghost interrupt (zero len)",
"Empty interrupt (empty mask)",
-   "Max irqs in a row"
+   "Max irqs in a row",
+   "Total CB1 triggered events",
+   "Total GPIO1 triggered events",
 };
 
-   for (i=0; i<11; i++) {
+   for (i = 0; i < NUM_IRQ_STATS; i++) {
seq_printf(m, " %2u: %10u (%s)\n",
 i, pmu_irq_stats[i], irq_names[i]);
}
@@ -1659,7 +1663,7 @@ via_pmu_interrupt(int irq, void *arg)
}
if (intr & CB1_INT) {
adb_int_pending = 1;
-   pmu_irq_stats[0]++;
+   pmu_irq_stats[11]++;
}
if (intr & SR_INT) {
req = pmu_sr_intr();
@@ -1746,7 +1750,7 @@ gpio1_interrupt(int irq, void *arg)
disable_irq_nosync(gpio_irq);
gpio_irq_enabled = 0;
}
-   pmu_irq_stats[1]++;
+   pmu_irq_stats[12]++;
adb_int_pending = 1;
spin_unlock_irqrestore(&pmu_lock, flags);
via_pmu_interrupt(0, NULL);
-- 
2.16.4

[PATCH v2 11/12] macintosh/via-pmu: Clean up interrupt statistics

2018-06-07 Thread Finn Thain

Replace an open-coded ffs() with the function call.
Simplify an if-else cascade using a switch statement.
Correct a typo and an indentation issue.
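
The equivalence between the open-coded loop and ffs() can be checked in
userspace. This is only a sketch using the libc ffs() from <strings.h>,
not the kernel's implementation. The function names are made up for
illustration. The two forms differ when the mask is zero (the loop yields
8, ffs() - 1 yields -1), but pmu_handle_data() returns before that point
when ints == 0:

```c
#include <strings.h>	/* ffs() */

/* The loop that the patch removes: index of the lowest set bit. */
static int lowest_bit_loop(unsigned char ints)
{
	int pirq;

	for (pirq = 0; pirq < 8; pirq++)
		if (ints & (1 << pirq))
			break;
	return pirq;
}

/* The replacement: ffs() numbers bits from 1, hence the "- 1". */
static int lowest_bit_ffs(unsigned char ints)
{
	return ffs(ints) - 1;
}

/*
 * Drain a pending-interrupt mask the way the "next:" loop does,
 * recording the order in which bit indices are handled (lowest first).
 * Returns the number of bits handled.
 */
static int drain_mask(unsigned char ints, int order[8])
{
	int n = 0;

	while (ints != 0) {
		int idx = lowest_bit_ffs(ints);

		ints &= (unsigned char)~(1u << idx);
		order[n++] = idx;
	}
	return n;
}
```

For a mask of 0x98 the indices come out as 3, 4, 7, which correspond to
the "Sound/Brightness button", "ADB message" and "Tick timer" rows of
/proc/pmu/interrupts.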

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/macintosh/via-pmu.c | 39 ++-
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 38d7dd0bdb28..730c10f7fbb7 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1392,7 +1392,8 @@ pmu_resume(void)
 static void
 pmu_handle_data(unsigned char *data, int len)
 {
-   unsigned char ints, pirq;
+   unsigned char ints;
+   int idx;
int i = 0;
 
asleep = 0;
@@ -1414,25 +1415,24 @@ pmu_handle_data(unsigned char *data, int len)
ints &= ~(PMU_INT_ADB_AUTO | PMU_INT_AUTO_SRQ_POLL);
 
 next:
-
if (ints == 0) {
if (i > pmu_irq_stats[10])
pmu_irq_stats[10] = i;
return;
}
-
-   for (pirq = 0; pirq < 8; pirq++)
-   if (ints & (1 << pirq))
-   break;
-   pmu_irq_stats[pirq]++;
i++;
-   ints &= ~(1 << pirq);
+
+   idx = ffs(ints) - 1;
+   ints &= ~BIT(idx);
+
+   pmu_irq_stats[idx]++;
 
/* Note: for some reason, we get an interrupt with len=1,
 * data[0]==0 after each normal ADB interrupt, at least
 * on the Pismo. Still investigating...  --BenH
 */
-   if ((1 << pirq) & PMU_INT_ADB) {
+   switch (BIT(idx)) {
+   case PMU_INT_ADB:
if ((data[0] & PMU_INT_ADB_AUTO) == 0) {
struct adb_request *req = req_awaiting_reply;
if (req == 0) {
@@ -1470,25 +1470,28 @@ pmu_handle_data(unsigned char *data, int len)
adb_input(data+1, len-1, 1);
 #endif /* CONFIG_ADB */
}
-   }
+   break;
+
/* Sound/brightness button pressed */
-   else if ((1 << pirq) & PMU_INT_SNDBRT) {
+   case PMU_INT_SNDBRT:
 #ifdef CONFIG_PMAC_BACKLIGHT
if (len == 3)
pmac_backlight_set_legacy_brightness_pmu(data[1] >> 4);
 #endif
-   }
+   break;
+
/* Tick interrupt */
-   else if ((1 << pirq) & PMU_INT_TICK) {
-   /* Environement or tick interrupt, query batteries */
+   case PMU_INT_TICK:
+   /* Environment or tick interrupt, query batteries */
if (pmu_battery_count) {
if ((--query_batt_timer) == 0) {
query_battery_state();
query_batt_timer = BATTERY_POLLING_COUNT;
}
}
-}
-   else if ((1 << pirq) & PMU_INT_ENVIRONMENT) {
+   break;
+
+   case PMU_INT_ENVIRONMENT:
if (pmu_battery_count)
query_battery_state();
pmu_pass_intr(data, len);
@@ -1498,7 +1501,9 @@ pmu_handle_data(unsigned char *data, int len)
via_pmu_event(PMU_EVT_POWER, !!(data[1]&8));
via_pmu_event(PMU_EVT_LID, data[1]&1);
}
-   } else {
+   break;
+
+   default:
   pmu_pass_intr(data, len);
}
goto next;
-- 
2.16.4

[PATCH v2 10/12] macintosh: Use common code to access RTC

2018-06-07 Thread Finn Thain

Now that the 68k Mac port has adopted the via-pmu driver, it must access
the PMU RTC using the appropriate command format. The same code can now
be used for both m68k and powerpc.

Replace the RTC code that's duplicated in arch/powerpc and arch/m68k
with common RTC accessors for Cuda and PMU devices.
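
The arithmetic shared by all of the removed accessors is small. A rough
userspace sketch, assuming only what the removed code shows: four
big-endian bytes in the reply (starting at reply[3] for Cuda, reply[1]
for PMU) and an RTC_OFFSET of 2082844800 seconds. The helper names are
hypothetical, and uint32_t is used here so that shifting a byte into the
top bits is well defined:

```c
#include <stdint.h>

/* Seconds between the RTC epoch (1 Jan 1904) and the Unix epoch. */
#define RTC_OFFSET 2082844800u

/*
 * Assemble the four big-endian RTC bytes of a reply into seconds since
 * 1970. 'first' is the index of the most significant byte: 3 for a Cuda
 * reply, 1 for a PMU reply. Hypothetical helper name.
 */
static uint32_t rtc_reply_to_unix(const unsigned char *reply, int first)
{
	uint32_t now = ((uint32_t)reply[first] << 24) |
		       ((uint32_t)reply[first + 1] << 16) |
		       ((uint32_t)reply[first + 2] << 8) |
		       (uint32_t)reply[first + 3];

	return now - RTC_OFFSET;
}

/* The inverse: split seconds since 1970 into the four bytes to send. */
static void unix_to_rtc_bytes(uint32_t secs, unsigned char out[4])
{
	uint32_t data = secs + RTC_OFFSET;

	out[0] = (data >> 24) & 0xFF;
	out[1] = (data >> 16) & 0xFF;
	out[2] = (data >> 8) & 0xFF;
	out[3] = data & 0xFF;
}
```

Since the byte order and offset are the same for both devices, only the
request format and the reply index differ, which is what makes common
accessors practical.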

Cc: Geert Uytterhoeven 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 arch/m68k/mac/misc.c   | 64 ++---
 arch/powerpc/platforms/powermac/time.c | 74 +-
 drivers/macintosh/via-cuda.c   | 34 
 drivers/macintosh/via-pmu.c| 32 +++
 include/linux/cuda.h   |  3 ++
 include/linux/pmu.h|  3 ++
 6 files changed, 78 insertions(+), 132 deletions(-)

diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index 28090a44fa09..397f9f942a9f 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -33,34 +33,6 @@
 static void (*rom_reset)(void);
 
 #ifdef CONFIG_ADB_CUDA
-static long cuda_read_time(void)
-{
-   struct adb_request req;
-   long time;
-
-   if (cuda_request(&req, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
-   return 0;
-   while (!req.complete)
-   cuda_poll();
-
-   time = (req.reply[3] << 24) | (req.reply[4] << 16) |
-  (req.reply[5] << 8) | req.reply[6];
-   return time - RTC_OFFSET;
-}
-
-static void cuda_write_time(long data)
-{
-   struct adb_request req;
-
-   data += RTC_OFFSET;
-   if (cuda_request(&req, NULL, 6, CUDA_PACKET, CUDA_SET_TIME,
-(data >> 24) & 0xFF, (data >> 16) & 0xFF,
-(data >> 8) & 0xFF, data & 0xFF) < 0)
-   return;
-   while (!req.complete)
-   cuda_poll();
-}
-
 static __u8 cuda_read_pram(int offset)
 {
struct adb_request req;
@@ -86,34 +58,6 @@ static void cuda_write_pram(int offset, __u8 data)
 #endif /* CONFIG_ADB_CUDA */
 
 #ifdef CONFIG_ADB_PMU
-static long pmu_read_time(void)
-{
-   struct adb_request req;
-   long time;
-
-   if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0)
-   return 0;
-   while (!req.complete)
-   pmu_poll();
-
-   time = (req.reply[1] << 24) | (req.reply[2] << 16) |
-  (req.reply[3] << 8) | req.reply[4];
-   return time - RTC_OFFSET;
-}
-
-static void pmu_write_time(long data)
-{
-   struct adb_request req;
-
-   data += RTC_OFFSET;
-   if (pmu_request(&req, NULL, 5, PMU_SET_RTC,
-   (data >> 24) & 0xFF, (data >> 16) & 0xFF,
-   (data >> 8) & 0xFF, data & 0xFF) < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-}
-
 static __u8 pmu_read_pram(int offset)
 {
struct adb_request req;
@@ -635,12 +579,12 @@ int mac_hwclk(int op, struct rtc_time *t)
 #ifdef CONFIG_ADB_CUDA
case MAC_ADB_EGRET:
case MAC_ADB_CUDA:
-   now = cuda_read_time();
+   now = cuda_get_time();
break;
 #endif
 #ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
-   now = pmu_read_time();
+   now = pmu_get_time();
break;
 #endif
default:
@@ -671,12 +615,12 @@ int mac_hwclk(int op, struct rtc_time *t)
 #ifdef CONFIG_ADB_CUDA
case MAC_ADB_EGRET:
case MAC_ADB_CUDA:
-   cuda_write_time(now);
+   cuda_set_time(now);
break;
 #endif
 #ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
-   pmu_write_time(now);
+   pmu_set_time(now);
break;
 #endif
default:
diff --git a/arch/powerpc/platforms/powermac/time.c b/arch/powerpc/platforms/powermac/time.c
index 274af6fa388e..e9c1f3dafe2f 100644
--- a/arch/powerpc/platforms/powermac/time.c
+++ b/arch/powerpc/platforms/powermac/time.c
@@ -42,9 +42,6 @@
 #define DBG(x...)
 #endif
 
-/* Apparently the RTC stores seconds since 1 Jan 1904 */
-#define RTC_OFFSET 2082844800
-
 /*
  * Calibrate the decrementer frequency with the VIA timer 1.
  */
@@ -103,43 +100,8 @@ static unsigned long from_rtc_time(struct rtc_time *tm)
 #endif
 
 #ifdef CONFIG_ADB_CUDA
-static unsigned long cuda_get_time(void)
-{
-   struct adb_request req;
-   unsigned int now;
-
-   if (cuda_request(&req, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
-   return 0;
-   while (!req.complete)
-   cuda_poll();
-   if (req.reply_len != 7)
-   printk(KERN_ERR "cuda_get_time: got %d byte reply\n",
-  req.reply_len);
-   now = (req.reply[3] << 24) + (req.reply[4] << 16)
-   + (req.reply[5] << 8) + req.reply[6];
-   return ((unsigned long)now) -

[PATCH v2 08/12] macintosh/via-pmu68k: Don't load driver on unsupported hardware

2018-06-07 Thread Finn Thain

Don't load the via-pmu68k driver on early PowerBooks. The M50753 PMU
device found in those models was never supported by this driver.
Attempting to load the driver usually causes a boot hang.

Cc: Geert Uytterhoeven 
Signed-off-by: Finn Thain 
---
 arch/m68k/mac/misc.c   | 6 ++
 drivers/macintosh/via-pmu68k.c | 4 
 include/uapi/linux/pmu.h   | 1 -
 3 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index c68054361615..7ccb799eeb57 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -478,8 +478,7 @@ void mac_poweroff(void)
cuda_shutdown();
 #endif
 #ifdef CONFIG_ADB_PMU68K
-   } else if (macintosh_config->adb_type == MAC_ADB_PB1
-   || macintosh_config->adb_type == MAC_ADB_PB2) {
+   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
pmu_shutdown();
 #endif
}
@@ -520,8 +519,7 @@ void mac_reset(void)
cuda_restart();
 #endif
 #ifdef CONFIG_ADB_PMU68K
-   } else if (macintosh_config->adb_type == MAC_ADB_PB1
-   || macintosh_config->adb_type == MAC_ADB_PB2) {
+   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
pmu_restart();
 #endif
} else if (CPU_IS_030) {
diff --git a/drivers/macintosh/via-pmu68k.c b/drivers/macintosh/via-pmu68k.c
index d545ed45e482..bec8e1837d7d 100644
--- a/drivers/macintosh/via-pmu68k.c
+++ b/drivers/macintosh/via-pmu68k.c
@@ -175,9 +175,6 @@ static s8 pmu_data_len[256][2] = {
 int __init find_via_pmu(void)
 {
switch (macintosh_config->adb_type) {
-   case MAC_ADB_PB1:
-   pmu_kind = PMU_68K_V1;
-   break;
case MAC_ADB_PB2:
pmu_kind = PMU_68K_V2;
break;
@@ -785,7 +782,6 @@ pmu_enable_backlight(int on)
/* first call: get current backlight value */
if (backlight_level < 0) {
switch(pmu_kind) {
-   case PMU_68K_V1:
case PMU_68K_V2:
pmu_request(&req, NULL, 3, PMU_READ_NVRAM, 0x14, 0xe);
while (!req.complete)
diff --git a/include/uapi/linux/pmu.h b/include/uapi/linux/pmu.h
index 89cb1acea93a..30f64d46f5db 100644
--- a/include/uapi/linux/pmu.h
+++ b/include/uapi/linux/pmu.h
@@ -93,7 +93,6 @@ enum {
PMU_HEATHROW_BASED, /* PowerBook G3 series */
PMU_PADDINGTON_BASED,   /* 1999 PowerBook G3 */
PMU_KEYLARGO_BASED, /* Core99 motherboard (PMU99) */
-   PMU_68K_V1, /* 68K PMU, version 1 */
PMU_68K_V2, /* 68K PMU, version 2 */
 };
 
-- 
2.16.4

[PATCH v2 06/12] macintosh/via-pmu: Add support for m68k PowerBooks

2018-06-07 Thread Finn Thain

Put #ifdefs around the Open Firmware, xmon, interrupt dispatch,
battery and suspend code. Add the necessary interrupt handling to
support m68k PowerBooks.

The pmu_kind value is available to userspace using the
PMU_IOC_GET_MODEL ioctl. It is not clear yet what hardware classes
are needed to describe m68k PowerBook models, so pmu_kind is given
the provisional value PMU_UNKNOWN.

To find out about the hardware, user programs can use /proc/bootinfo
or /proc/hardware, or send the PMU_GET_VERSION command using /dev/adb.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/macintosh/Kconfig   |   2 +-
 drivers/macintosh/via-pmu.c | 101 +++-
 2 files changed, 91 insertions(+), 12 deletions(-)

diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 97a420c11eed..9c6452b38c36 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -65,7 +65,7 @@ config ADB_CUDA
  If unsure say Y.
 
 config ADB_PMU
-   bool "Support for PMU  based PowerMacs"
+   bool "Support for PMU based PowerMacs and PowerBooks"
depends on PPC_PMAC
help
  On PowerBooks, iBooks, and recent iMacs and Power Macintoshes, the
diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 2e09137410f6..22cb7d94e3ce 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Device driver for the via-pmu on Apple Powermacs.
+ * Device driver for the PMU in Apple PowerBooks and PowerMacs.
  *
  * The VIA (versatile interface adapter) interfaces to the PMU,
  * a 6805 microprocessor core whose primary function is to control
@@ -49,20 +49,26 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#ifdef CONFIG_PPC_PMAC
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#else
+#include 
+#include 
+#include 
+#endif
 
 #include "via-pmu-event.h"
 
@@ -97,8 +103,13 @@ static DEFINE_MUTEX(pmu_info_proc_mutex);
 #define ANH    (15*RS) /* A-side data, no handshake */
 
 /* Bits in B data register: both active low */
+#ifdef CONFIG_PPC_PMAC
 #define TACK   0x08    /* Transfer acknowledge (input) */
 #define TREQ   0x10    /* Transfer request (output) */
+#else
+#define TACK   0x02
+#define TREQ   0x04
+#endif
 
 /* Bits in ACR */
 #define SR_CTRL    0x1c    /* Shift register control bits */
@@ -140,13 +151,15 @@ static int data_index;
 static int data_len;
 static volatile int adb_int_pending;
 static volatile int disable_poll;
-static struct device_node *vias;
 static int pmu_kind = PMU_UNKNOWN;
 static int pmu_fully_inited;
 static int pmu_has_adb;
+#ifdef CONFIG_PPC_PMAC
 static volatile unsigned char __iomem *via1;
 static volatile unsigned char __iomem *via2;
+static struct device_node *vias;
 static struct device_node *gpio_node;
+#endif
 static unsigned char __iomem *gpio_reg;
 static int gpio_irq = 0;
 static int gpio_irq_enabled = -1;
@@ -273,6 +286,7 @@ static char *pbook_type[] = {
 
 int __init find_via_pmu(void)
 {
+#ifdef CONFIG_PPC_PMAC
u64 taddr;
const u32 *reg;
 
@@ -355,9 +369,6 @@ int __init find_via_pmu(void)
if (!init_pmu())
goto fail_init;
 
-   printk(KERN_INFO "PMU driver v%d initialized for %s, firmware: %02x\n",
-  PMU_DRIVER_VERSION, pbook_type[pmu_kind], pmu_version);
-  
sys_ctrler = SYS_CTRLER_PMU;

return 1;
@@ -373,6 +384,30 @@ int __init find_via_pmu(void)
vias = NULL;
pmu_state = uninitialized;
return 0;
+#else
+   if (macintosh_config->adb_type != MAC_ADB_PB2)
+   return 0;
+
+   pmu_kind = PMU_UNKNOWN;
+
+   spin_lock_init(&pmu_lock);
+
+   pmu_has_adb = 1;
+
+   pmu_intr_mask = PMU_INT_PCEJECT |
+   PMU_INT_SNDBRT |
+   PMU_INT_ADB |
+   PMU_INT_TICK;
+
+   pmu_state = idle;
+
+   if (!init_pmu()) {
+   pmu_state = uninitialized;
+   return 0;
+   }
+
+   return 1;
+#endif /* !CONFIG_PPC_PMAC */
 }
 
 #ifdef CONFIG_ADB
@@ -396,13 +431,14 @@ static int pmu_init(void)
  */
 static int __init via_pmu_start(void)
 {
-   unsigned int irq;
+   unsigned int __maybe_unused irq;
 
if (pmu_state == uninitialized)
return -ENODEV;
 
batt_req.complete = 1;
 
+#ifdef CONFIG_PPC_PMAC
irq = irq_of_parse_and_map(vias, 0);
if (!irq) {
printk(KERN_ERR "via-pmu: can't map interrupt\n");
@@ -439,6 +475,19 @@ static int __init via_pmu_start(void)
 
/* Enable interrupts */
out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
+#else
+   if (request_irq(IRQ_MAC_ADB_SR, via_pmu_interrupt, IRQF_NO_SUSPEND,
+

[PATCH v2 05/12] macintosh/via-pmu: Replace via pointer with via1 and via2 pointers

2018-06-07 Thread Finn Thain

On most PowerPC Macs, the PMU driver uses the shift register and
IO port B from a single VIA chip.

On 68k and early PowerPC PowerBooks, the driver uses the shift register
from one VIA chip together with IO port B from another.

Replace via with via1 and via2 to accommodate this. For the
CONFIG_PPC_PMAC case, set via1 = via2 so there is no change.
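
To illustrate the split, here is a userspace sketch of send_byte() with
two base pointers. It assumes via1 is the chip providing the shift
register (ACR, SR) and via2 the chip providing port B. The in_8() and
out_8() helpers are plain volatile accesses standing in for the kernel
accessors, and the register offsets and SR_* values are assumptions where
the quoted diff does not show them:

```c
#include <stdint.h>

#define RS	0x200		/* skip between registers */
#define B	0		/* B-side data (offset assumed) */
#define SR	(10 * RS)	/* shift register (offset assumed) */
#define ACR	(11 * RS)	/* auxiliary control (offset assumed) */
#define TREQ	0x10		/* transfer request, active low */
#define SR_EXT	0x0c		/* assumed value */
#define SR_OUT	0x10		/* assumed value */

static volatile uint8_t *via1;	/* assumed: shift register side */
static volatile uint8_t *via2;	/* assumed: port B side */

/* Userspace stand-ins for the kernel's MMIO accessors. */
static uint8_t in_8(const volatile uint8_t *addr)
{
	return *addr;
}

static void out_8(volatile uint8_t *addr, uint8_t val)
{
	*addr = val;
}

static void send_byte(int x)
{
	out_8(&via1[ACR], in_8(&via1[ACR]) | SR_OUT | SR_EXT);
	out_8(&via1[SR], (uint8_t)x);
	/* assert TREQ (active low) on the port B chip */
	out_8(&via2[B], (uint8_t)(in_8(&via2[B]) & ~TREQ));
	(void)in_8(&via2[B]);
}
```

With via1 == via2 every access lands on the same chip, which matches the
"no change" claim for the CONFIG_PPC_PMAC case.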

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/macintosh/via-pmu.c | 142 +---
 1 file changed, 69 insertions(+), 73 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c4c324fb5fa6..2e09137410f6 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -76,7 +76,6 @@
 #define BATTERY_POLLING_COUNT  2
 
 static DEFINE_MUTEX(pmu_info_proc_mutex);
-static volatile unsigned char __iomem *via;
 
 /* VIA registers - spaced 0x200 bytes apart */
 #define RS 0x200   /* skip between registers */
@@ -145,6 +144,8 @@ static struct device_node *vias;
 static int pmu_kind = PMU_UNKNOWN;
 static int pmu_fully_inited;
 static int pmu_has_adb;
+static volatile unsigned char __iomem *via1;
+static volatile unsigned char __iomem *via2;
 static struct device_node *gpio_node;
 static unsigned char __iomem *gpio_reg;
 static int gpio_irq = 0;
@@ -340,14 +341,14 @@ int __init find_via_pmu(void)
} else
pmu_kind = PMU_UNKNOWN;
 
-   via = ioremap(taddr, 0x2000);
-   if (via == NULL) {
+   via1 = via2 = ioremap(taddr, 0x2000);
+   if (via1 == NULL) {
printk(KERN_ERR "via-pmu: Can't map address !\n");
goto fail_via_remap;
}

-   out_8(&via[IER], IER_CLR | 0x7f);   /* disable all intrs */
-   out_8(&via[IFR], 0x7f); /* clear IFR */
+   out_8(&via1[IER], IER_CLR | 0x7f);  /* disable all intrs */
+   out_8(&via1[IFR], 0x7f);    /* clear IFR */
 
pmu_state = idle;
 
@@ -362,8 +363,8 @@ int __init find_via_pmu(void)
return 1;
 
  fail_init:
-   iounmap(via);
-   via = NULL;
+   iounmap(via1);
+   via1 = via2 = NULL;
  fail_via_remap:
iounmap(gpio_reg);
gpio_reg = NULL;
@@ -437,7 +438,7 @@ static int __init via_pmu_start(void)
}
 
/* Enable interrupts */
-   out_8(&via[IER], IER_SET | SR_INT | CB1_INT);
+   out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
 
pmu_fully_inited = 1;
 
@@ -533,8 +534,8 @@ init_pmu(void)
struct adb_request req;
 
/* Negate TREQ. Set TACK to input and TREQ to output. */
-   out_8(&via[B], in_8(&via[B]) | TREQ);
-   out_8(&via[DIRB], (in_8(&via[DIRB]) | TREQ) & ~TACK);
+   out_8(&via2[B], in_8(&via2[B]) | TREQ);
+   out_8(&via2[DIRB], (in_8(&via2[DIRB]) | TREQ) & ~TACK);
 
pmu_request(&req, NULL, 2, PMU_SET_INTR_MASK, pmu_intr_mask);
timeout =  10;
@@ -1174,7 +1175,7 @@ wait_for_ack(void)
 * reported
 */
int timeout = 4000;
-   while ((in_8(&via[B]) & TACK) == 0) {
+   while ((in_8(&via2[B]) & TACK) == 0) {
if (--timeout < 0) {
printk(KERN_ERR "PMU not responding (!ack)\n");
return;
@@ -1188,23 +1189,19 @@ wait_for_ack(void)
 static inline void
 send_byte(int x)
 {
-   volatile unsigned char __iomem *v = via;
-
-   out_8(&v[ACR], in_8(&v[ACR]) | SR_OUT | SR_EXT);
-   out_8(&v[SR], x);
-   out_8(&v[B], in_8(&v[B]) & ~TREQ);  /* assert TREQ */
-   (void)in_8(&v[B]);
+   out_8(&via1[ACR], in_8(&via1[ACR]) | SR_OUT | SR_EXT);
+   out_8(&via1[SR], x);
+   out_8(&via2[B], in_8(&via2[B]) & ~TREQ);    /* assert TREQ */
+   (void)in_8(&via2[B]);
 }
 
 static inline void
 recv_byte(void)
 {
-   volatile unsigned char __iomem *v = via;
-
-   out_8(&v[ACR], (in_8(&v[ACR]) & ~SR_OUT) | SR_EXT);
-   in_8(&v[SR]);   /* resets SR */
-   out_8(&v[B], in_8(&v[B]) & ~TREQ);
-   (void)in_8(&v[B]);
+   out_8(&via1[ACR], (in_8(&via1[ACR]) & ~SR_OUT) | SR_EXT);
+   in_8(&via1[SR]);    /* resets SR */
+   out_8(&via2[B], in_8(&via2[B]) & ~TREQ);
+   (void)in_8(&via2[B]);
 }
 
 static inline void
@@ -1307,7 +1304,7 @@ pmu_suspend(void)
if (!adb_int_pending && pmu_state == idle && !req_awaiting_reply) {
if (gpio_irq >= 0)
disable_irq_nosync(gpio_irq);
-   out_8(&via[IER], CB1_INT | IER_CLR);
+   out_8(&via1[IER], CB1_INT | IER_CLR);
spin_unlock_irqrestore(&pmu_lock, flags);
break;
}
@@ -1331,7 +1328,7 @@ pmu_resume(void)
adb_int_pending = 1;
if (gpio_irq >= 0)
enable_irq(gpio_irq);
-   out_8(&via[IER], CB1_INT | IER_SET);
+   out_8(&via1[IER], CB1_INT | IER_SET);
spin_unlock_irqrestore(&pmu_lock, flags);
pmu_poll();
 }
@@ -1456,20 +1453,20 @@ pmu_sr_intr(void)
struct adb_request *req;
int bite = 0;
 
-   if (in_8([B]) &

[PATCH v2 07/12] macintosh/via-pmu: Make CONFIG_PPC_PMAC Kconfig deps explicit

2018-06-07 Thread Finn Thain

At present, CONFIG_ADB_PMU depends on CONFIG_PPC_PMAC. When this gets
relaxed to CONFIG_PPC_PMAC || CONFIG_MAC, those Kconfig symbols with
implicit deps on PPC_PMAC will need explicit deps. Add them now.
No functional change.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/macintosh/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 9c6452b38c36..26abae4c899d 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -79,7 +79,7 @@ config ADB_PMU
 
 config ADB_PMU_LED
bool "Support for the Power/iBook front LED"
-   depends on ADB_PMU
+   depends on PPC_PMAC && ADB_PMU
select NEW_LEDS
select LEDS_CLASS
help
@@ -122,7 +122,7 @@ config PMAC_MEDIABAY
 
 config PMAC_BACKLIGHT
bool "Backlight control for LCD screens"
-   depends on ADB_PMU && FB = y && (BROKEN || !PPC64)
+   depends on PPC_PMAC && ADB_PMU && FB = y && (BROKEN || !PPC64)
select FB_BACKLIGHT
help
  Say Y here to enable Macintosh specific extensions of the generic
-- 
2.16.4

[PATCH v2 04/12] macintosh/via-pmu: Enhance state machine with new 'uninitialized' state

2018-06-07 Thread Finn Thain

On 68k Macs, the via/vias pointer can't be used to determine whether
the PMU driver has been initialized. For portability, add a new
'uninitialized' state that remains set until find_via_pmu() succeeds.

After find_via_pmu() executes, testing vias == NULL is equivalent to
testing via == NULL. Replace these tests with pmu_state == uninitialized
which is simpler and more consistent. No functional change.
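
The pattern can be shown in isolation. A minimal sketch, assuming only
what the diff shows: the new enumerator is given the value 0, so a
variable with static storage starts out 'uninitialized' with no explicit
initializer. Here fake_find_via_pmu() is a made-up stand-in for a
successful find_via_pmu(), and only the first few states are listed:

```c
#include <errno.h>

/*
 * 'uninitialized' is 0, so a static variable of this type starts in
 * that state by default. The driver's enum has further states.
 */
enum pmu_state {
	uninitialized = 0,
	idle,
	sending,
	intack,
};

static volatile enum pmu_state pmu_state;

/* Stand-in for a successful find_via_pmu(): hardware found and set up. */
static void fake_find_via_pmu(void)
{
	pmu_state = idle;
}

/* Same shape as the patched pmu_probe(): no pointer test is needed. */
static int pmu_probe(void)
{
	return pmu_state == uninitialized ? -ENODEV : 0;
}
```

Because the check no longer refers to a device-tree node or a mapped
address, the same test works on platforms that have neither.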

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/macintosh/via-pmu.c | 44 ++--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 4c1bae5380c2..c4c324fb5fa6 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -114,6 +114,7 @@ static volatile unsigned char __iomem *via;
 #define CB1_INT    0x10    /* transition on CB1 input */
 
 static volatile enum pmu_state {
+   uninitialized = 0,
idle,
sending,
intack,
@@ -274,7 +275,7 @@ int __init find_via_pmu(void)
u64 taddr;
const u32 *reg;
 
-   if (via != 0)
+   if (pmu_state != uninitialized)
return 1;
vias = of_find_node_by_name(NULL, "via-pmu");
if (vias == NULL)
@@ -369,20 +370,19 @@ int __init find_via_pmu(void)
  fail:
of_node_put(vias);
vias = NULL;
+   pmu_state = uninitialized;
return 0;
 }
 
 #ifdef CONFIG_ADB
 static int pmu_probe(void)
 {
-   return vias == NULL? -ENODEV: 0;
+   return pmu_state == uninitialized ? -ENODEV : 0;
 }
 
 static int pmu_init(void)
 {
-   if (vias == NULL)
-   return -ENODEV;
-   return 0;
+   return pmu_state == uninitialized ? -ENODEV : 0;
 }
 #endif /* CONFIG_ADB */
 
@@ -397,7 +397,7 @@ static int __init via_pmu_start(void)
 {
unsigned int irq;
 
-   if (vias == NULL)
+   if (pmu_state == uninitialized)
return -ENODEV;
 
batt_req.complete = 1;
@@ -463,7 +463,7 @@ arch_initcall(via_pmu_start);
  */
 static int __init via_pmu_dev_init(void)
 {
-   if (vias == NULL)
+   if (pmu_state == uninitialized)
return -ENODEV;
 
 #ifdef CONFIG_PMAC_BACKLIGHT
@@ -966,7 +966,7 @@ static int pmu_send_request(struct adb_request *req, int sync)
 {
int i, ret;
 
-   if ((vias == NULL) || (!pmu_fully_inited)) {
+   if (pmu_state == uninitialized || !pmu_fully_inited) {
req->complete = 1;
return -ENXIO;
}
@@ -1060,7 +1060,7 @@ static int __pmu_adb_autopoll(int devs)
 
 static int pmu_adb_autopoll(int devs)
 {
-   if ((vias == NULL) || (!pmu_fully_inited) || !pmu_has_adb)
+   if (pmu_state == uninitialized || !pmu_fully_inited || !pmu_has_adb)
return -ENXIO;
 
adb_dev_map = devs;
@@ -1073,7 +1073,7 @@ static int pmu_adb_reset_bus(void)
struct adb_request req;
int save_autopoll = adb_dev_map;
 
-   if ((vias == NULL) || (!pmu_fully_inited) || !pmu_has_adb)
+   if (pmu_state == uninitialized || !pmu_fully_inited || !pmu_has_adb)
return -ENXIO;
 
/* anyone got a better idea?? */
@@ -1109,7 +1109,7 @@ pmu_request(struct adb_request *req, void (*done)(struct adb_request *),
va_list list;
int i;
 
-   if (vias == NULL)
+   if (pmu_state == uninitialized)
return -ENXIO;
 
if (nbytes < 0 || nbytes > 32) {
@@ -1134,7 +1134,7 @@ pmu_queue_request(struct adb_request *req)
unsigned long flags;
int nsend;
 
-   if (via == NULL) {
+   if (pmu_state == uninitialized) {
req->complete = 1;
return -ENXIO;
}
@@ -1247,7 +1247,7 @@ pmu_start(void)
 void
 pmu_poll(void)
 {
-   if (!via)
+   if (pmu_state == uninitialized)
return;
if (disable_poll)
return;
@@ -1257,7 +1257,7 @@ pmu_poll(void)
 void
 pmu_poll_adb(void)
 {
-   if (!via)
+   if (pmu_state == uninitialized)
return;
if (disable_poll)
return;
@@ -1272,7 +1272,7 @@ pmu_poll_adb(void)
 void
 pmu_wait_complete(struct adb_request *req)
 {
-   if (!via)
+   if (pmu_state == uninitialized)
return;
while((pmu_state != idle && pmu_state != locked) || !req->complete)
via_pmu_interrupt(0, NULL);
@@ -1288,7 +1288,7 @@ pmu_suspend(void)
 {
unsigned long flags;
 
-   if (!via)
+   if (pmu_state == uninitialized)
return;

spin_lock_irqsave(&pmu_lock, flags);
@@ -1319,7 +1319,7 @@ pmu_resume(void)
 {
unsigned long flags;
 
-   if (!via || (pmu_suspended < 1))
+   if (pmu_state == uninitialized || pmu_suspended < 1)
return;
 
spin_lock_irqsave(&pmu_lock, flags);
@@ -1681,7 +1681,7 @@ pmu_enable_irled(int on)
 {
struct adb_request req;
 
-   if (vias == NULL)
+   if (pmu_state ==

[PATCH v2 03/12] macintosh/via-pmu: Don't clear shift register interrupt flag twice

2018-06-07 Thread Finn Thain

The shift register interrupt flag gets cleared in via_pmu_interrupt()
and once again in pmu_sr_intr(). Fix this theoretical race condition.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/macintosh/via-pmu.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 74065ea410bd..4c1bae5380c2 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1458,7 +1458,6 @@ pmu_sr_intr(void)
 
if (in_8(&via[B]) & TREQ) {
printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
-   out_8(&via[IFR], SR_INT);
return NULL;
}
/* The ack may not yet be low when we get the interrupt */
-- 
2.16.4

[PATCH v2 01/12] macintosh/via-pmu: Fix section mismatch warning

2018-06-07 Thread Finn Thain

The pmu_init() function has the __init qualifier, but the ops struct
that holds a pointer to it does not. This causes a build warning.
The driver works fine because the pointer is only dereferenced early.

The function is so small that there's negligible benefit from using
the __init qualifier. Remove it to fix the warning, consistent with
the other ADB drivers.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 433dbeddfcf9..fd3c5640d586 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -378,7 +378,7 @@ static int pmu_probe(void)
return vias == NULL? -ENODEV: 0;
 }
 
-static int __init pmu_init(void)
+static int pmu_init(void)
 {
if (vias == NULL)
return -ENODEV;
-- 
2.16.4

[PATCH v2 02/12] macintosh/via-pmu: Add missing mmio accessors

2018-06-07 Thread Finn Thain

Add missing in_8() accessors to init_pmu() and pmu_sr_intr().

This fixes several sparse warnings:
drivers/macintosh/via-pmu.c:536:29: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:537:33: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:1455:17: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:1456:69: warning: dereference of noderef expression
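
For context, sparse raises this warning when a pointer annotated __iomem
is dereferenced directly instead of through an accessor. A small sketch
of the idea, which builds as ordinary C and applies the annotations only
when __CHECKER__ is defined. The in_8() and out_8() helpers below are
simplified stand-ins, not the kernel's implementations, the function name
setup_handshake_lines() is made up, and the B and DIRB offsets are
assumptions:

```c
#include <stdint.h>

/* Annotations are only meaningful to sparse; empty for a normal build. */
#ifdef __CHECKER__
# define __iomem __attribute__((noderef, address_space(2)))
# define __force __attribute__((force))
#else
# define __iomem
# define __force
#endif

#define RS	0x200		/* skip between registers */
#define B	0		/* B-side data (offset assumed) */
#define DIRB	(2 * RS)	/* B-side direction (offset assumed) */
#define TACK	0x08		/* Transfer acknowledge (input) */
#define TREQ	0x10		/* Transfer request (output) */

/* The accessors are the one place where the pointer is dereferenced. */
static inline uint8_t in_8(const volatile uint8_t __iomem *addr)
{
	return *(const volatile uint8_t __force *)addr;
}

static inline void out_8(volatile uint8_t __iomem *addr, uint8_t val)
{
	*(volatile uint8_t __force *)addr = val;
}

/* Negate TREQ. Set TACK to input and TREQ to output. */
static void setup_handshake_lines(volatile uint8_t __iomem *via)
{
	out_8(&via[B], (uint8_t)(in_8(&via[B]) | TREQ));
	out_8(&via[DIRB], (uint8_t)((in_8(&via[DIRB]) | TREQ) & ~TACK));
}
```

Taking the address with &via[B] does not dereference the pointer, so only
the reads that bypass in_8() are flagged.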

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/macintosh/via-pmu.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index fd3c5640d586..74065ea410bd 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -532,8 +532,9 @@ init_pmu(void)
int timeout;
struct adb_request req;
 
-   out_8(&via[B], via[B] | TREQ);  /* negate TREQ */
-   out_8(&via[DIRB], (via[DIRB] | TREQ) & ~TACK);  /* TACK in, TREQ out */
+   /* Negate TREQ. Set TACK to input and TREQ to output. */
+   out_8(&via[B], in_8(&via[B]) | TREQ);
+   out_8(&via[DIRB], (in_8(&via[DIRB]) | TREQ) & ~TACK);
 
pmu_request(&req, NULL, 2, PMU_SET_INTR_MASK, pmu_intr_mask);
timeout =  10;
@@ -1455,8 +1456,8 @@ pmu_sr_intr(void)
struct adb_request *req;
int bite = 0;
 
-   if (via[B] & TREQ) {
-   printk(KERN_ERR "PMU: spurious SR intr (%x)\n", via[B]);
+   if (in_8(&via[B]) & TREQ) {
+   printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
out_8(&via[IFR], SR_INT);
return NULL;
}
-- 
2.16.4

[PATCH v2 00/12] macintosh: Resolve various PMU driver problems

2018-06-07 Thread Finn Thain

This series of patches has the following aims.

1) Eliminate duplicated code. Linux presently has two drivers for
   the 68HC05-based PMU devices found in Macs: via-pmu and via-pmu68k.
   There's no value in having separate PMU drivers for each architecture.

2) Avoid further work on via-pmu68k that's not needed for via-pmu.

3) Fix some bugs in the via-pmu driver.

4) Enable the /dev/pmu and /proc/pmu/* userspace APIs on m68k Macs
   by adopting via-pmu.

5) Improve stability on early 100-series PowerBooks by loading no PMU
   driver at all. Neither via-pmu nor via-pmu68k supports the early
   M50753-based PMU device found in these models.

6) Eliminate duplicated RTC accessors for PMU and Cuda. Presently these
   can be found under both arch/m68k and arch/powerpc.

7) Assist the out-of-tree NuBus PowerMac port to support PMU designs
   shared with the m68k Mac port (e.g. PowerBooks 190 and 5300).

This patch series has been regression tested on various PowerBooks
(190, 520, 3400, Pismo G3) and PowerMacs (Beige G3, G5). These patches
did not affect userland utilities. (Note that there is a userland-
visible change to the contents of /proc/pmu/interrupts.)

Changed since v1:
1) Added blank lines after 'break' statements in patch 10.
2) Improved patch description for patch 3.
3) Added reviewed-by tags.
4) Split patch 8 to make code review easier.


Finn Thain (12):
  macintosh/via-pmu: Fix section mismatch warning
  macintosh/via-pmu: Add missing mmio accessors
  macintosh/via-pmu: Don't clear shift register interrupt flag twice
  macintosh/via-pmu: Enhance state machine with new 'uninitialized'
state
  macintosh/via-pmu: Replace via pointer with via1 and via2 pointers
  macintosh/via-pmu: Add support for m68k PowerBooks
  macintosh/via-pmu: Make CONFIG_PPC_PMAC Kconfig deps explicit
  macintosh/via-pmu68k: Don't load driver on unsupported hardware
  macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
  macintosh: Use common code to access RTC
  macintosh/via-pmu: Clean up interrupt statistics
  macintosh/via-pmu: Disambiguate interrupt statistics

 arch/m68k/configs/mac_defconfig|   2 +-
 arch/m68k/configs/multi_defconfig  |   2 +-
 arch/m68k/mac/config.c |   2 +-
 arch/m68k/mac/misc.c   | 118 +
 arch/powerpc/platforms/powermac/time.c |  74 +--
 drivers/macintosh/Kconfig  |  19 +-
 drivers/macintosh/Makefile |   1 -
 drivers/macintosh/adb.c|   2 +-
 drivers/macintosh/via-cuda.c   |  34 ++
 drivers/macintosh/via-pmu.c| 378 ++-
 drivers/macintosh/via-pmu68k.c | 850 -
 include/linux/cuda.h   |   3 +
 include/linux/pmu.h|   3 +
 include/uapi/linux/pmu.h   |   2 -
 14 files changed, 311 insertions(+), 1179 deletions(-)
 delete mode 100644 drivers/macintosh/via-pmu68k.c

-- 
2.16.4

Re: [PATCH 1/3] powerpc: make CPU selection logic generic in Makefile

2018-06-07 Thread Nicholas Piggin

On Thu,  7 Jun 2018 10:10:18 + (UTC)
Christophe Leroy  wrote:

> At the time being, when adding a new CPU for selection, both
> Kconfig.cputype and Makefile have to be modified.
> 
> This patch moves into Kconfig.cputype the name of the CPU to be
> passed to the -mcpu= argument.

Seems like a good cleanup.

Reviewed-by: Nicholas Piggin 

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/Makefile  |  8 +---
>  arch/powerpc/platforms/Kconfig.cputype | 15 +++
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 9704ab360d39..9a5642552abc 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -175,13 +175,7 @@ ifdef CONFIG_MPROFILE_KERNEL
>  endif
>  endif
>  
> -CFLAGS-$(CONFIG_CELL_CPU) += $(call cc-option,-mcpu=cell)
> -CFLAGS-$(CONFIG_POWER5_CPU) += $(call cc-option,-mcpu=power5)
> -CFLAGS-$(CONFIG_POWER6_CPU) += $(call cc-option,-mcpu=power6)
> -CFLAGS-$(CONFIG_POWER7_CPU) += $(call cc-option,-mcpu=power7)
> -CFLAGS-$(CONFIG_POWER8_CPU) += $(call cc-option,-mcpu=power8)
> -CFLAGS-$(CONFIG_POWER9_CPU) += $(call cc-option,-mcpu=power9)
> -CFLAGS-$(CONFIG_PPC_8xx) += $(call cc-option,-mcpu=860)
> +CFLAGS-$(CONFIG_SPECIAL_CPU_BOOL) += $(call cc-option,-mcpu=$(CONFIG_SPECIAL_CPU))
>  
>  # Altivec option not allowed with e500mc64 in GCC.
>  ifeq ($(CONFIG_ALTIVEC),y)
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index cc892dcfa114..71ef559cc474 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -140,6 +140,21 @@ config E6500_CPU
>  
>  endchoice
>  
> +config SPECIAL_CPU_BOOL
> + bool
> + default !GENERIC_CPU
> +
> +config SPECIAL_CPU
> + string
> + depends on SPECIAL_CPU_BOOL
> + default "cell" if CELL_CPU
> + default "power5" if POWER5_CPU
> + default "power6" if POWER6_CPU
> + default "power7" if POWER7_CPU
> + default "power8" if POWER8_CPU
> + default "power9" if POWER9_CPU
> + default "860" if PPC_8xx
> +
>  config PPC_BOOK3S
>   def_bool y
>   depends on PPC_BOOK3S_32 || PPC_BOOK3S_64

Re: [v3 PATCH 5/5] powerpc/pseries: Display machine check error details.

2018-06-07 Thread Nicholas Piggin

On Thu, 07 Jun 2018 22:59:04 +0530
Mahesh J Salgaonkar  wrote:

> From: Mahesh Salgaonkar 
> 
> Extract the MCE error details from RTAS extended log and display it to
> console.
> 
> With this patch you should now see mce logs like below:
> 
> [  142.371818] Severe Machine check interrupt [Recovered]
> [  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
> [  142.371822]   Initiator: CPU
> [  142.371823]   Error type: SLB [Multihit]
> [  142.371824] Effective address: dca7
> 
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/rtas.h  |5 +
>  arch/powerpc/platforms/pseries/ras.c |  128 +-
>  2 files changed, 131 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 3f2fba7ef23b..8100a95c133a 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
>   return (elog->byte1 & 0x04) >> 2;
>  }
>  
> +static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
> +{
> + return (elog->byte2 & 0xf0) >> 4;
> +}
> +
>  #define rtas_error_type(x)   ((x)->byte3)
>  
>  static inline
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index e56759d92356..cd9446980092 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>   return 0; /* need to perform reset */
>  }
>  
> -static int mce_handle_error(struct rtas_error_log *errp)
> +#define VAL_TO_STRING(ar, val)   ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
> +
> +static void pseries_print_mce_info(struct pt_regs *regs,
> + struct rtas_error_log *errp, int disposition)
> +{
> + const char *level, *sevstr;
> + struct pseries_errorlog *pseries_log;
> + struct pseries_mc_errorlog *mce_log;
> + uint8_t error_type, err_sub_type;
> + uint8_t initiator = rtas_error_initiator(errp);
> + uint64_t addr;
> +
> + static const char * const initiators[] = {
> + "Unknown",
> + "CPU",
> + "PCI",
> + "ISA",
> + "Memory",
> + "Power Mgmt",
> + };
> + static const char * const mc_err_types[] = {
> + "UE",
> + "SLB",
> + "ERAT",
> + "TLB",
> + "D-Cache",
> + "Unknown",
> + "I-Cache",
> + };
> + static const char * const mc_ue_types[] = {
> + "Indeterminate",
> + "Instruction fetch",
> + "Page table walk ifetch",
> + "Load/Store",
> + "Page table walk Load/Store",
> + };
> +
> + /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
> + static const char * const mc_slb_types[] = {
> + "Parity",
> + "Multihit",
> + "Indeterminate",
> + };
> +
> + /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
> + static const char * const mc_soft_types[] = {
> + "Unknown",
> + "Parity",
> + "Multihit",
> + "Indeterminate",
> + };
> +
> + pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
> + if (pseries_log == NULL)
> + return;
> +
> + mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
> +
> + error_type = rtas_mc_error_type(mce_log);
> + err_sub_type = rtas_mc_error_sub_type(mce_log);
> +
> + switch (rtas_error_severity(errp)) {
> + case RTAS_SEVERITY_NO_ERROR:
> + level = KERN_INFO;
> + sevstr = "Harmless";
> + break;
> + case RTAS_SEVERITY_WARNING:
> + level = KERN_WARNING;
> + sevstr = "";
> + break;
> + case RTAS_SEVERITY_ERROR:
> + case RTAS_SEVERITY_ERROR_SYNC:
> + level = KERN_ERR;
> + sevstr = "Severe";
> + break;
> + case RTAS_SEVERITY_FATAL:
> + default:
> + level = KERN_ERR;
> + sevstr = "Fatal";
> + break;
> + }
> +
> + printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
> + disposition == RTAS_DISP_FULLY_RECOVERED ?
> + "Recovered" : "Not recovered");
> + if (user_mode(regs)) {
> + printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
> + regs->nip, current->pid, current->comm);
> + } else {
> + printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
> + (void *)regs->nip);
> + }

I think it's probably still useful to print pid/comm for kernel mode
faults if !in_interrupt()... I see you're basically taking kernel/mce.c
and doing the same thing.

Is
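
For readers following the quoted patch, here is a minimal userspace sketch
of two pieces of it: extracting the initiator from the high nibble of byte2,
and the bounds-checked table lookup that VAL_TO_STRING performs. The struct
and the sketch_* names are hypothetical stand-ins, not the kernel's
definitions; only the bit layout and the table order are taken from the
quoted diff.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for the relevant byte of struct rtas_error_log. */
struct sketch_error_log {
	uint8_t byte2;	/* initiator lives in the high nibble */
};

/* Same extraction as the quoted rtas_error_initiator(): high four bits. */
static uint8_t sketch_error_initiator(const struct sketch_error_log *elog)
{
	return (elog->byte2 & 0xf0) >> 4;
}

/* Same order as the initiators[] table in the quoted patch. */
static const char * const sketch_initiators[] = {
	"Unknown", "CPU", "PCI", "ISA", "Memory", "Power Mgmt",
};

/* Bounds-checked lookup: any value past the table maps to "Unknown". */
static const char *sketch_initiator_name(uint8_t val)
{
	return val < ARRAY_SIZE(sketch_initiators) ?
		sketch_initiators[val] : "Unknown";
}
```

With byte2 = 0x1f the initiator is 1, which the table reports as "CPU"; a
value such as 9 falls outside the table and is reported as "Unknown" instead
of reading past the end of the array.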

Re: [v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

2018-06-07 Thread Nicholas Piggin

On Thu, 07 Jun 2018 22:58:55 +0530
Mahesh J Salgaonkar  wrote:

> From: Mahesh Salgaonkar 
> 
> If we get a machine check exception due to SLB errors, then dump the
> current SLB contents, which will be very helpful in debugging the
> root cause of SLB errors. On pseries, as of today, the system crashes on
> SLB errors. These are soft errors and can be fixed by flushing the SLBs so
> the kernel can continue to function instead of crashing. This patch
> fixes that as well.

So pseries never flushed SLB and reloaded in response to multi hit
errors? This seems like quite a good improvement then. I like
dumping SLB too.

It's a bit annoying we can't share the same code with xmon really.
That's okay, but if you take a copy like this, I suggest commenting
them both with a note to keep them in sync if you re-post the
series.

> 
> With this patch the console will log SLB contents like below on SLB MCE
> errors:
> 
> [  822.711728] slb contents:

Suggest keeping the same format as the xmon dump (in particular the
CPU number; even though it's probably printed elsewhere in the MCE
message, it doesn't hurt).

Reviewed-by: Nicholas Piggin 

Thanks,
Nick

> [  822.711730] 00 c800 400ea1b217000500
> [  822.711731]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
> [  822.711732] 01 d800 400d43642f000510
> [  822.711733]   1T  ESID=   d0  VSID=  d43642f LLP:110
> [  822.711734] 09 f800 400a86c85f000500
> [  822.711736]   1T  ESID=   f0  VSID=  a86c85f LLP:100
> [  822.711737] 10 7f000800 400d1f26e3000d90
> [  822.711738]   1T  ESID=   7f  VSID=  d1f26e3 LLP:110
> [  822.711739] 11 1800 000e3615f520fd90
> [  822.711740]  256M ESID=1  VSID=   e3615f520f LLP:110
> [  822.711740] 12 d800 400d43642f000510
> [  822.711741]   1T  ESID=   d0  VSID=  d43642f LLP:110
> [  822.711742] 13 d800 400d43642f000510
> [  822.711743]   1T  ESID=   d0  VSID=  d43642f LLP:110
> 
> 
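
The decoding behind each pair of log lines above can be sketched in
userspace as follows. The SK_* constants are assumed to mirror the
SLB_*/SID_* definitions in asm/book3s/64/mmu-hash.h and should be checked
against the real header; the sk_* names and the struct are hypothetical.
The logic follows the 1T / 256M branches of the quoted slb_dump_contents().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match asm/book3s/64/mmu-hash.h; verify before relying on them. */
#define SK_SLB_ESID_V		0x0000000008000000ULL	/* entry valid */
#define SK_SLB_VSID_B		0xc000000000000000ULL	/* segment size field */
#define SK_SLB_VSID_B_1T	0x4000000000000000ULL	/* 1T segment */
#define SK_SLB_VSID_LLP		0x0000000000000130ULL	/* L | LP bits */
#define SK_SLB_VSID_SHIFT	12
#define SK_SLB_VSID_SHIFT_1T	24
#define SK_SID_SHIFT		28
#define SK_SID_MASK		0xfffffffffULL
#define SK_SID_SHIFT_1T		40
#define SK_SID_MASK_1T		0xffffffULL

struct sk_slb_entry {
	bool valid;	/* SLB_ESID_V set in the ESID word */
	bool is_1t;	/* 1T segment, otherwise 256M */
	uint64_t esid;
	uint64_t vsid;
	uint64_t llp;
};

/* Decode one (slbmfee, slbmfev) pair the way the dump routine prints it. */
static struct sk_slb_entry sk_slb_decode(uint64_t e, uint64_t v)
{
	struct sk_slb_entry d = { 0 };

	d.valid = (e & SK_SLB_ESID_V) != 0;
	if (!d.valid)
		return d;	/* the dump prints only the raw words here */

	d.is_1t = (v & SK_SLB_VSID_B_1T) != 0;
	d.llp = v & SK_SLB_VSID_LLP;
	if (d.is_1t) {
		d.esid = (e >> SK_SID_SHIFT_1T) & SK_SID_MASK_1T;
		d.vsid = (v & ~SK_SLB_VSID_B) >> SK_SLB_VSID_SHIFT_1T;
	} else {
		d.esid = (e >> SK_SID_SHIFT) & SK_SID_MASK;
		d.vsid = (v & ~SK_SLB_VSID_B) >> SK_SLB_VSID_SHIFT;
	}
	return d;
}
```

Under these assumptions, the VSID word 400ea1b217000500 from entry 00
decodes to a 1T segment with VSID ea1b217 and LLP 100, and the VSID word
000e3615f520fd90 from entry 11 decodes to a 256M segment with VSID
e3615f520f and LLP 110, which agrees with the decoded lines in the log.
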
> Suggested-by: Aneesh Kumar K.V 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 +
>  arch/powerpc/mm/slb.c |   35 +
>  arch/powerpc/platforms/pseries/ras.c  |   29 -
>  3 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 50ed64fba4ae..c0da68927235 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
>  
>  extern void slb_initialize(void);
>  extern void slb_flush_and_rebolt(void);
> +extern void slb_dump_contents(void);
>  
>  extern void slb_vmalloc_update(void);
>  extern void slb_set_size(u16 size);
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 66577cc66dc9..799aa117cec3 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
>   get_paca()->slb_cache_ptr = 0;
>  }
>  
> +void slb_dump_contents(void)
> +{
> + int i;
> + unsigned long e, v;
> + unsigned long llp;
> +
> + pr_err("slb contents:\n");
> + for (i = 0; i < mmu_slb_size; i++) {
> + asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
> + asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
> +
> + if (!e && !v)
> + continue;
> +
> + pr_err("%02d %016lx %016lx", i, e, v);
> +
> + if (!(e & SLB_ESID_V)) {
> + pr_err("\n");
> + continue;
> + }
> + llp = v & SLB_VSID_LLP;
> + if (v & SLB_VSID_B_1T) {
> + pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
> + GET_ESID_1T(e),
> + (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
> + llp);
> + } else {
> + pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
> + GET_ESID(e),
> + (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
> + llp);
> + }
> + }
> +}
> +
>  void slb_vmalloc_update(void)
>  {
>   unsigned long vflags;
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 2edc673be137..e56759d92356 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>   return 0; /* need to perform reset */
>  }
>  
> +static int mce_handle_error(struct rtas_error_log *errp)
> +{
> + struct pseries_errorlog *pseries_log;
> + struct pseries_mc_errorlog *mce_log;
> + int disposition =

Re: [v3 PATCH 2/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.

2018-06-07 Thread Nicholas Piggin

On Thu, 07 Jun 2018 22:58:33 +0530
Mahesh J Salgaonkar  wrote:

> From: Mahesh Salgaonkar 
> 
> During Machine Check interrupt on pseries platform, register r3 points
> to the RTAS extended event log passed by hypervisor. Since hypervisor uses r3
> to pass pointer to rtas log, it stores the original r3 value at the
> start of the memory (first 8 bytes) pointed by r3. Since hypervisor
> stores this info and rtas log is in BE format, linux should make
> sure to restore r3 value in correct endian format.
> 
> Without this patch, when the MCE handler returns after recovery to the
> code that caused the MCE, it may end up with a Data SLB access interrupt
> for an invalid address, followed by a kernel panic or hang.
> 
> [   62.878965] Severe Machine check interrupt [Recovered]
> [   62.878968]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
> [   62.878969]   Initiator: CPU
> [   62.878970]   Error type: SLB [Multihit]
> [   62.878971] Effective address: dca7
> cpu 0xa: Vector: 380 (Data SLB Access) at [c000fc7775b0]
> pc: c09694c0: vsnprintf+0x80/0x480
> lr: c09698e0: vscnprintf+0x20/0x60
> sp: c000fc777830
>msr: 82009033
>dar: a803a30c00d0
>   current = 0xcbc9ef00
>   paca= 0xc0001eca5c00 softe: 3irq_happened: 0x01
> pid   = 8860, comm = insmod
> [c000fc7778b0] c09698e0 vscnprintf+0x20/0x60
> [c000fc7778e0] c016b6c4 vprintk_emit+0xb4/0x4b0
> [c000fc777960] c016d40c vprintk_func+0x5c/0xd0
> [c000fc777980] c016cbb4 printk+0x38/0x4c
> [c000fc7779a0] dca301c0 init_module+0x1c0/0x338 [bork_kernel]
> [c000fc777a40] c000d9c4 do_one_initcall+0x54/0x230
> [c000fc777b00] c01b3b74 do_init_module+0x8c/0x248
> [c000fc777b90] c01b2478 load_module+0x12b8/0x15b0
> [c000fc777d30] c01b29e8 sys_finit_module+0xa8/0x110
> [c000fc777e30] c000b204 system_call+0x58/0x6c
> --- Exception: c00 (System Call) at 7fff8bda0644
> SP (7fffdfbfe980) is in userspace
> 
> This patch fixes this issue.

LGTM

Reviewed-by: Nicholas Piggin 

> 
> Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/platforms/pseries/ras.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c 
> b/arch/powerpc/platforms/pseries/ras.c
> index 5e1ef9150182..2edc673be137 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -360,7 +360,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>   }
>  
>   savep = __va(regs->gpr[3]);
> - regs->gpr[3] = savep[0];/* restore original r3 */
> + regs->gpr[3] = be64_to_cpu(savep[0]);   /* restore original r3 */
>  
>   /* If it isn't an extended log we can use the per cpu 64bit buffer */
>   h = (struct rtas_error_log *)&savep[1];
>
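
To illustrate why the conversion in the one-line fix matters: the saved r3
is stored big-endian, so a plain 64-bit load on a little-endian kernel
yields a byte-swapped value. Below is a minimal userspace sketch of a
byte-order-independent read, which is the effect be64_to_cpu() has on the
saved word. The sk_load_be64 name is hypothetical and this is not the
kernel helper itself.

```c
#include <stdint.h>

/*
 * Read a 64-bit big-endian value from memory. The result is the same on
 * little-endian and big-endian hosts because the bytes are combined
 * explicitly, most significant byte first.
 */
static uint64_t sk_load_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

For example, the bytes 00 00 7f ff 8b da 06 44 decode to 0x00007fff8bda0644
on any host, whereas dereferencing them as a native 64-bit integer on a
little-endian machine would give 0x4406da8bff7f0000, an invalid address of
the kind that leads to the Data SLB Access fault shown in the quoted trace.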

Re: [v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.

2018-06-07 Thread Nicholas Piggin

On Thu, 07 Jun 2018 22:58:11 +0530
Mahesh J Salgaonkar  wrote:

> From: Mahesh Salgaonkar 
> 
> rtas_log_buf is a buffer to hold RTAS event data that are communicated
> to kernel by hypervisor. This buffer is then used to pass RTAS event
> data to user through proc fs. This buffer is allocated from vmalloc
> (non-linear mapping) area.
> 
> On Machine check interrupt, register r3 points to RTAS extended event
> log passed by hypervisor that contains the MCE event. The pseries
> machine check handler then logs this error into rtas_log_buf. Because
> rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking a
> page fault (vector 0x300) while accessing it. Since the machine check
> interrupt handler runs in NMI context, we cannot afford to take any
> page fault. Page faults are not honored in NMI context and cause a
> kernel panic. This patch fixes this issue by allocating rtas_log_buf
> using kmalloc.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
> Cc: sta...@vger.kernel.org
> Suggested-by: Aneesh Kumar K.V 
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/kernel/rtasd.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> index f915db93cd42..3957d4ae2ba2 100644
> --- a/arch/powerpc/kernel/rtasd.c
> +++ b/arch/powerpc/kernel/rtasd.c
> @@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
>   rtas_error_log_max = rtas_get_error_log_max();
>   rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
>  
> - rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
> + rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, GFP_KERNEL);

Does this have to be in the RMA region if it's to be accessed with
relocation off in the guest?

A comment about it being accessed with relocation off might be helpful
too.

Thanks,
Nick

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no ? Like how we used to let the user
> decide whether to pass-through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming the worst case isolation and malicious users.
It's the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt

On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > We *can* allow individual GPUs to be passed through, either if somebody
> > designs a system without cross links, or if the user is ok with the
> > security risk as the guest driver will not enable them if it doesn't
> > "find" both sides of them.
> 
> If GPUs are not isolated and we cannot prevent them from probing each
> other via these links, then I think we have an obligation to configure
> grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Well, it's a user decision, no ? Like how we used to let the user
decide whether to pass-through things that have LSIs shared out of
their domain.

Cheers,
Ben.

Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code

2018-06-07 Thread Masami Hiramatsu

On Thu, 07 Jun 2018 22:07:26 +0530
"Naveen N. Rao"  wrote:

> Masami Hiramatsu wrote:
> > On Thu, 07 Jun 2018 17:07:00 +0530
> > "Naveen N. Rao"  wrote:
> > 
> >> Masami Hiramatsu wrote:
> >> > Don't call the ->break_handler() from the arm kprobes code,
> >>^^^ powerpc
> >> 
> >> > because it was only used by jprobes which got removed.
> >> > 
> >> > This also makes skip_singlestep() a static function since
> >> > only ftrace-kprobe.c is using this function.
> >> > 
> >> > Signed-off-by: Masami Hiramatsu 
> >> > Cc: Benjamin Herrenschmidt 
> >> > Cc: Paul Mackerras 
> >> > Cc: Michael Ellerman 
> >> > Cc: "Naveen N. Rao" 
> >> > Cc: linuxppc-dev@lists.ozlabs.org
> >> > ---
> >> >  arch/powerpc/include/asm/kprobes.h   |   10 --
> >> >  arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-
> >> >  arch/powerpc/kernel/kprobes.c|   31 +++
> >> >  3 files changed, 14 insertions(+), 43 deletions(-)
> >> 
> >> With 2 small comments...
> > 
> > 2 ? or 1 ?
> 
> Two, with one in the commit log above :)

Oops, sorry I missed it. yeah, the comment above is my mistake. I'll fix it.

Thanks!


-- 
Masami Hiramatsu

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patches from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how
> covered they are, as we need to play with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex

Re: linux-next: manual merge of the powerpc tree with the kbuild tree

2018-06-07 Thread Stephen Rothwell

Hi all,

On Thu, 31 May 2018 09:32:16 +1000 Stephen Rothwell  
wrote:
>
> Today's linux-next merge of the powerpc tree got a conflict in:
> 
>   arch/powerpc/kernel/module_64.c
> 
> between commit:
> 
>   06aeb9e3f2bc ("powerpc/kbuild: move -mprofile-kernel check to Kconfig")
> 
> from the kbuild tree and commit:
> 
>   250122baed29 ("powerpc64/module: Tighten detection of mcount call sites with -mprofile-kernel")
> 
> from the powerpc tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc arch/powerpc/kernel/module_64.c
> index 55bccc315e1a,f7667e2ebfcb..
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@@ -462,9 -466,12 +466,12 @@@ static unsigned long stub_for_addr(cons
>   return (unsigned long)[i];
>   }
>   
>  -#ifdef CC_USING_MPROFILE_KERNEL
>  +#ifdef CONFIG_MPROFILE_KERNEL
> - static bool is_early_mcount_callsite(u32 *instruction)
> + static bool is_mprofile_mcount_callsite(const char *name, u32 *instruction)
>   {
> + if (strcmp("_mcount", name))
> + return false;
> + 
>   /*
>* Check if this is one of the -mprofile-kernel sequences.
>*/

This is now a conflict between the kbuild tree and Linus' tree.

-- 
Cheers,
Stephen Rothwell



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt

On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > 
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation properties
> > > does NVLink provide given that its entire purpose for existing seems to
> > > be to provide a high performance link for p2p between devices?  
> > 
> > Not entire. On POWER chips, we also have an nvlink between the device
> > and the CPU which is running significantly faster than PCIe.
> > 
> > But yes, there are cross-links and those should probably be accounted
> > for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)
> Thanks,

I don't know about "vGPUs" and what nVidia may be cooking in that area.

The patches from Alexey allow for passing through the full thing, but
they aren't trivial (there are additional issues, I'm not sure how
covered they are, as we need to play with the mapping attributes of
portions of the GPU memory on the host side...).

Note: The cross-links are only per-socket so that would be 2 groups of
3.

We *can* allow individual GPUs to be passed through, either if somebody
designs a system without cross links, or if the user is ok with the
security risk as the guest driver will not enable them if it doesn't
"find" both sides of them.

Cheers,
Ben.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Benjamin Herrenschmidt

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entire. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but a separate bus. The design inherits from the original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The number of GPUs suggests passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400  
> > 0x0420  
> > 0x0440  
> > 0x2400  
> > 0x2420  
> > 0x2440  
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides an userspace driver, this is no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile  |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h|  11 ++
> >  include/uapi/linux/vfio.h  |   3 +
> >  arch/powerpc/kernel/iommu.c|   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c|  70 +---
> >  drivers/vfio/pci/vfio_pci.c|  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c| 190 
> > +
> >  drivers/vfio/vfio_iommu_spapr_tce.c|  42 +---
> >  drivers/vfio/pci/Kconfig   |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> >

Re: [RFC PATCH 0/4] powerpc/pseries: Machine check handler improvements.

2018-06-07 Thread Michal Suchánek

On Wed, 06 Jun 2018 10:06:23 +0530
Mahesh J Salgaonkar  wrote:

> This patch series includes some improvements to the machine check handler
> for pseries. Patch 1 fixes an issue where the machine check handler
> crashes the kernel while accessing a vmalloc-ed buffer in NMI context.
> Patch 3 dumps the SLB contents on SLB MCE errors to improve
> debuggability. Patch 4 displays the MCE error details on the console.
> 
> ---
> 
> Mahesh Salgaonkar (4):
>   powerpc/pseries: convert rtas_log_buf to linear allocation.
>   powerpc/pseries: Define MCE error event section.
>   powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>   powerpc/pseries: Display machine check error details.

Tested-by: Michal Suchánek 

Thanks

Michal

UBSAN: Undefined behaviour in ../include/linux/percpu_counter.h:137:13

2018-06-07 Thread Mathieu Malaterre

Hi there,

I have a reproducible UBSAN report appearing in dmesg after a while on my G4
(*). Could anyone suggest a way to diagnose the actual root issue here
(or is it just a false positive)?

Thanks,

(*)
[41877.514338] 

[41877.514364] UBSAN: Undefined behaviour in
../include/linux/percpu_counter.h:137:13
[41877.514373] signed integer overflow:
[41877.514378] 9223352809007201260 + 41997676517838 cannot be
represented in type 'long long int'
[41877.514389] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #54
[41877.514394] Call Trace:
[41877.514411] [dffedd30] [c047a5f8] ubsan_epilogue+0x18/0x4c (unreliable)
[41877.514422] [dffedd40] [c047af98] handle_overflow+0xbc/0xdc
[41877.514437] [dffeddc0] [c043aaa8] cfq_completed_request+0x560/0x1234
[41877.514446] [dffede40] [c03f595c] __blk_put_request+0xb0/0x2dc
[41877.514460] [dffede80] [c05aa41c] scsi_end_request+0x19c/0x344
[41877.514469] [dffedeb0] [c05abba0] scsi_io_completion+0x4b4/0x854
[41877.514482] [dffedf10] [c040604c] blk_done_softirq+0xe4/0x1e0
[41877.514496] [dffedf60] [c07eef84] __do_softirq+0x16c/0x5f0
[41877.514508] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8
[41877.514520] [dffedff0] [c001646c] call_do_irq+0x24/0x3c
[41877.514533] [c0ce5e80] [c0009a2c] do_IRQ+0x98/0x1a0
[41877.514541] [c0ce5eb0] [c001b93c] ret_from_except+0x0/0x14
[41877.514549] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
   LR = arch_cpu_idle+0x30/0x78
[41877.514558] [c0ce5f70] [c0ce4000] 0xc0ce4000 (unreliable)
[41877.514570] [c0ce5f80] [c00a3928] do_idle+0xc4/0x158
[41877.514577] [c0ce5fb0] [c00a3b74] cpu_startup_entry+0x24/0x28
[41877.514585] [c0ce5fc0] [c0988820] start_kernel+0x47c/0x490
[41877.514592] [c0ce5ff0] [3444] 0x3444
[41877.514597] 

[41886.390210] 

[41886.390236] UBSAN: Undefined behaviour in
../include/linux/percpu_counter.h:137:13
[41886.390245] signed integer overflow:
[41886.390250] 9223366156262940402 + 42006563339289 cannot be
represented in type 'long long int'
[41886.390260] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #54
[41886.390265] Call Trace:
[41886.390282] [dffedd30] [c047a5f8] ubsan_epilogue+0x18/0x4c (unreliable)
[41886.390293] [dffedd40] [c047af98] handle_overflow+0xbc/0xdc
[41886.390309] [dffeddc0] [c043a8c4] cfq_completed_request+0x37c/0x1234
[41886.390317] [dffede40] [c03f595c] __blk_put_request+0xb0/0x2dc
[41886.390331] [dffede80] [c05aa41c] scsi_end_request+0x19c/0x344
[41886.390340] [dffedeb0] [c05abba0] scsi_io_completion+0x4b4/0x854
[41886.390353] [dffedf10] [c040604c] blk_done_softirq+0xe4/0x1e0
[41886.390367] [dffedf60] [c07eef84] __do_softirq+0x16c/0x5f0
[41886.390379] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8
[41886.390391] [dffedff0] [c001646c] call_do_irq+0x24/0x3c
[41886.390404] [c0ce5e80] [c0009a2c] do_IRQ+0x98/0x1a0
[41886.390411] [c0ce5eb0] [c001b93c] ret_from_except+0x0/0x14
[41886.390420] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
   LR = arch_cpu_idle+0x30/0x78
[41886.390429] [c0ce5f70] [c0ce4000] 0xc0ce4000 (unreliable)
[41886.390441] [c0ce5f80] [c00a3928] do_idle+0xc4/0x158
[41886.390449] [c0ce5fb0] [c00a3b74] cpu_startup_entry+0x24/0x28
[41886.390457] [c0ce5fc0] [c0988820] start_kernel+0x47c/0x490
[41886.390463] [c0ce5ff0] [3444] 0x3444
[41886.390468]
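
As a quick sanity check on the report above (not a diagnosis of the root cause), the operands printed by UBSAN can be tested against the range of a signed 64-bit integer. The helper name `s64_add_overflows` is hypothetical and only illustrates the arithmetic; it is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a + b cannot be represented
 * in a signed 64-bit integer. The check is done without performing
 * the overflowing addition itself.
 */
static bool s64_add_overflows(int64_t a, int64_t b)
{
	if (b > 0)
		return a > INT64_MAX - b;
	if (b < 0)
		return a < INT64_MIN - b;
	return false;
}
```

For the first pair in the log, 9223352809007201260 is already within about 1.9e13 of INT64_MAX, so adding 41997676517838 (about 4.2e13) does exceed the range. The reported overflow is therefore arithmetically real; whether a counter should ever reach such a value is a separate question.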

[v3 PATCH 5/5] powerpc/pseries: Display machine check error details.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

Extract the MCE error details from the RTAS extended log and display them
on the console.

With this patch you should now see MCE logs like the ones below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824] Effective address: dca7
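
The "Initiator" line in the log above comes from a nibble extraction followed by a bounds-checked table lookup. A minimal user-space sketch of that step follows. The table contents and the `(byte2 & 0xf0) >> 4` extraction are taken from this patch; the function names `initiator_from_byte2` and `initiator_name` are hypothetical, and the bounds check plays the role of the VAL_TO_STRING macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Initiator names, in the order used by the patch. */
static const char * const initiators[] = {
	"Unknown", "CPU", "PCI", "ISA", "Memory", "Power Mgmt",
};

/* The initiator is the high nibble of byte2 of the RTAS error log. */
static uint8_t initiator_from_byte2(uint8_t byte2)
{
	return (byte2 & 0xf0) >> 4;
}

/* Bounds-checked lookup: out-of-range codes map to "Unknown". */
static const char *initiator_name(uint8_t byte2)
{
	uint8_t val = initiator_from_byte2(byte2);
	size_t n = sizeof(initiators) / sizeof(initiators[0]);

	return val < n ? initiators[val] : "Unknown";
}
```

The bounds check matters because the nibble can hold values up to 15 while the table has only six entries.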

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h  |5 +
 arch/powerpc/platforms/pseries/ras.c |  128 +-
 2 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3f2fba7ef23b..8100a95c133a 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+   return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x) ((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index e56759d92356..cd9446980092 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
-static int mce_handle_error(struct rtas_error_log *errp)
+#define VAL_TO_STRING(ar, val) ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+   struct rtas_error_log *errp, int disposition)
+{
+   const char *level, *sevstr;
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   uint8_t error_type, err_sub_type;
+   uint8_t initiator = rtas_error_initiator(errp);
+   uint64_t addr;
+
+   static const char * const initiators[] = {
+   "Unknown",
+   "CPU",
+   "PCI",
+   "ISA",
+   "Memory",
+   "Power Mgmt",
+   };
+   static const char * const mc_err_types[] = {
+   "UE",
+   "SLB",
+   "ERAT",
+   "TLB",
+   "D-Cache",
+   "Unknown",
+   "I-Cache",
+   };
+   static const char * const mc_ue_types[] = {
+   "Indeterminate",
+   "Instruction fetch",
+   "Page table walk ifetch",
+   "Load/Store",
+   "Page table walk Load/Store",
+   };
+
+   /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+   static const char * const mc_slb_types[] = {
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+   static const char * const mc_soft_types[] = {
+   "Unknown",
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   return;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+   error_type = rtas_mc_error_type(mce_log);
+   err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+   switch (rtas_error_severity(errp)) {
+   case RTAS_SEVERITY_NO_ERROR:
+   level = KERN_INFO;
+   sevstr = "Harmless";
+   break;
+   case RTAS_SEVERITY_WARNING:
+   level = KERN_WARNING;
+   sevstr = "";
+   break;
+   case RTAS_SEVERITY_ERROR:
+   case RTAS_SEVERITY_ERROR_SYNC:
+   level = KERN_ERR;
+   sevstr = "Severe";
+   break;
+   case RTAS_SEVERITY_FATAL:
+   default:
+   level = KERN_ERR;
+   sevstr = "Fatal";
+   break;
+   }
+
+   printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+   disposition == RTAS_DISP_FULLY_RECOVERED ?
+   "Recovered" : "Not recovered");
+   if (user_mode(regs)) {
+   printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+   regs->nip, current->pid, current->comm);
+   } else {
+   printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+   (void *)regs->nip);
+   }
+   printk("%s  Initiator: %s\n", level,
+   VAL_TO_STRING(initiators, initiator));
+
+   switch (error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   printk("%s  Error type: %s [%s]\n", level,
+   VAL_TO_STRING(mc_err_types, error_type),
+

[v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

If we get a machine check exception due to SLB errors, dump the
current SLB contents, which is very helpful in debugging the
root cause of SLB errors. On pseries, as of today, the system crashes on SLB
errors. These are soft errors and can be fixed by flushing the SLBs so
that the kernel can continue to function instead of crashing. This patch
also implements that recovery.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[  822.711728] slb contents:
[  822.711730] 00 c800 400ea1b217000500
[  822.711731]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
[  822.711732] 01 d800 400d43642f000510
[  822.711733]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  822.711734] 09 f800 400a86c85f000500
[  822.711736]   1T  ESID=   f0  VSID=  a86c85f LLP:100
[  822.711737] 10 7f000800 400d1f26e3000d90
[  822.711738]   1T  ESID=   7f  VSID=  d1f26e3 LLP:110
[  822.711739] 11 1800 000e3615f520fd90
[  822.711740]  256M ESID=1  VSID=   e3615f520f LLP:110
[  822.711740] 12 d800 400d43642f000510
[  822.711741]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  822.711742] 13 d800 400d43642f000510
[  822.711743]   1T  ESID=   d0  VSID=  d43642f LLP:110
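
The second line of each entry in the dump above is derived from the VSID word by masking and shifting. The following is a user-space sketch of that decoding, assuming the constants below match the kernel's SLB_VSID_* definitions (they are written out by hand here and should be treated as assumptions). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror SLB_VSID_B, SLB_VSID_B_1T, the VSID shifts and SLB_VSID_LLP. */
#define VSID_B_MASK	0xc000000000000000ULL
#define VSID_B_1T	0x4000000000000000ULL
#define VSID_SHIFT_256M	12
#define VSID_SHIFT_1T	24
#define VSID_LLP_MASK	0x130ULL

/* Hypothetical decoded view of the VSID word of one SLB entry. */
struct slb_vsid_info {
	bool is_1t;	/* true for a 1T segment, false for 256M */
	uint64_t vsid;	/* VSID with the segment-size bits removed */
	uint64_t llp;	/* page-size encoding bits */
};

static struct slb_vsid_info slb_decode_vsid(uint64_t v)
{
	struct slb_vsid_info info;

	info.is_1t = (v & VSID_B_1T) != 0;
	info.llp = v & VSID_LLP_MASK;
	info.vsid = (v & ~VSID_B_MASK) >>
		    (info.is_1t ? VSID_SHIFT_1T : VSID_SHIFT_256M);
	return info;
}
```

Applied to entry 00 of the dump (VSID word 400ea1b217000500), this yields a 1T segment with VSID ea1b217 and LLP 100, which agrees with the logged line. Entry 11 (000e3615f520fd90) decodes as a 256M segment with VSID e3615f520f and LLP 110.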


Suggested-by: Aneesh Kumar K.V 
Suggested-by: Michael Ellerman 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 +
 arch/powerpc/mm/slb.c |   35 +
 arch/powerpc/platforms/pseries/ras.c  |   29 -
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..c0da68927235 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_dump_contents(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 66577cc66dc9..799aa117cec3 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_dump_contents(void)
+{
+   int i;
+   unsigned long e, v;
+   unsigned long llp;
+
+   pr_err("slb contents:\n");
+   for (i = 0; i < mmu_slb_size; i++) {
+   asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+   asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+
+   if (!e && !v)
+   continue;
+
+   pr_err("%02d %016lx %016lx", i, e, v);
+
+   if (!(e & SLB_ESID_V)) {
+   pr_err("\n");
+   continue;
+   }
+   llp = v & SLB_VSID_LLP;
+   if (v & SLB_VSID_B_1T) {
+   pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID_1T(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+   llp);
+   } else {
+   pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+   llp);
+   }
+   }
+}
+
 void slb_vmalloc_update(void)
 {
unsigned long vflags;
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 2edc673be137..e56759d92356 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
+static int mce_handle_error(struct rtas_error_log *errp)
+{
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   int disposition = rtas_error_disposition(errp);
+   uint8_t error_type;
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   goto out;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+   error_type = rtas_mc_error_type(mce_log);
+
+   if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
+   (error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
+   slb_dump_contents();
+   slb_flush_and_rebolt();
+   disposition = RTAS_DISP_FULLY_RECOVERED;
+   }
+
+out:
+   return disposition;
+}
+
 /*
  * See if we can recover from a machine check exception.
  * This is only called on power4 (or above) and only via
@@ -434,7 +459,9 @@ int

[v3 PATCH 3/5] powerpc/pseries: Define MCE error event section.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

On pseries, the machine check error details are part of the RTAS extended
event log, passed under the Machine check exception section. This patch adds
the definition of the RTAS MCE event section and related helper
functions.
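
The sub-type bytes defined by this section pack several flags into one byte. A user-space sketch of the decoding is shown here. The masks 0x07, 0x03, 0x40 and 0x80 are the ones used by the helpers in this patch; the 0x80 and 0x20 masks for the UE byte are inferred from the bit-layout comment in the struct and are assumptions. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the UE sub-type byte. */
struct ue_err_info {
	bool perm_or_transient;		/* top bit; polarity is not stated in the patch */
	bool has_effective_addr;	/* 0x40, as tested by the patch */
	bool has_logical_addr;		/* 0x20, inferred from the bit layout */
	uint8_t type;			/* low 3 bits, mask 0x07 as in the patch */
};

static struct ue_err_info decode_ue_err_type(uint8_t b)
{
	struct ue_err_info info;

	info.perm_or_transient = (b & 0x80) != 0;
	info.has_effective_addr = (b & 0x40) != 0;
	info.has_logical_addr = (b & 0x20) != 0;
	info.type = b & 0x07;
	return info;
}

/* Soft (SLB/ERAT/TLB) errors keep the sub type in the low 2 bits. */
static uint8_t soft_err_sub_type(uint8_t soft_err_type)
{
	return soft_err_type & 0x03;
}

/* For soft errors the top bit says whether an effective address is present. */
static bool soft_err_has_effective_addr(uint8_t soft_err_type)
{
	return (soft_err_type & 0x80) != 0;
}
```

For example, a UE byte of 0x43 decodes as sub type 3 (load/store) with an effective address provided, and a soft-error byte of 0x81 decodes as sub type 1 with an effective address provided.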

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h |  104 +++
 1 file changed, 104 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..3f2fba7ef23b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -275,6 +275,7 @@ inline uint32_t rtas_ext_event_company_id(struct rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME (('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF  (('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG   (('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE   (('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +327,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT   3
 #define PSERIES_HP_ELOG_ID_DRC_IC  4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+   __be32  fru_id;
+   __be32  proc_id;
+   uint8_t error_type;
+   union {
+   struct {
+   uint8_t ue_err_type;
+   /*
+    * X         1: Permanent or Transient UE.
+    *  X        1: Effective address provided.
+    *   X       1: Logical address provided.
+    *    XX     2: Reserved.
+    *      XXX  3: Type of UE error.
+    */
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   __be64  logical_address;
+   } ue_error;
+   struct {
+   uint8_t soft_err_type;
+   /*
+    * X         1: Effective address provided.
+    *  XXXXX    5: Reserved.
+    *       XX  2: Type of SLB/ERAT/TLB error.
+    */
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   uint8_t reserved_2[8];
+   } soft_error;
+   } u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE   0x00
+#define PSERIES_MC_ERROR_TYPE_SLB  0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT 0x02
+#define PSERIES_MC_ERROR_TYPE_TLB  0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE  0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE  0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE  0
+#define PSERIES_MC_ERROR_UE_IFETCH 1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH 2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE 3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE 4
+
+#define PSERIES_MC_ERROR_SLB_PARITY0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT  1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE 2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY   1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT 2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE3
+
+#define PSERIES_MC_ERROR_TLB_PARITY1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT  2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE 3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog *mlog)
+{
+   return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+   const struct pseries_mc_errorlog *mlog)
+{
+   switch (mlog->error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   return (mlog->u.ue_error.ue_err_type & 0x07);
+   case PSERIES_MC_ERROR_TYPE_SLB:
+   case PSERIES_MC_ERROR_TYPE_ERAT:
+   case PSERIES_MC_ERROR_TYPE_TLB:
+   return (mlog->u.soft_error.soft_err_type & 0x03);
+   default:
+   return 0;
+   }
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+   const struct pseries_mc_errorlog *mlog)
+{
+   uint64_t addr = 0;
+
+   switch (mlog->error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   if (mlog->u.ue_error.ue_err_type & 0x40)
+   addr = mlog->u.ue_error.effective_address;
+   break;
+   case PSERIES_MC_ERROR_TYPE_SLB:
+   case PSERIES_MC_ERROR_TYPE_ERAT:
+   case PSERIES_MC_ERROR_TYPE_TLB:
+   if (mlog->u.soft_error.soft_err_type & 0x80)
+   addr =

[v3 PATCH 2/5] powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

During a machine check interrupt on the pseries platform, register r3 points
to the RTAS extended event log passed by the hypervisor. Since the hypervisor
uses r3 to pass the pointer to the RTAS log, it stores the original r3 value at
the start of the memory (first 8 bytes) pointed to by r3. Since the hypervisor
stores this info and the RTAS log is in BE format, Linux should make
sure to restore the r3 value in the correct endian format.

Without this patch, when the MCE handler returns after recovery to the code
that caused the MCE, it may end up with a Data SLB access interrupt for an
invalid address, followed by a kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971] Effective address: dca7
cpu 0xa: Vector: 380 (Data SLB Access) at [c000fc7775b0]
pc: c09694c0: vsnprintf+0x80/0x480
lr: c09698e0: vscnprintf+0x20/0x60
sp: c000fc777830
   msr: 82009033
   dar: a803a30c00d0
  current = 0xcbc9ef00
  paca= 0xc0001eca5c00   softe: 3irq_happened: 0x01
pid   = 8860, comm = insmod
[c000fc7778b0] c09698e0 vscnprintf+0x20/0x60
[c000fc7778e0] c016b6c4 vprintk_emit+0xb4/0x4b0
[c000fc777960] c016d40c vprintk_func+0x5c/0xd0
[c000fc777980] c016cbb4 printk+0x38/0x4c
[c000fc7779a0] dca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c000fc777a40] c000d9c4 do_one_initcall+0x54/0x230
[c000fc777b00] c01b3b74 do_init_module+0x8c/0x248
[c000fc777b90] c01b2478 load_module+0x12b8/0x15b0
[c000fc777d30] c01b29e8 sys_finit_module+0xa8/0x110
[c000fc777e30] c000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 7fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: sta...@vger.kernel.org
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/pseries/ras.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..2edc673be137 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -360,7 +360,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
}
 
savep = __va(regs->gpr[3]);
-   regs->gpr[3] = savep[0];/* restore original r3 */
+   regs->gpr[3] = be64_to_cpu(savep[0]);   /* restore original r3 */
 
/* If it isn't an extended log we can use the per cpu 64bit buffer */
h = (struct rtas_error_log *)&savep[1];
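
The effect of the be64_to_cpu() conversion in this patch can be illustrated in user space with a portable big-endian load. This is a sketch only: `load_be64` is a hypothetical helper and not the kernel's implementation, and the sample bytes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 64-bit value from 8 bytes stored
 * most-significant byte first. The result is the same on little-endian
 * and big-endian hosts, which is the property be64_to_cpu() provides
 * for a value the hypervisor stored in BE format.
 */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	assert(p != NULL);
	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

Reading the same 8 bytes directly as a native 64-bit integer on a little-endian host would give the byte-reversed value, which is how the restored r3 ended up pointing at an invalid address.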

[v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

rtas_log_buf is a buffer that holds RTAS event data communicated
to the kernel by the hypervisor. This buffer is then used to pass RTAS event
data to the user through procfs. This buffer is allocated from the vmalloc
(non-linear mapping) area.

On a machine check interrupt, register r3 points to the RTAS extended event
log passed by the hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. Because
rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking a
page fault (vector 0x300) while accessing it. Since the machine check
interrupt handler runs in NMI context, we cannot afford to take any
page fault. Page faults are not honored in NMI context and cause a
kernel panic. This patch fixes this issue by allocating rtas_log_buf
using kmalloc.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: sta...@vger.kernel.org
Suggested-by: Aneesh Kumar K.V 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/rtasd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index f915db93cd42..3957d4ae2ba2 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
rtas_error_log_max = rtas_get_error_log_max();
rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
 
-   rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
+   rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, GFP_KERNEL);
if (!rtas_log_buf) {
printk(KERN_ERR "rtasd: no memory\n");
return -ENOMEM;

[v3 PATCH 0/5] powerpc/pseries: Machine check handler improvements.

2018-06-07 Thread Mahesh J Salgaonkar

This patch series includes some improvements to the machine check handler
for pseries. Patch 1 fixes an issue where the machine check handler crashes
the kernel while accessing a vmalloc-ed buffer in NMI context.
Patch 2 fixes an endian bug while restoring r3 in the MCE handler.
Patch 4 dumps the SLB contents on SLB MCE errors to improve debuggability.
Patch 5 displays the MCE error details on the console.

Change in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endian bug while restoring r3 in MCE handler.

---

Mahesh Salgaonkar (5):
  powerpc/pseries: convert rtas_log_buf to linear allocation.
  powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
  powerpc/pseries: Define MCE error event section.
  powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
  powerpc/pseries: Display machine check error details.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
 arch/powerpc/include/asm/rtas.h   |  109 ++
 arch/powerpc/kernel/rtasd.c   |2 
 arch/powerpc/mm/slb.c |   35 ++
 arch/powerpc/platforms/pseries/ras.c  |  155 +
 5 files changed, 299 insertions(+), 3 deletions(-)


Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions

2018-06-07 Thread Alex Williamson

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy  wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c | 10 --
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap)
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
> - if (!vdev->bar_mmap_supported[index])
> - return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>   phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>   req_len = vma->vm_end - vma->vm_start;

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy  wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h   |   3 +
>  drivers/vfio/pci/vfio_pci.c |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
> 
>  drivers/vfio/pci/Kconfig|   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2   (4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   }
>   }
>  
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + pdev->device == 0x1db1 &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> + ret = vfio_pci_nvlink2_init(vdev);
> + if (ret)
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + }
> +
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *   Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> + unsigned long gpu_hpa;
> + unsigned long useraddr;
> + unsigned long size;
> + struct mm_struct

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy  wrote:

> Here is an RFC of some patches adding pass-through support
> for the NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied by 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but a separate bus. The design inherits from the original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The number of GPUs suggests passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger than or equal to the actual GPU RAM size) in system
> memory with no page structs backing this window and we cannot touch this
> memory before the NVIDIA driver configures it in a host or a guest because
> otherwise an HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400  
> 0x0420  
> 0x0440  
> 0x2400  
> 0x2420  
> 0x2440  
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> to translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides a userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but which is worth
> mentioning: although it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host,
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
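
As a rough illustration of the table-size arithmetic quoted above, the sketch below computes how many bytes a single-level TCE table needs for a window that starts at zero, using the same shape as the quoted formula (shift by the IOMMU page size, 8 bytes per entry). The function name and the 1TB window used in the usage note are hypothetical, chosen only for illustration; they are not taken from the patchset or from the example system.

```c
#include <stdint.h>

/*
 * Bytes needed for a single-level TCE table covering [0, window_end).
 * Assumptions: one 8-byte TCE per IOMMU page, and page_shift is log2 of
 * the IOMMU page size (16 for 64KB pages, 24 for 16MB pages).
 */
static uint64_t tce_table_bytes(uint64_t window_end, unsigned int page_shift)
{
    return (window_end >> page_shift) * 8;
}
```

With a hypothetical 1TB window, tce_table_bytes(1ULL << 40, 16) gives 128MB for 64KB pages, while tce_table_bytes(1ULL << 40, 24) gives 512KB for 16MB pages, which is why larger IOMMU pages and on-demand allocation of indirect levels both help.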
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change

Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code

2018-06-07 Thread Naveen N. Rao


Masami Hiramatsu wrote:
> On Thu, 07 Jun 2018 17:07:00 +0530
> "Naveen N. Rao"  wrote:
>
> > Masami Hiramatsu wrote:
> > > Don't call the ->break_handler() from the arm kprobes code,
> >   ^^^ powerpc
> >
> > > because it was only used by jprobes which got removed.
> > >
> > > This also makes skip_singlestep() a static function since
> > > only ftrace-kprobe.c is using this function.
> > >
> > > Signed-off-by: Masami Hiramatsu 
> > > Cc: Benjamin Herrenschmidt 
> > > Cc: Paul Mackerras 
> > > Cc: Michael Ellerman 
> > > Cc: "Naveen N. Rao" 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > ---
> > >  arch/powerpc/include/asm/kprobes.h   |   10 --
> > >  arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-
> > >  arch/powerpc/kernel/kprobes.c|   31 +++
> > >  3 files changed, 14 insertions(+), 43 deletions(-)
> >
> > With 2 small comments...
>
> 2 ? or 1 ?

Two, with one in the commit log above :)

- Naveen

Re: [v2 PATCH 0/5] powerpc/pseries: Machien check handler improvements.

2018-06-07 Thread Mahesh Jagannath Salgaonkar

On 06/07/2018 04:15 PM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 15:36:25 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> This patch series includes some improvements to the machine check handler
>> for pseries. Patch 1 fixes an issue where the machine check handler crashes
>> the kernel while accessing a vmalloc-ed buffer in NMI context.
>> Patch 3 dumps the SLB contents on SLB MCE errors to improve debuggability.
>> Patch 4 displays the MCE error details on the console.
>>
>> Change in V2:
>> - patch 4: Display additional info (NIP and task info) in MCE error details.
>> - patch 5: Fix endian bug while restoring r3 in MCE handler.
>>
>> ---
>>
>> Mahesh Salgaonkar (5):
>>   powerpc/pseries: convert rtas_log_buf to linear allocation.
>>   powerpc/pseries: Define MCE error event section.
>>   powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>>   powerpc/pseries: Display machine check error details.
>>   powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
> 
> These look good, should patch 5 be moved to patch 2 and the first 2
> patches marked for stable?

Yup. Will move patch 5 to 2nd position.

> 
> Do you also plan to dump SLB contents for bare metal MCEs?

Yes. That's the plan. Will do that separately.

Thanks,
-Mahesh.

Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices

2018-06-07 Thread Michael S. Tsirkin

On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > Pls work on a long term solution. Short term needs can be served by
> > enabling the iommu platform in qemu.
> 
> So, I spent some time looking at converting virtio to dma ops overrides,
> and the current virtio spec, and the sad truth I have to tell is that
> both the spec and the Linux implementation are complete and utterly fucked
> up.

Let me restate it: DMA API has support for a wide range of hardware, and
hardware based virtio implementations likely won't benefit from all of
it.

And given virtio right now is optimized for specific workloads, improving
portability without regressing performance isn't easy.

I think it's unsurprising since it started as strictly a guest/host
mechanism.  People did implement offloads on specific platforms though,
and they are known to work. To improve portability even further,
we might need to make spec and code changes.

I'm not really sympathetic to people complaining that they can't even
set a flag in qemu though. If that's the case the stack in question is
way too inflexible.

> Both in the flag naming and the implementation there is an implication
> of DMA API == IOMMU, which is fundamentally wrong.

Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.

It's possible that some setups will benefit from a more
fine-grained approach where some aspects of the DMA
API are bypassed, others aren't.

This seems to be what was being asked for in this thread,
with comments claiming IOMMU flag adds too much overhead.

> The DMA API does a few different things:
> 
>  a) address translation
> 
>   This does include IOMMUs.  But it also includes random offsets
>   between PCI bars and system memory that we see on various
>   platforms.

I don't think you mean bars. That's unrelated to DMA.

>  Worse so some of these offsets might be based on
>   banks, e.g. on the broadcom bmips platform.  It also deals
>   with bitmask in physical addresses related to memory encryption
>   like AMD SEV.  I'd be really curious how for example the
>   Intel virtio based NIC is going to work on any of those
>   platforms.

SEV guys report that they just set the iommu flag and then it all works.
I guess if there's translation we can think of this as a kind of iommu.
Maybe we should rename PLATFORM_IOMMU to PLATFORM_TRANSLATION?

And apparently some people complain that just setting that flag makes
qemu check translation on each access with an unacceptable performance
overhead.  Forcing same behaviour for everyone on general principles
even without the flag is unlikely to make them happy.

>   b) coherency
> 
>   On many architectures DMA is not cache coherent, and we need
>   to invalidate and/or write back cache lines before doing
>   DMA.  Again, I wonder how this is ever going to work with
>   hardware based virtio implementations.

You mean dma_Xmb and friends?
There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
for that.

>  Even worse I think this
>   is actually broken at least for VIVT even for virtualized
>   implementations.  E.g. a KVM guest is going to access memory
>   using different virtual addresses than qemu, vhost might throw
>   in another different address space.

I don't really know what VIVT is. Could you help me please?

>   c) bounce buffering
> 
>   Many DMA implementations can not address all physical memory
>   due to addressing limitations.  In such cases we copy the
>   DMA memory into a known addressable bounce buffer and DMA
>   from there.

Don't do it then?
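
For readers unfamiliar with item c), the following is a minimal userspace sketch of the bounce-buffering idea described in the quoted text: if a buffer lies above what the device can address, its contents are copied into a buffer the device can reach, and the device is handed that address instead. All names here (needs_bounce, map_for_device, the explicit physical addresses) are hypothetical and simplified for illustration; this is not the kernel DMA API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if any byte of [phys, phys + len) lies above the device's DMA mask. */
static int needs_bounce(uint64_t phys, size_t len, uint64_t dma_mask)
{
    return len != 0 && phys + (len - 1) > dma_mask;
}

/*
 * Returns the address the device should DMA from. When the original
 * buffer is not addressable, the data is first copied into the
 * caller-provided bounce buffer, assumed to live at bounce_phys.
 */
static uint64_t map_for_device(const void *data, size_t len, uint64_t phys,
                               uint64_t dma_mask, void *bounce,
                               uint64_t bounce_phys)
{
    if (!needs_bounce(phys, len, dma_mask))
        return phys;            /* device can reach the buffer directly */

    memcpy(bounce, data, len);  /* extra copy: the cost of bouncing */
    return bounce_phys;
}
```

The extra memcpy() is the price paid on every transfer, which is one reason a paravirtual device that can address all of guest memory would prefer to skip this step.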

>   d) flushing write combining buffers or similar
> 
>   On some hardware platforms we need workarounds to e.g. read
>   from a certain mmio address to make sure DMA can actually
>   see memory written by the host.

I guess it isn't an issue as long as WC isn't actually used.
It will become an issue when virtio spec adds some WC capability -
I suspect we can ignore this for now.

> 
> All of this is bypassed by virtio by default despite generally being
> platform issues, not particular to a given device.

It's both a device and a platform issue. A PV device is often more like
another CPU than like a PCI device.

-- 
MST

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread Jens Axboe

On 6/7/18 8:45 AM, Jens Axboe wrote:
> On 6/7/18 4:37 AM, vrbagal1 wrote:
>> On 2018-06-07 13:12, Bart Van Assche wrote:
>>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>>>> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>>>>> Observing Kernel oops and machine reboots while executing memory hotplug
>>>>>> test case, on Power8 Baremetal machine.
>>>>>>
>>>>>> I see this is introduced some where between rc6 and 4.17.
>>>>>
>>>>> Please provide the exact versions (git commit IDs) of the kernel versions
>>>>> you have tested.
>>>>
>>>> Commit Id ---> 5037be168f
>>>
>>> The reason I was asking for the commit ID is because I saw that clone_endio()
>>> occurs in the oops which means that the dm driver is involved. An important fix
>>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use kzalloc
>>> for all structs with embedded biosets/mempools"). Can you double check whether
>>> that commit is present in your tree? If it is not present, please update to the
>>> latest master and retest. If it is present, please report how to reproduce
>>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>>>
>>> Thanks,
>>>
>>> Bart.
>>
>>
>> Yes, the fix is present in the tree, which I have tested.
>>
>> Steps to reproduce:
>>
>> Step1: Clone and install avocado: git clone https://github.com/avocado-framework/avocado.git
>> Step2: Clone https://github.com/avocado-framework-tests/avocado-misc-tests.git
>> Test case is https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
>> Step3: Command to run the test is: avocado run avocado-misc-tests/memory/memhotplug.py
> 
> Can you try with the below? Not a fully formed fix since I'd prefer
> if the dm bioset copy stuff was changed instead, but worth a shot.

This is closer to an actual fix, please try that instead.


diff --git a/block/bio.c b/block/bio.c
index 595663e0281a..0616d86b15c6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1967,6 +1967,21 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+int bioset_init_from_src(struct bio_set *new, struct bio_set *src)
+{
+   unsigned int pool_size = src->bio_pool.min_nr;
+   int flags;
+
+   flags = 0;
+   if (src->bvec_pool.min_nr)
+   flags |= BIOSET_NEED_BVECS;
+   if (src->rescue_workqueue)
+   flags |= BIOSET_NEED_RESCUER;
+
+   return bioset_init(new, pool_size, src->front_pad, flags);
+}
+EXPORT_SYMBOL(bioset_init_from_src);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..20a8d63754bf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1953,9 +1953,10 @@ static void free_dev(struct mapped_device *md)
kvfree(md);
 }
 
-static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
+static int __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
struct dm_md_mempools *p = dm_table_get_md_mempools(t);
+   int ret = 0;
 
if (dm_table_bio_based(t)) {
/*
@@ -1982,13 +1983,16 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
               bioset_initialized(&md->bs) ||
               bioset_initialized(&md->io_bs));
 
-   md->bs = p->bs;
-   memset(&p->bs, 0, sizeof(p->bs));
-   md->io_bs = p->io_bs;
-   memset(&p->io_bs, 0, sizeof(p->io_bs));
+   ret = bioset_init_from_src(&md->bs, &p->bs);
+   if (ret)
+   goto out;
+   ret = bioset_init_from_src(&md->io_bs, &p->io_bs);
+   if (ret)
+   bioset_exit(&md->bs);
 out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
+   return ret;
 }
 
 /*
@@ -2033,6 +2037,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct request_queue *q = md->queue;
bool request_based = dm_table_request_based(t);
sector_t size;
+   int ret;
 
lockdep_assert_held(&md->suspend_lock);
 
@@ -2068,7 +2073,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
md->immutable_target = dm_table_get_immutable_target(t);
}
 
-   __bind_mempools(md, t);
+   ret = __bind_mempools(md, t);
+   if (ret) {
+   old_map = ERR_PTR(ret);
+   goto out;
+   }
 
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
@@ -2078,6 +2087,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
if (old_map)
dm_sync_table(md);
 
+out:
return old_map;
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 810a8bee8f85..307682ac2f31 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -417,6 +417,7 @@ enum {
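
The __bind_mempools() hunk above replaces a struct assignment followed by memset() with a fresh initialization from the source's parameters. One plausible reason, stated here as an assumption since the message does not spell it out, is that a structure with embedded self-referential members (list heads, wait queues, work items) cannot be relocated by plain assignment: the copied pointers still refer to the old object. The toy types below are hypothetical stand-ins, not the kernel's struct bio_set or mempool_t.

```c
#include <stddef.h>

/* Minimal circular list head, in the style of the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical pool with an embedded, self-referential list head. */
struct pool {
    struct list_head waiters;
    int min_nr;
};

static void pool_init(struct pool *p, int min_nr)
{
    p->waiters.next = &p->waiters;  /* empty list points at itself */
    p->waiters.prev = &p->waiters;
    p->min_nr = min_nr;
}

static int list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/*
 * Analogous in spirit to bioset_init_from_src(): build a new object from
 * the source's parameters instead of copying its bytes.
 */
static void pool_init_from_src(struct pool *dst, const struct pool *src)
{
    pool_init(dst, src->min_nr);
}
```

After `b = a;` the copy's waiters.next still points into `a`, so the copy no longer looks like an empty list and will dangle once `a` is cleared or freed; pool_init_from_src() avoids that by re-running the initializer.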

[RFC PATCH 2/5] powerpc: Flush checkpointed gpr state for 32-bit processes in ptrace

2018-06-07 Thread Pedro Franco de Carvalho

Currently ptrace doesn't flush the register state when the
checkpointed GPRs of a 32-bit thread are accessed. This can cause core
dumps to have stale data in the checkpointed GPR note.
---
 arch/powerpc/kernel/ptrace.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 6618570c6d56..be8ca03a0bd5 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2124,6 +2124,16 @@ static int tm_cgpr32_get(struct task_struct *target,
 unsigned int pos, unsigned int count,
 void *kbuf, void __user *ubuf)
 {
+   if (!cpu_has_feature(CPU_FTR_TM))
+   return -ENODEV;
+
+   if (!MSR_TM_ACTIVE(target->thread.regs->msr))
+   return -ENODATA;
+
+   flush_tmregs_to_thread(target);
+   flush_fp_to_thread(target);
+   flush_altivec_to_thread(target);
+
return gpr32_get_common(target, regset, pos, count, kbuf, ubuf,
&target->thread.ckpt_regs.gpr[0]);
 }
@@ -2133,6 +2143,16 @@ static int tm_cgpr32_set(struct task_struct *target,
 unsigned int pos, unsigned int count,
 const void *kbuf, const void __user *ubuf)
 {
+   if (!cpu_has_feature(CPU_FTR_TM))
+   return -ENODEV;
+
+   if (!MSR_TM_ACTIVE(target->thread.regs->msr))
+   return -ENODATA;
+
+   flush_tmregs_to_thread(target);
+   flush_fp_to_thread(target);
+   flush_altivec_to_thread(target);
+
return gpr32_set_common(target, regset, pos, count, kbuf, ubuf,
&target->thread.ckpt_regs.gpr[0]);
 }
-- 
2.13.6

[RFC PATCH 4/5] powerpc: Add VSX regset to compat_regsets

2018-06-07 Thread Pedro Franco de Carvalho

This patch copies the missing VSX regset to the compat_regsets
array.

Not having this regset can cause issues in fs/binfmt_elf.c in the
fill_thread_core_info function, which iterates over all the regsets
defined in compat_regsets to fill note info for a core dump of a
32-bit thread. However, the number of regset notes allocated for
writing is the number of regsets with core_note_type != 0. If the
regset array has an entry with core_note_type == 0, which is the case
for the missing VSX element, this can cause later regsets to be
written outside the bounds of the allocated notes.

The compat_regset is also missing entries for REGSET_PMR and
REGSET_PKEY, but because these are at the end of the powerpc_regset
enum, the designated initializers for the compat_regset array don't
cause implicit elements to be created, like they did for REGSET_VSX.
---
 arch/powerpc/kernel/ptrace.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 69123feaef9e..2da0668a96dc 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2237,6 +2237,13 @@ static const struct user_regset compat_regsets[] = {
.active = vr_active, .get = vr_get, .set = vr_set
},
 #endif
+#ifdef CONFIG_VSX
+   [REGSET_VSX] = {
+   .core_note_type = NT_PPC_VSX, .n = 32,
+   .size = sizeof(double), .align = sizeof(double),
+   .active = vsr_active, .get = vsr_get, .set = vsr_set
+   },
+#endif
 #ifdef CONFIG_SPE
[REGSET_SPE] = {
.core_note_type = NT_PPC_SPE, .n = 35,
-- 
2.13.6
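
The commit message of patch 4/5 above relies on a property of C designated initializers: skipping an index in the middle of an array still creates that element, zero-filled. The sketch below shows the resulting mismatch between the number of array entries and the number of entries with a non-zero note type. The enum, the struct and the note-type values are hypothetical stand-ins, not the kernel's powerpc_regset or user_regset definitions.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down regset descriptor. */
struct regset {
    unsigned int core_note_type;
};

/* Hypothetical indices; RS_VSX sits between two initialized entries. */
enum { RS_GPR, RS_FPR, RS_VMX, RS_VSX, RS_SPE };

static const struct regset compat[] = {
    [RS_GPR] = { .core_note_type = 1 },
    [RS_FPR] = { .core_note_type = 2 },
    [RS_VMX] = { .core_note_type = 0x100 },
    /* RS_VSX is not listed, so the compiler emits an all-zero element. */
    [RS_SPE] = { .core_note_type = 0x101 },
};

#define N_REGSETS (sizeof(compat) / sizeof(compat[0]))

/* Counts entries that would get a note, i.e. core_note_type != 0. */
static size_t count_notes(const struct regset *rs, size_t n)
{
    size_t i, notes = 0;

    for (i = 0; i < n; i++)
        if (rs[i].core_note_type != 0)
            notes++;
    return notes;
}
```

Here the array has five elements but only four would be counted as notes, so code that allocates by the count yet indexes by array position can run past the allocation, which is the failure mode the commit message describes. Entries missing only at the end of the enum do not have this effect because the array simply ends earlier.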

[RFC PATCH 5/5] powerpc: Add PMU regset to compat_regsets

2018-06-07 Thread Pedro Franco de Carvalho

This patch allows setting and getting PMU registers from 32-bit
threads.
---
 arch/powerpc/kernel/ptrace.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 2da0668a96dc..3a9c4ae65429 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2317,6 +2317,11 @@ static const struct user_regset compat_regsets[] = {
.size = sizeof(u64), .align = sizeof(u64),
.active = ebb_active, .get = ebb_get, .set = ebb_set
},
+   [REGSET_PMR] = {
+   .core_note_type = NT_PPC_PMU, .n = ELF_NPMU,
+   .size = sizeof(u64), .align = sizeof(u64),
+   .active = pmu_active, .get = pmu_get, .set = pmu_set
+   },
 #endif
 };
 
-- 
2.13.6

[RFC PATCH 3/5] powerpc: Fix pmu get/set functions

2018-06-07 Thread Pedro Franco de Carvalho

The PMU regset exposed through ptrace has 5 64-bit words, which are
all copied in and out. However, mmcr0 in the thread_struct is an
unsigned, which causes pmu_set to clobber the next variable in the
thread_struct (used_ebb), and pmu_get to return the same variable in
one half of the mmcr0 slot.
---
 arch/powerpc/kernel/ptrace.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index be8ca03a0bd5..69123feaef9e 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1733,6 +1733,9 @@ static int pmu_get(struct task_struct *target,
  unsigned int pos, unsigned int count,
  void *kbuf, void __user *ubuf)
 {
+   int ret = 0;
+   unsigned long mmcr0 = target->thread.mmcr0;
+
/* Build tests */
BUILD_BUG_ON(TSO(siar) + sizeof(unsigned long) != TSO(sdar));
BUILD_BUG_ON(TSO(sdar) + sizeof(unsigned long) != TSO(sier));
@@ -1742,9 +1745,16 @@ static int pmu_get(struct task_struct *target,
if (!cpu_has_feature(CPU_FTR_ARCH_207S))
return -ENODEV;
 
-   return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
-   &target->thread.siar, 0,
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+   &target->thread.siar, 0,
+   4 * sizeof(unsigned long));
+
+   if (!ret)
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+   &mmcr0, 4 * sizeof(unsigned long),
5 * sizeof(unsigned long));
+
+   return ret;
 }
 
 static int pmu_set(struct task_struct *target,
@@ -1754,6 +1764,12 @@ static int pmu_set(struct task_struct *target,
 {
int ret = 0;
 
+#ifdef __BIG_ENDIAN
+   int mmcr0_offset = sizeof(unsigned);
+#else
+   int mmcr0_offset = 0;
+#endif
+
/* Build tests */
BUILD_BUG_ON(TSO(siar) + sizeof(unsigned long) != TSO(sdar));
BUILD_BUG_ON(TSO(sdar) + sizeof(unsigned long) != TSO(sier));
@@ -1783,9 +1799,16 @@ static int pmu_set(struct task_struct *target,
4 * sizeof(unsigned long));
 
if (!ret)
+   ret = user_regset_copyin_ignore(&pos, &count, &kbuf,
+   &ubuf, 4 * sizeof(unsigned long),
+   4 * sizeof(unsigned long) + mmcr0_offset);
+
+   if (!ret)
ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-   &target->thread.mmcr0, 4 * sizeof(unsigned long),
-   5 * sizeof(unsigned long));
+   &target->thread.mmcr0,
+   4 * sizeof(unsigned long) + mmcr0_offset,
+   4 * sizeof(unsigned long) + mmcr0_offset
+   + sizeof (unsigned));
return ret;
 }
 #endif
-- 
2.13.6
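
The problem fixed by patch 3/5 above can be reproduced in miniature in userspace: copying a full 64-bit regset slot into a 32-bit field overwrites whatever follows it, and the fix is to copy only the half of the slot that holds the value, chosen by endianness. The struct and helper names below are hypothetical, the endianness is probed at run time rather than with __BIG_ENDIAN, and the sketch assumes a 4-byte unsigned int, as the patch itself does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two adjacent thread_struct fields. */
struct fake_thread {
    unsigned int mmcr0;
    unsigned int used_ebb;
};

/* Unfixed behaviour: write the whole 8-byte slot starting at mmcr0. */
static void set_mmcr0_full_slot(struct fake_thread *t, uint64_t slot)
{
    unsigned char *base = (unsigned char *)t;

    memcpy(base + offsetof(struct fake_thread, mmcr0), &slot, sizeof(slot));
}

/* Byte offset of the low 32-bit word inside a 64-bit slot. */
static size_t mmcr0_offset(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    /* Little endian: low word first. Big endian: low word last. */
    return first ? 0 : sizeof(uint64_t) - sizeof(unsigned int);
}

/* Fixed behaviour: copy only the word that actually holds mmcr0. */
static void set_mmcr0_low_word(struct fake_thread *t, uint64_t slot)
{
    memcpy(&t->mmcr0, (unsigned char *)&slot + mmcr0_offset(),
           sizeof(t->mmcr0));
}
```

With set_mmcr0_low_word() the neighbouring used_ebb field is left alone, whereas set_mmcr0_full_slot() overwrites it with the other half of the slot on either endianness.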

[RFC PATCH 1/5] powerpc: Fix inverted active predicate for setting the EBB regset

2018-06-07 Thread Pedro Franco de Carvalho

Currently, the ebb_set function for writing to the EBB regset returns
ENODATA when ebb is active in the thread, and copies in the data when
it is inactive. This patch inverts the condition so that it matches
ebb_get and ebb_active.
---
 arch/powerpc/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index d23cf632edf0..6618570c6d56 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1701,7 +1701,7 @@ static int ebb_set(struct task_struct *target,
if (!cpu_has_feature(CPU_FTR_ARCH_207S))
return -ENODEV;
 
-   if (target->thread.used_ebb)
+   if (!target->thread.used_ebb)
return -ENODATA;
 
ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-- 
2.13.6

[RFC PATCH 0/5] powerpc: Misc. ptrace regset fixes

2018-06-07 Thread Pedro Franco de Carvalho

This series attempts to fix a few issues with ptrace regsets.

Patch 1 simply inverts the active predicate for ebb_set. I don't know
if there was a reason for having opposite predicates in
ebb_get/ebb_set, but I assumed this was a typo.

Patch 2 adds the usual HTM prologue for regsets to the tm_cgpr32
get/set functions, so that the cgprs are flushed. I don't really
understand the need for flushing the fp and altivec states, but I
copied that over since it was done in the regular tm_cgpr get/set
functions.

Patch 3 changes the pmu get/set functions so that they don't read or
write outside the bounds of thread_struct.mmcr0. The endianness of the
kernel is used to determine where the mmcr0 word should be placed (or
read from) in its corresponding 64-bit slot in the regset. I am not
sure if this is the correct way to go, or if the endianness of the
thread being traced should determine this position (can the kernel run
threads with a different endianness?). I used the kernel endianness
because that is what seems to happen for other registers smaller than
their regset fields (for instance, it seems that checkpointed CR is
saved by the kernel as a doubleword, so the position of the word
depends on the kernel's endianness). The rest of the function assumes
that unsigned longs are doublewords, so the patch assumes that an
unsigned is a word. This patch (and the original pmu_get/set
functions) might not work if the kernel is compiled in 32 bits.

Patch 4 adds the VSX regset to compat_regsets, which could cause out
of bounds writes in fs/binfmt_elf.c.

Patch 5 adds the PMU regset to compat_regsets.

I also noticed that the regset for CGPRs for 32-bit threads has 48 * 8
bytes (same as the one for 64-bit threads), but the data only occupies
the first 48 * 4 bytes (like for the 32-bit GPR regset). I am not sure
if this was intended, or if it can be changed now that other programs
might already assume the 48 * 8 size. If the kernel is compiled in
32-bits, the size will change (because it depends on sizeof (long)),
but I don't know if HTM and the corresponding regsets are supported in
the first place for a 32-bit kernel.

I haven't added the PKEY regset to compat_regsets. Does that make
sense for 32-bit threads?

Pedro Franco de Carvalho (5):
  powerpc: Fix inverted active predicate for setting the EBB regset
  powerpc: Flush checkpointed gpr state for 32-bit processes in ptrace
  powerpc: Fix pmu get/set functions
  powerpc: Add VSX regset to compat_regsets
  powerpc: Add PMU regset to compat_regsets

 arch/powerpc/kernel/ptrace.c | 65 
 1 file changed, 60 insertions(+), 5 deletions(-)

-- 
2.13.6

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread Jens Axboe

On 6/7/18 4:37 AM, vrbagal1 wrote:
> On 2018-06-07 13:12, Bart Van Assche wrote:
>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>>> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>>>> Observing Kernel oops and machine reboots while executing memory hotplug
>>>>> test case, on Power8 Baremetal machine.
>>>>>
>>>>> I see this is introduced some where between rc6 and 4.17.
>>>>
>>>> Please provide the exact versions (git commit IDs) of the kernel versions
>>>> you have tested.
>>>
>>> Commit Id ---> 5037be168f
>>
>> The reason I was asking for the commit ID is because I saw that clone_endio()
>> occurs in the oops which means that the dm driver is involved. An important fix
>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use kzalloc
>> for all structs with embedded biosets/mempools"). Can you double check whether
>> that commit is present in your tree? If it is not present, please update to the
>> latest master and retest. If it is present, please report how to reproduce
>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>>
>> Thanks,
>>
>> Bart.
> 
> 
> Yes, the fix is present in the tree, which I have tested.
> 
> Steps to reproduce:
> 
> Step1: Clone and install avocado: git clone https://github.com/avocado-framework/avocado.git
> Step2: Clone https://github.com/avocado-framework-tests/avocado-misc-tests.git
> Test case is https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
> Step3: Command to run the test is: avocado run avocado-misc-tests/memory/memhotplug.py

Can you try with the below? Not a fully formed fix since I'd prefer
if the dm bioset copy stuff was changed instead, but worth a shot.


diff --git a/block/bio.c b/block/bio.c
index 595663e0281a..45bdee67d28b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1967,6 +1967,27 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+void bioset_move(struct bio_set *dst, struct bio_set *src)
+{
+   dst->bio_slab = src->bio_slab;
+   dst->front_pad = src->front_pad;
+   mempool_move(&dst->bio_pool, &src->bio_pool);
+   mempool_move(&dst->bvec_pool, &src->bvec_pool);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+   mempool_move(&dst->bio_integrity_pool, &src->bio_integrity_pool);
+   mempool_move(&dst->bvec_integrity_pool, &src->bvec_integrity_pool);
+#endif
+   BUG_ON(!bio_list_empty(&src->rescue_list));
+   BUG_ON(work_pending(&src->rescue_work));
+   spin_lock_init(&dst->rescue_lock);
+   bio_list_init(&dst->rescue_list);
+   INIT_WORK(&dst->rescue_work, bio_alloc_rescue);
+   dst->rescue_workqueue = src->rescue_workqueue;
+
+   memset(src, 0, sizeof(*src));
+}
+EXPORT_SYMBOL(bioset_move);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..87f636815baf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1982,10 +1982,8 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
               bioset_initialized(&md->bs) ||
               bioset_initialized(&md->io_bs));
 
-   md->bs = p->bs;
-   memset(&p->bs, 0, sizeof(p->bs));
-   md->io_bs = p->io_bs;
-   memset(&p->io_bs, 0, sizeof(p->io_bs));
+   bioset_move(&md->bs, &p->bs);
+   bioset_move(&md->io_bs, &p->io_bs);
 out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 810a8bee8f85..7581231dd0a3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -417,6 +417,7 @@ enum {
 extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags);
 extern void bioset_exit(struct bio_set *);
 extern int biovec_init_pool(mempool_t *pool, int pool_entries);
+extern void bioset_move(struct bio_set *dst, struct bio_set *src);
 
 extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
 extern void bio_put(struct bio *);
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..20818919180c 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -47,6 +47,7 @@ extern int mempool_resize(mempool_t *pool, int new_min_nr);
 extern void mempool_destroy(mempool_t *pool);
 extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
 extern void mempool_free(void *element, mempool_t *pool);
+extern void mempool_move(mempool_t *dst, mempool_t *src);
 
 /*
  * A mempool_alloc_t and mempool_free_t that get the memory from
diff --git a/mm/mempool.c b/mm/mempool.c
index b54f2c20e5e0..dd402653367b 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -181,6 +181,8 @@ int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
  mempool_free_t *free_fn, void *pool_data,
  gfp_t gfp_mask, int node_id)
 {
+   memset(pool, 0, sizeof(*pool));
+
spin_lock_init(&pool->lock);

Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code

2018-06-07 Thread Masami Hiramatsu

On Thu, 07 Jun 2018 17:07:00 +0530
"Naveen N. Rao"  wrote:

> Masami Hiramatsu wrote:
> > Don't call the ->break_handler() from the arm kprobes code,
>   ^^^ powerpc
> 
> > because it was only used by jprobes which got removed.
> > 
> > This also makes skip_singlestep() a static function since
> > only ftrace-kprobe.c is using this function.
> > 
> > Signed-off-by: Masami Hiramatsu 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: "Naveen N. Rao" 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > ---
> >  arch/powerpc/include/asm/kprobes.h   |   10 --
> >  arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-
> >  arch/powerpc/kernel/kprobes.c|   31 +++
> >  3 files changed, 14 insertions(+), 43 deletions(-)
> 
> With 2 small comments...

2 ? or 1 ?

> Acked-by: Naveen N. Rao 
> 
> - Naveen
> 
> > 
> > diff --git a/arch/powerpc/include/asm/kprobes.h 
> > b/arch/powerpc/include/asm/kprobes.h
> > index 674036db558b..785c464b6588 100644
> > --- a/arch/powerpc/include/asm/kprobes.h
> > +++ b/arch/powerpc/include/asm/kprobes.h
> > @@ -102,16 +102,6 @@ extern int kprobe_exceptions_notify(struct 
> > notifier_block *self,
> >  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> >  extern int kprobe_handler(struct pt_regs *regs);
> >  extern int kprobe_post_handler(struct pt_regs *regs);
> > -#ifdef CONFIG_KPROBES_ON_FTRACE
> > -extern int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
> > -  struct kprobe_ctlblk *kcb);
> > -#else
> > -static inline int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
> > - struct kprobe_ctlblk *kcb)
> > -{
> > -   return 0;
> > -}
> > -#endif
> >  #else
> >  static inline int kprobe_handler(struct pt_regs *regs) { return 0; }
> >  static inline int kprobe_post_handler(struct pt_regs *regs) { return 0; }
> > diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
> > b/arch/powerpc/kernel/kprobes-ftrace.c
> > index 1b316331c2d9..3869b0e5d5c7 100644
> > --- a/arch/powerpc/kernel/kprobes-ftrace.c
> > +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> > @@ -26,8 +26,8 @@
> >  #include 
> > 
> >  static nokprobe_inline
> > -int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
> > - struct kprobe_ctlblk *kcb, unsigned long orig_nip)
> > +int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
> > +   struct kprobe_ctlblk *kcb, unsigned long orig_nip)
> >  {
> > /*
> >  * Emulate singlestep (and also recover regs->nip)
> > @@ -44,16 +44,6 @@ int __skip_singlestep(struct kprobe *p, struct pt_regs 
> > *regs,
> > return 1;
> >  }
> > 
> > -int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
> > -   struct kprobe_ctlblk *kcb)
> > -{
> > -   if (kprobe_ftrace(p))
> > -   return __skip_singlestep(p, regs, kcb, 0);
> > -   else
> > -   return 0;
> > -}
> > -NOKPROBE_SYMBOL(skip_singlestep);
> > -
> >  /* Ftrace callback handler for kprobes */
> >  void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
> >struct ftrace_ops *ops, struct pt_regs *regs)
> > @@ -82,7 +72,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned 
> > long parent_nip,
> > __this_cpu_write(current_kprobe, p);
> > kcb->kprobe_status = KPROBE_HIT_ACTIVE;
> > if (!p->pre_handler || !p->pre_handler(p, regs))
> > -   __skip_singlestep(p, regs, kcb, orig_nip);
> > +   skip_singlestep(p, regs, kcb, orig_nip);
> 
> We can probably get rid of skip_singlestep() completely along with 
> orig_nip since instructions are always 4 bytes on powerpc. So, the 
> changes we do to nip should help to recover the value automatically.

Good point! Yes, skip_singlestep() is no longer exported, so we can just fold
it into kprobe_ftrace_handler() to simplify the code.

Thank you!
> 
> - Naveen
> 
> 


-- 
Masami Hiramatsu

Re: [RFC PATCH -tip v5 07/27] powerpc/kprobes: Remove jprobe powerpc implementation

2018-06-07 Thread Masami Hiramatsu

On Thu, 07 Jun 2018 17:01:23 +0530
"Naveen N. Rao"  wrote:

> Masami Hiramatsu wrote:
> > Remove arch dependent setjmp/longjmp functions
> > and unused fields in kprobe_ctlblk for jprobes
> > from arch/powerpc. This also reverts commits
> > related to the __is_active_jprobe() function.
> > 
> > Signed-off-by: Masami Hiramatsu 
> > 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: "Naveen N. Rao" 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > ---
> >  arch/powerpc/include/asm/kprobes.h |2 -
> >  arch/powerpc/kernel/kprobes-ftrace.c   |   15 ---
> >  arch/powerpc/kernel/kprobes.c  |   54 
> > 
> >  arch/powerpc/kernel/trace/ftrace_64_mprofile.S |   39 ++---
> >  4 files changed, 5 insertions(+), 105 deletions(-)
> 
> LGTM.
> 
> Acked-by: Naveen N. Rao 

Thanks Naveen!

> 
> - Naveen
> 
> 


-- 
Masami Hiramatsu

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread Michael Ellerman

vrbagal1  writes:
> On 2018-06-07 13:12, Bart Van Assche wrote:
>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>> > On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>> > > Observing Kernel oops and machine reboots while executing memory hotplug
>>> > > test case, on Power8 Baremetal machine.
>>> > >
>>> > > I see this was introduced somewhere between rc6 and 4.17.
>>> >
>>> > Please provide the exact versions (git commit IDs) of the kernel versions
>>> > you have tested.
>>> 
>>> Commit Id ---> 5037be168f
>> 
>> The reason I was asking for the commit ID is because I saw that clone_endio()
>> occurs in the oops, which means that the dm driver is involved. An important fix
>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use kzalloc
>> for all structs with embedded biosets/mempools"). Can you double check whether
>> that commit is present in your tree? If it is not present, please update to the
>> latest master and retest. If it is present, please report how to reproduce
>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>> 
>> Thanks,
>> 
>> Bart.
>
>
> Yes, the fix is present in the tree, which I have tested.
>
> Steps to reproduce:
>
> Step1: Clone and install avocado: git clone https://github.com/avocado-framework/avocado.git
> Step2: Clone https://github.com/avocado-framework-tests/avocado-misc-tests.git
>    Test case is https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
> Step3: Command to run the test is: avocado run avocado-misc-tests/memory/memhotplug.py

That gave me:

  $ avocado run avocado-misc-tests/memory/memhotplug.py
  avocado: command not found

Was I meant to install it?

I tried this which worked (I think):

  $ ./scripts/avocado run avocado-misc-tests/memory/memhotplug.py
  Failed to load plugin from module "avocado_runner_vm": ImportError('No module named libvirt',)
  JOB ID : 28deb5a455fb876a7e177deb2b46eab640f313c8
  JOB LOG: /home/michael/avocado/job-results/job-2018-06-07T22.27-28deb5a/job.log
   (1/4) avocado-misc-tests/memory/memhotplug.py:memstress.test_hotplug_loop: PASS (10.62 s)
   (2/4) avocado-misc-tests/memory/memhotplug.py:memstress.test_hotplug_toggle: PASS (245.15 s)
   (3/4) avocado-misc-tests/memory/memhotplug.py:memstress.test_dlpar_mem_hotplug: PASS (0.37 s)
   (4/4) avocado-misc-tests/memory/memhotplug.py:memstress.test_hotplug_per_numa_node: PASS (41.09 s)
  RESULTS: PASS 4 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB TIME   : 323.45 s
  JOB HTML   : /home/michael/avocado/job-results/job-2018-06-07T22.27-28deb5a/results.html


So what's different about your system?

What does 'lsblk -O' say on your system?

cheers

[GIT PULL] Please pull powerpc/linux.git powerpc-4.18-1 tag

2018-06-07 Thread Michael Ellerman

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull powerpc updates for 4.18:

The following changes since commit 6da6c0db5316275015e8cc2959f12a17584aeb64:

  Linux v4.17-rc3 (2018-04-29 14:17:42 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.18-1

for you to fetch changes up to ff5bc793e47b537bf3e904fada585e102c54dd8b:

  powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap (2018-06-06 18:50:53 +1000)

- --
powerpc updates for 4.18

Notable changes:

 - Support for split PMD page table lock on 64-bit Book3S (Power8/9).

 - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
   patching again.

 - Add support for patching barrier_nospec in copy_from_user() and syscall
   entry.

 - A couple of fixes for our data breakpoints on Book3S.

 - A series from Nick optimising TLB/mm handling with the Radix MMU.

 - Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.

 - Several series optimising various parts of the 32-bit code from Christophe
   Leroy.

 - Removal of support for two old machines, "SBC834xE" and "C2K"
   ("GEFanuc,C2K"), which is why the diffstat has so many deletions.

And many other small improvements & fixes.

There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.

Thanks to:
  Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
  Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
  Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
  Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
  Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
  Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
  Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
  Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
  Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
  Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
  Yisheng Xie, YueHaibing.

- --
Akshay Adiga (1):
  powerpc/powernv/cpuidle: Init all present cpus for deep states

Al Viro (6):
  powerpc/syscalls: Switch trivial cases to SYSCALL_DEFINE
  powerpc/syscalls: signal_{32, 64} - switch to SYSCALL_DEFINE
  powerpc/syscalls: switch rtas(2) to SYSCALL_DEFINE
  powerpc/syscalls: kill ppc32_select()
  powerpc/syscalls: timer_create can be handle by perfectly normal COMPAT_SYS_SPU
  powerpc/ptrace: Use copy_{from, to}_user() rather than open-coding

Alastair D'Silva (7):
  powerpc: Add TIDR CPU feature for POWER9
  powerpc: Use TIDR CPU feature to control TIDR allocation
  powerpc: use task_pid_nr() for TID allocation
  ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action
  ocxl: Expose the thread_id needed for wait on POWER9
  ocxl: Add an IOCTL so userspace knows what OCXL features are available
  ocxl: Document new OCXL IOCTLs

Alexey Kardashevskiy (2):
  powerpc/ioda: Use ibm, supported-tce-sizes for IOMMU page size mask
  powerpc/powernv/ioda2: Remove redundant free of TCE pages

Aneesh Kumar K.V (20):
  powerpc/kvm: Switch kvm pmd allocator to custom allocator
  powerpc/mm/book3s64: Move book3s64 code to pgtable-book3s64
  powerpc/mm: Use pmd_lockptr instead of opencoding it
  powerpc/mm: Rename pte fragment functions
  powerpc/mm/book3e/64: Remove unsupported 64Kpage size from 64bit booke
  powerpc/mm/nohash: Remove pte fragment dependency from nohash
  powerpc/mm/book3s64/4k: Switch 4k pagesize config to use pagetable fragment
  powerpc/book3s64/mm: Simplify the rcu callback for page table free
  powerpc/mm: Implement helpers for pagetable fragment support at PMD level
  powerpc/mm: Use page fragments for allocation page table at PMD level
  powerpc/book3s64: Enable split pmd ptlock.
  powerpc/livepatch: Fix build error with kprobes disabled.
  powerpc/mm: Fix kernel crash on page table free
  powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly
  powerpc/mm/radix: Move function from radix.h to pgtable-radix.c
  powerpc/mm: Change function prototype
  powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang
  powerpc/mm/hash: Add missing isync prior to

Re: [RFC PATCH -tip v5 24/27] bpf: error-inject: kprobes: Clear current_kprobe and enable preempt in kprobe

2018-06-07 Thread Naveen N. Rao


Masami Hiramatsu wrote:

Clear current_kprobe and enable preemption in kprobe
even if pre_handler returns !0.

This simplifies function override using kprobes.

Jprobe used to require that preemption stay disabled and
current_kprobe be kept until it returned to the original function
entry. For this reason kprobe_int3_handler() and similar
arch dependent kprobe handlers check the pre_handler result
and exit without enabling preemption if the result is !0.

After removing jprobes, kprobes no longer needs to
keep preemption disabled when the user handler returns !0.

But since the function override handlers in error-inject
and bpf also return !0 when they override a function,
they have to enable preemption and reset current_kprobe
by themselves to balance the preempt count.

That is a bad design that is very buggy. This fixes
such unbalanced preempt-count and current_kprobe setting
in kprobes, bpf and error-inject.

Note: for powerpc and x86, this removes all preempt_disable
from kprobe_ftrace_handler because ftrace callbacks are
called with preemption disabled.

Signed-off-by: Masami Hiramatsu 
Cc: Vineet Gupta 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Ralf Baechle 
Cc: James Hogan 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: "Naveen N. Rao" 
Cc: Josef Bacik 
Cc: Alexei Starovoitov 
Cc: x...@kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 Changes in v5:
  - Fix kprobe_ftrace_handler in arch/powerpc too.
---
 arch/arc/kernel/kprobes.c|5 +++--
 arch/arm/probes/kprobes/core.c   |   10 +-
 arch/arm64/kernel/probes/kprobes.c   |   10 +-
 arch/ia64/kernel/kprobes.c   |   13 -
 arch/mips/kernel/kprobes.c   |4 ++--
 arch/powerpc/kernel/kprobes-ftrace.c |   15 ++-
 arch/powerpc/kernel/kprobes.c|7 +--


For the powerpc bits:
Acked-by: Naveen N. Rao 

Thanks,
Naveen

Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code

2018-06-07 Thread Naveen N. Rao


Masami Hiramatsu wrote:

Don't call the ->break_handler() from the arm kprobes code,

^^^ powerpc


because it was only used by jprobes which got removed.

This also makes skip_singlestep() a static function since
only ftrace-kprobe.c is using this function.

Signed-off-by: Masami Hiramatsu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Naveen N. Rao" 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/kprobes.h   |   10 --
 arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-
 arch/powerpc/kernel/kprobes.c|   31 +++
 3 files changed, 14 insertions(+), 43 deletions(-)


With 2 small comments...
Acked-by: Naveen N. Rao 

- Naveen



diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
index 674036db558b..785c464b6588 100644
--- a/arch/powerpc/include/asm/kprobes.h
+++ b/arch/powerpc/include/asm/kprobes.h
@@ -102,16 +102,6 @@ extern int kprobe_exceptions_notify(struct notifier_block *self,
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_handler(struct pt_regs *regs);
 extern int kprobe_post_handler(struct pt_regs *regs);
-#ifdef CONFIG_KPROBES_ON_FTRACE
-extern int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-  struct kprobe_ctlblk *kcb);
-#else
-static inline int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
- struct kprobe_ctlblk *kcb)
-{
-   return 0;
-}
-#endif
 #else
 static inline int kprobe_handler(struct pt_regs *regs) { return 0; }
 static inline int kprobe_post_handler(struct pt_regs *regs) { return 0; }
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 1b316331c2d9..3869b0e5d5c7 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -26,8 +26,8 @@
 #include 

 static nokprobe_inline
-int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
- struct kprobe_ctlblk *kcb, unsigned long orig_nip)
+int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
+   struct kprobe_ctlblk *kcb, unsigned long orig_nip)
 {
/*
 * Emulate singlestep (and also recover regs->nip)
@@ -44,16 +44,6 @@ int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
return 1;
 }

-int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-   struct kprobe_ctlblk *kcb)
-{
-   if (kprobe_ftrace(p))
-   return __skip_singlestep(p, regs, kcb, 0);
-   else
-   return 0;
-}
-NOKPROBE_SYMBOL(skip_singlestep);
-
 /* Ftrace callback handler for kprobes */
 void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
   struct ftrace_ops *ops, struct pt_regs *regs)
@@ -82,7 +72,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
__this_cpu_write(current_kprobe, p);
kcb->kprobe_status = KPROBE_HIT_ACTIVE;
if (!p->pre_handler || !p->pre_handler(p, regs))
-   __skip_singlestep(p, regs, kcb, orig_nip);
+   skip_singlestep(p, regs, kcb, orig_nip);


We can probably get rid of skip_singlestep() completely along with 
orig_nip since instructions are always 4 bytes on powerpc. So, the 
changes we do to nip should help to recover the value automatically.
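
To illustrate the point about fixed-width instructions, here is a minimal
user-space sketch, not kernel code. The names fake_regs, INSN_SIZE,
handle_ftrace_probe and observe_only are made up for this example; only the
fact that powerpc instructions are always 4 bytes comes from the discussion
above. Since the handler rewinds nip by one instruction before the
pre-handler and advances it by one afterwards, no separate orig_nip has to
be carried around:

```c
#include <stddef.h>

/* Hypothetical stand-in for the nip field of struct pt_regs. */
struct fake_regs {
	unsigned long nip;
};

/* powerpc instructions are always 4 bytes wide. */
#define INSN_SIZE 4UL

typedef int (*pre_handler_t)(struct fake_regs *regs);

/*
 * Model of an ftrace-based probe hit: nip arrives pointing just past
 * the probed instruction. Returns 1 if the "single step" was emulated,
 * 0 if the pre-handler took over and left its own nip in place.
 */
static int handle_ftrace_probe(struct fake_regs *regs, pre_handler_t pre)
{
	/* The pre-handler expects nip to point at the probed instruction. */
	regs->nip -= INSN_SIZE;

	if (pre && pre(regs))
		return 0;	/* handler changed control flow; keep its nip */

	/* Emulate the single step: skipping one insn restores nip. */
	regs->nip += INSN_SIZE;
	return 1;
}

/* Example pre-handler that only observes the registers. */
static int observe_only(struct fake_regs *regs)
{
	(void)regs;
	return 0;
}
```

In this model a call such as handle_ftrace_probe(&regs, observe_only)
leaves regs.nip at the value it had on entry, which is the recovery
described above.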


- Naveen

Re: [RFC PATCH -tip v5 07/27] powerpc/kprobes: Remove jprobe powerpc implementation

2018-06-07 Thread Naveen N. Rao


Masami Hiramatsu wrote:

Remove arch dependent setjmp/longjmp functions
and unused fields in kprobe_ctlblk for jprobes
from arch/powerpc. This also reverts commits
related to the __is_active_jprobe() function.

Signed-off-by: Masami Hiramatsu 

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Naveen N. Rao" 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/kprobes.h |2 -
 arch/powerpc/kernel/kprobes-ftrace.c   |   15 ---
 arch/powerpc/kernel/kprobes.c  |   54 
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |   39 ++---
 4 files changed, 5 insertions(+), 105 deletions(-)


LGTM.

Acked-by: Naveen N. Rao 

- Naveen

Re: [v2 PATCH 0/5] powerpc/pseries: Machien check handler improvements.

2018-06-07 Thread Nicholas Piggin

On Thu, 07 Jun 2018 15:36:25 +0530
Mahesh J Salgaonkar  wrote:

> This patch series includes some improvements to the machine check handler
> for pseries. Patch 1 fixes an issue where the machine check handler crashes
> the kernel while accessing a vmalloc-ed buffer in NMI context.
> Patch 3 dumps the SLB contents on SLB MCE errors to improve debuggability.
> Patch 4 displays the MCE error details on the console.
> 
> Changes in V2:
> - patch 4: Display additional info (NIP and task info) in MCE error details.
> - patch 5: Fix endian bug while restoring r3 in MCE handler.
> 
> ---
> 
> Mahesh Salgaonkar (5):
>   powerpc/pseries: convert rtas_log_buf to linear allocation.
>   powerpc/pseries: Define MCE error event section.
>   powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>   powerpc/pseries: Display machine check error details.
>   powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.

These look good, should patch 5 be moved to patch 2 and the first 2
patches marked for stable?

Do you also plan to dump SLB contents for bare metal MCEs?

Thanks,
Nick

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread vrbagal1

On 2018-06-07 13:12, Bart Van Assche wrote:

On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:

On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
> > Observing Kernel oops and machine reboots while executing memory hotplug
> > test case, on Power8 Baremetal machine.
> >
> > I see this was introduced somewhere between rc6 and 4.17.
>
> Please provide the exact versions (git commit IDs) of the kernel versions
> you have tested.

Commit Id ---> 5037be168f

The reason I was asking for the commit ID is because I saw that clone_endio()
occurs in the oops, which means that the dm driver is involved. An important fix
for the dm driver went upstream recently, namely d37753540568 ("dm: Use kzalloc
for all structs with embedded biosets/mempools"). Can you double check whether
that commit is present in your tree? If it is not present, please update to the
latest master and retest. If it is present, please report how to reproduce
this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.

Thanks,

Bart.

Yes, the fix is present in the tree, which I have tested.

Steps to reproduce:

Step1: Clone and install avocado: git clone https://github.com/avocado-framework/avocado.git
Step2: Clone https://github.com/avocado-framework-tests/avocado-misc-tests.git
   Test case is https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
Step3: Command to run the test is: avocado run avocado-misc-tests/memory/memhotplug.py

Regards,
Venkat.

[PATCH 3/3] powerpc: Allow CPU selection of e300core variants

2018-06-07 Thread Christophe Leroy

GCC supports -mcpu=e300c2 and -mcpu=e300c3.

This patch makes it possible to tune the kernel for one of
those two types.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/platforms/Kconfig.cputype | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index ed7c6edec87e..d174acb41389 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -145,6 +145,14 @@ config 860_CPU
bool "8xx family"
depends on PPC_8xx
 
+config E300C2_CPU
+   bool "e300c2 (832x)"
+   depends on PPC_BOOK3S_32
+
+config E300C3_CPU
+   bool "e300c3 (831x)"
+   depends on PPC_BOOK3S_32
+
 endchoice
 
 config SPECIAL_CPU_BOOL
@@ -161,6 +169,8 @@ config SPECIAL_CPU
default "power8" if POWER8_CPU
default "power9" if POWER9_CPU
default "860" if 860_CPU
+   default "e300c2" if E300C2_CPU
+   default "e300c3" if E300C3_CPU
 
 config PPC_BOOK3S
def_bool y
-- 
2.13.3

[PATCH 2/3] powerpc: Allow CPU selection also on PPC32

2018-06-07 Thread Christophe Leroy

This patch extends to PPC32 the capability to select the exact
CPU type.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/platforms/Kconfig.cputype | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 71ef559cc474..ed7c6edec87e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -86,7 +86,6 @@ endchoice
 
 choice
prompt "CPU selection"
-   depends on PPC64
default GENERIC_CPU
help
  This will create a kernel which is optimised for a particular CPU.
@@ -96,13 +95,17 @@ choice
 
 config GENERIC_CPU
bool "Generic (POWER4 and above)"
-   depends on !CPU_LITTLE_ENDIAN
+   depends on PPC64 && !CPU_LITTLE_ENDIAN
 
 config GENERIC_CPU
bool "Generic (POWER8 and above)"
-   depends on CPU_LITTLE_ENDIAN
+   depends on PPC64 && CPU_LITTLE_ENDIAN
select ARCH_HAS_FAST_MULTIPLIER
 
+config GENERIC_CPU
+   bool "Generic 32 bits powerpc"
+   depends on PPC32 && !PPC_8xx
+
 config CELL_CPU
bool "Cell Broadband Engine"
depends on PPC_BOOK3S_64 && !CPU_LITTLE_ENDIAN
@@ -138,6 +141,10 @@ config E6500_CPU
bool "Freescale e6500"
depends on E500
 
+config 860_CPU
+   bool "8xx family"
+   depends on PPC_8xx
+
 endchoice
 
 config SPECIAL_CPU_BOOL
@@ -153,7 +160,7 @@ config SPECIAL_CPU
default "power7" if POWER7_CPU
default "power8" if POWER8_CPU
default "power9" if POWER9_CPU
-   default "860" if PPC_8xx
+   default "860" if 860_CPU
 
 config PPC_BOOK3S
def_bool y
-- 
2.13.3

[PATCH 1/3] powerpc: make CPU selection logic generic in Makefile

2018-06-07 Thread Christophe Leroy

Currently, when adding a new CPU for selection, both
Kconfig.cputype and Makefile have to be modified.

This patch moves into Kconfig.cputype the name of the CPU to be
passed to the -mcpu= argument.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Makefile  |  8 +---
 arch/powerpc/platforms/Kconfig.cputype | 15 +++
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 9704ab360d39..9a5642552abc 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -175,13 +175,7 @@ ifdef CONFIG_MPROFILE_KERNEL
 endif
 endif
 
-CFLAGS-$(CONFIG_CELL_CPU) += $(call cc-option,-mcpu=cell)
-CFLAGS-$(CONFIG_POWER5_CPU) += $(call cc-option,-mcpu=power5)
-CFLAGS-$(CONFIG_POWER6_CPU) += $(call cc-option,-mcpu=power6)
-CFLAGS-$(CONFIG_POWER7_CPU) += $(call cc-option,-mcpu=power7)
-CFLAGS-$(CONFIG_POWER8_CPU) += $(call cc-option,-mcpu=power8)
-CFLAGS-$(CONFIG_POWER9_CPU) += $(call cc-option,-mcpu=power9)
-CFLAGS-$(CONFIG_PPC_8xx) += $(call cc-option,-mcpu=860)
+CFLAGS-$(CONFIG_SPECIAL_CPU_BOOL) += $(call cc-option,-mcpu=$(CONFIG_SPECIAL_CPU))
 
 # Altivec option not allowed with e500mc64 in GCC.
 ifeq ($(CONFIG_ALTIVEC),y)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index cc892dcfa114..71ef559cc474 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -140,6 +140,21 @@ config E6500_CPU
 
 endchoice
 
+config SPECIAL_CPU_BOOL
+   bool
+   default !GENERIC_CPU
+
+config SPECIAL_CPU
+   string
+   depends on SPECIAL_CPU_BOOL
+   default "cell" if CELL_CPU
+   default "power5" if POWER5_CPU
+   default "power6" if POWER6_CPU
+   default "power7" if POWER7_CPU
+   default "power8" if POWER8_CPU
+   default "power9" if POWER9_CPU
+   default "860" if PPC_8xx
+
 config PPC_BOOK3S
def_bool y
depends on PPC_BOOK3S_32 || PPC_BOOK3S_64
-- 
2.13.3

[v2 PATCH 5/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

During a Machine Check interrupt on the pseries platform, register r3 points
to the RTAS extended event log passed by the hypervisor. Since the hypervisor
uses r3 to pass the pointer to the rtas log, it stores the original r3 value at
the start of the memory (first 8 bytes) pointed to by r3. Since the hypervisor
stores this info and the rtas log is in BE format, linux should make
sure to restore the r3 value in the correct endian format.

Without this patch, when the MCE handler returns after recovery to the code
that caused the MCE, it may end up with a Data SLB access interrupt for an
invalid address, followed by a kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971] Effective address: dca7
cpu 0xa: Vector: 380 (Data SLB Access) at [c000fc7775b0]
pc: c09694c0: vsnprintf+0x80/0x480
lr: c09698e0: vscnprintf+0x20/0x60
sp: c000fc777830
   msr: 82009033
   dar: a803a30c00d0
  current = 0xcbc9ef00
  paca= 0xc0001eca5c00   softe: 3   irq_happened: 0x01
pid   = 8860, comm = insmod
[c000fc7778b0] c09698e0 vscnprintf+0x20/0x60
[c000fc7778e0] c016b6c4 vprintk_emit+0xb4/0x4b0
[c000fc777960] c016d40c vprintk_func+0x5c/0xd0
[c000fc777980] c016cbb4 printk+0x38/0x4c
[c000fc7779a0] dca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c000fc777a40] c000d9c4 do_one_initcall+0x54/0x230
[c000fc777b00] c01b3b74 do_init_module+0x8c/0x248
[c000fc777b90] c01b2478 load_module+0x12b8/0x15b0
[c000fc777d30] c01b29e8 sys_finit_module+0xa8/0x110
[c000fc777e30] c000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 7fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.
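
As a rough user-space illustration of why the conversion matters, the
sketch below reads the saved r3 from a big-endian save area in a way that
gives the same result on little- and big-endian hosts. The helper names
load_be64 and restore_r3 and the byte-array model of the save area are
assumptions made for this example; in the kernel the same effect comes
from be64_to_cpu():

```c
#include <stdint.h>

/*
 * Read a 64-bit big-endian value from memory, independent of the
 * endianness of the CPU running this code.
 */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];	/* most significant byte first */
	return v;
}

/*
 * Hypothetical model of the save area: the first 8 bytes hold the
 * original r3 in big-endian format, the error log follows.
 */
static uint64_t restore_r3(const unsigned char *save_area)
{
	return load_be64(save_area);
}
```

A plain 64-bit load of the same bytes on a little-endian kernel would
return the bytes in reverse order, which is how an invalid address can
end up in r3.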

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/pseries/ras.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index afdf05444bc2..cd9446980092 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -360,7 +360,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
}
 
savep = __va(regs->gpr[3]);
-   regs->gpr[3] = savep[0];/* restore original r3 */
+   regs->gpr[3] = be64_to_cpu(savep[0]);   /* restore original r3 */
 
/* If it isn't an extended log we can use the per cpu 64bit buffer */
 h = (struct rtas_error_log *)&savep[1];

[v2 PATCH 4/5] powerpc/pseries: Display machine check error details.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

Extract the MCE error details from the RTAS extended log and display them
on the console.

With this patch you should now see mce logs like below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824] Effective address: dca7
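
The "Initiator" line above comes from a small bit-field extraction followed
by a bounds-checked table lookup. The stand-alone sketch below mirrors that
logic outside the kernel; the function names initiator_from_byte2 and
initiator_name are invented for this example, while the mask, shift and
table entries follow the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Initiator names, in the order used by the patch. */
static const char * const initiators[] = {
	"Unknown", "CPU", "PCI", "ISA", "Memory", "Power Mgmt",
};

/* The initiator lives in the upper nibble of byte 2 of the log header. */
static uint8_t initiator_from_byte2(uint8_t byte2)
{
	return (byte2 & 0xf0) >> 4;
}

/* Bounds-checked lookup: an unexpected value never indexes past the table. */
static const char *initiator_name(uint8_t byte2)
{
	uint8_t val = initiator_from_byte2(byte2);

	return val < ARRAY_SIZE(initiators) ? initiators[val] : "Unknown";
}
```

Falling back to "Unknown" for out-of-range values keeps the handler safe
even if firmware reports an initiator code the table does not know about.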

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h  |5 +
 arch/powerpc/platforms/pseries/ras.c |  128 +-
 2 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3f2fba7ef23b..8100a95c133a 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+   return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x) ((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 7470a216cd6b..afdf05444bc2 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
-static int mce_handle_error(struct rtas_error_log *errp)
+#define VAL_TO_STRING(ar, val) ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+   struct rtas_error_log *errp, int disposition)
+{
+   const char *level, *sevstr;
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   uint8_t error_type, err_sub_type;
+   uint8_t initiator = rtas_error_initiator(errp);
+   uint64_t addr;
+
+   static const char * const initiators[] = {
+   "Unknown",
+   "CPU",
+   "PCI",
+   "ISA",
+   "Memory",
+   "Power Mgmt",
+   };
+   static const char * const mc_err_types[] = {
+   "UE",
+   "SLB",
+   "ERAT",
+   "TLB",
+   "D-Cache",
+   "Unknown",
+   "I-Cache",
+   };
+   static const char * const mc_ue_types[] = {
+   "Indeterminate",
+   "Instruction fetch",
+   "Page table walk ifetch",
+   "Load/Store",
+   "Page table walk Load/Store",
+   };
+
+   /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+   static const char * const mc_slb_types[] = {
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+   static const char * const mc_soft_types[] = {
+   "Unknown",
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   return;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+   error_type = rtas_mc_error_type(mce_log);
+   err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+   switch (rtas_error_severity(errp)) {
+   case RTAS_SEVERITY_NO_ERROR:
+   level = KERN_INFO;
+   sevstr = "Harmless";
+   break;
+   case RTAS_SEVERITY_WARNING:
+   level = KERN_WARNING;
+   sevstr = "";
+   break;
+   case RTAS_SEVERITY_ERROR:
+   case RTAS_SEVERITY_ERROR_SYNC:
+   level = KERN_ERR;
+   sevstr = "Severe";
+   break;
+   case RTAS_SEVERITY_FATAL:
+   default:
+   level = KERN_ERR;
+   sevstr = "Fatal";
+   break;
+   }
+
+   printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+   disposition == RTAS_DISP_FULLY_RECOVERED ?
+   "Recovered" : "Not recovered");
+   if (user_mode(regs)) {
+   printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+   regs->nip, current->pid, current->comm);
+   } else {
+   printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+   (void *)regs->nip);
+   }
+   printk("%s  Initiator: %s\n", level,
+   VAL_TO_STRING(initiators, initiator));
+
+   switch (error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   printk("%s  Error type: %s [%s]\n", level,
+   VAL_TO_STRING(mc_err_types, error_type),
+

[v2 PATCH 3/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

If we get a machine check exception due to SLB errors, dump the
current SLB contents, which is very helpful in debugging the
root cause of SLB errors. On pseries, as of today, the system crashes on SLB
errors. These are soft errors and can be fixed by flushing the SLBs, so
the kernel can continue to function instead of crashing. This patch
fixes that as well.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[  822.711728] slb contents:
[  822.711730] 00 c800 400ea1b217000500
[  822.711731]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
[  822.711732] 01 d800 400d43642f000510
[  822.711733]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  822.711734] 09 f800 400a86c85f000500
[  822.711736]   1T  ESID=   f0  VSID=  a86c85f LLP:100
[  822.711737] 10 7f000800 400d1f26e3000d90
[  822.711738]   1T  ESID=   7f  VSID=  d1f26e3 LLP:110
[  822.711739] 11 1800 000e3615f520fd90
[  822.711740]  256M ESID=1  VSID=   e3615f520f LLP:110
[  822.711740] 12 d800 400d43642f000510
[  822.711741]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  822.711742] 13 d800 400d43642f000510
[  822.711743]   1T  ESID=   d0  VSID=  d43642f LLP:110
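
For readers who want to cross-check the second line of each entry against
the raw VSID word, here is a small user-space sketch of the decoding. The
constants are assumptions that mirror the kernel's SLB_VSID_B,
SLB_VSID_B_1T, SLB_VSID_LLP and SLB_VSID_SHIFT* definitions, and the struct
and function names are made up for this example:

```c
#include <stdint.h>

/* Assumed values, mirroring the kernel's hash MMU definitions. */
#define VSID_B_MASK	0xc000000000000000ULL	/* segment size field */
#define VSID_B_1T	0x4000000000000000ULL	/* 1T segment */
#define VSID_LLP	0x0000000000000130ULL	/* L and LP page size bits */
#define VSID_SHIFT_256M	12
#define VSID_SHIFT_1T	24

struct slb_vsid_fields {
	int is_1t;	/* 1 for a 1T segment, 0 for 256M */
	uint64_t vsid;
	uint64_t llp;
};

/* Split the VSID word of an SLB entry into the fields shown in the dump. */
static struct slb_vsid_fields decode_slb_vsid(uint64_t v)
{
	struct slb_vsid_fields f;

	f.is_1t = (v & VSID_B_1T) != 0;
	f.llp = v & VSID_LLP;
	f.vsid = (v & ~VSID_B_MASK) >>
		 (f.is_1t ? VSID_SHIFT_1T : VSID_SHIFT_256M);
	return f;
}
```

Under these assumptions, decoding 400ea1b217000500 from entry 00 gives a 1T
segment with VSID ea1b217 and LLP 100, and 000e3615f520fd90 from entry 11
gives a 256M segment with VSID e3615f520f and LLP 110, matching the log.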


Suggested-by: Aneesh Kumar K.V 
Suggested-by: Michael Ellerman 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 +
 arch/powerpc/mm/slb.c |   35 +
 arch/powerpc/platforms/pseries/ras.c  |   29 -
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..c0da68927235 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_dump_contents(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 66577cc66dc9..799aa117cec3 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_dump_contents(void)
+{
+   int i;
+   unsigned long e, v;
+   unsigned long llp;
+
+   pr_err("slb contents:\n");
+   for (i = 0; i < mmu_slb_size; i++) {
+   asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+   asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+
+   if (!e && !v)
+   continue;
+
+   pr_err("%02d %016lx %016lx", i, e, v);
+
+   if (!(e & SLB_ESID_V)) {
+   pr_err("\n");
+   continue;
+   }
+   llp = v & SLB_VSID_LLP;
+   if (v & SLB_VSID_B_1T) {
+   pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID_1T(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+   llp);
+   } else {
+   pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+   llp);
+   }
+   }
+}
+
 void slb_vmalloc_update(void)
 {
unsigned long vflags;
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..7470a216cd6b 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
+static int mce_handle_error(struct rtas_error_log *errp)
+{
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   int disposition = rtas_error_disposition(errp);
+   uint8_t error_type;
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   goto out;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+   error_type = rtas_mc_error_type(mce_log);
+
+   if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
+   (error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
+   slb_dump_contents();
+   slb_flush_and_rebolt();
+   disposition = RTAS_DISP_FULLY_RECOVERED;
+   }
+
+out:
+   return disposition;
+}
+
 /*
  * See if we can recover from a machine check exception.
  * This is only called on power4 (or above) and only via
@@ -434,7 +459,9 @@ int

[v2 PATCH 2/5] powerpc/pseries: Define MCE error event section.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

On pseries, the machine check error details are part of the RTAS
extended event log, passed in the machine check exception section. This
patch adds the definition of the RTAS MCE event section and related
helper functions.
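
As a rough illustration of the section layout defined below, the
following userspace sketch parses a raw MCE section from a byte buffer.
It assumes the packed layout from this patch (error_type at byte 8, the
sub-type byte at 9, a big-endian effective address at byte 16).
struct mce_info, mce_parse() and load_be64() are hypothetical names and
are not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Error type values taken from the patch. */
enum { MC_TYPE_UE = 0x00, MC_TYPE_SLB = 0x01,
       MC_TYPE_ERAT = 0x02, MC_TYPE_TLB = 0x04 };

/* Byte offsets implied by the packed struct pseries_mc_errorlog. */
#define MCE_OFF_ERROR_TYPE 8	/* after fru_id and proc_id */
#define MCE_OFF_ERR_BITS   9	/* ue_err_type / soft_err_type */
#define MCE_OFF_EADDR      16	/* after 6 reserved bytes */
#define MCE_MIN_LEN        24

/* Hypothetical decoded view of the section. */
struct mce_info {
	uint8_t type;
	uint8_t sub_type;
	bool has_ea;
	uint64_t ea;
};

/* Read a big-endian 64-bit value independent of host endianness. */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static bool mce_parse(const uint8_t *buf, size_t len, struct mce_info *out)
{
	uint8_t bits;

	if (len < MCE_MIN_LEN)
		return false;

	bits = buf[MCE_OFF_ERR_BITS];
	out->type = buf[MCE_OFF_ERROR_TYPE];
	out->sub_type = 0;
	out->has_ea = false;
	out->ea = 0;

	switch (out->type) {
	case MC_TYPE_UE:
		out->sub_type = bits & 0x07;	/* low three bits */
		out->has_ea = (bits & 0x40) != 0;
		break;
	case MC_TYPE_SLB:
	case MC_TYPE_ERAT:
	case MC_TYPE_TLB:
		out->sub_type = bits & 0x03;	/* low two bits */
		out->has_ea = (bits & 0x80) != 0;
		break;
	default:
		break;
	}
	if (out->has_ea)
		out->ea = load_be64(buf + MCE_OFF_EADDR);
	return true;
}
```

The sketch converts the address from big endian explicitly because the
field is declared __be64. Whether the in-kernel helper needs the same
conversion depends on how its callers use the value.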

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h |  104 +++
 1 file changed, 104 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..3f2fba7ef23b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -275,6 +275,7 @@ inline uint32_t rtas_ext_event_company_id(struct 
rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME (('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF  (('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG   (('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE   (('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +327,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT   3
 #define PSERIES_HP_ELOG_ID_DRC_IC  4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+   __be32  fru_id;
+   __be32  proc_id;
+   uint8_t error_type;
+   union {
+   struct {
+   uint8_t ue_err_type;
+   /* 
+* X1: Permanent or Transient UE.
+*  X   1: Effective address provided.
+*   X  1: Logical address provided.
+*XX2: Reserved.
+*  XXX 3: Type of UE error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   __be64  logical_address;
+   } ue_error;
+   struct {
+   uint8_t soft_err_type;
+   /* 
+* X1: Effective address provided.
+*  X   5: Reserved.
+*   XX 2: Type of SLB/ERAT/TLB error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   uint8_t reserved_2[8];
+   } soft_error;
+   } u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE   0x00
+#define PSERIES_MC_ERROR_TYPE_SLB  0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT 0x02
+#define PSERIES_MC_ERROR_TYPE_TLB  0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE  0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE  0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE  0
+#define PSERIES_MC_ERROR_UE_IFETCH 1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH 2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE 3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE 4
+
+#define PSERIES_MC_ERROR_SLB_PARITY0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT  1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE 2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY   1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT 2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE3
+
+#define PSERIES_MC_ERROR_TLB_PARITY1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT  2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE 3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog 
*mlog)
+{
+   return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+   const struct pseries_mc_errorlog *mlog)
+{
+   switch (mlog->error_type) {
+   casePSERIES_MC_ERROR_TYPE_UE:
+   return (mlog->u.ue_error.ue_err_type & 0x07);
+   casePSERIES_MC_ERROR_TYPE_SLB:
+   casePSERIES_MC_ERROR_TYPE_ERAT:
+   casePSERIES_MC_ERROR_TYPE_TLB:
+   return (mlog->u.soft_error.soft_err_type & 0x03);
+   default:
+   return 0;
+   }
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+   const struct pseries_mc_errorlog *mlog)
+{
+   uint64_t addr = 0;
+
+   switch (mlog->error_type) {
+   casePSERIES_MC_ERROR_TYPE_UE:
+   if (mlog->u.ue_error.ue_err_type & 0x40)
+   addr = mlog->u.ue_error.effective_address;
+   break;
+   casePSERIES_MC_ERROR_TYPE_SLB:
+   casePSERIES_MC_ERROR_TYPE_ERAT:
+   casePSERIES_MC_ERROR_TYPE_TLB:
+   if (mlog->u.soft_error.soft_err_type & 0x80)
+   addr =

[v2 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.

2018-06-07 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

rtas_log_buf is a buffer that holds RTAS event data communicated to the
kernel by the hypervisor. The buffer is then used to pass RTAS event
data to userspace through procfs. It is allocated from the vmalloc
(non-linear mapping) area.

On a machine check interrupt, register r3 points to the RTAS extended
event log passed by the hypervisor, which contains the MCE event. The
pseries machine check handler then logs this error into rtas_log_buf.
Because rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up
taking a page fault (vector 0x300) while accessing it. The machine check
interrupt handler runs in NMI context, where we cannot afford to take a
page fault: page faults are not honored in NMI context and cause a
kernel panic. This patch fixes the issue by allocating rtas_log_buf
using kmalloc.

Suggested-by: Aneesh Kumar K.V 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/rtasd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index f915db93cd42..3957d4ae2ba2 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
rtas_error_log_max = rtas_get_error_log_max();
rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
 
-   rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
+   rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, 
GFP_KERNEL);
if (!rtas_log_buf) {
printk(KERN_ERR "rtasd: no memory\n");
return -ENOMEM;

[v2 PATCH 0/5] powerpc/pseries: Machine check handler improvements.

2018-06-07 Thread Mahesh J Salgaonkar

This patch series includes some improvements to the machine check
handler for pseries. Patch 1 fixes an issue where the machine check
handler crashes the kernel while accessing a vmalloc-ed buffer in NMI
context. Patch 3 dumps the SLB contents on SLB MCE errors to improve
debuggability. Patch 4 displays the MCE error details on the console.

Change in V2:
- patch 4: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endian bug while restoring r3 in MCE handler.

---

Mahesh Salgaonkar (5):
  powerpc/pseries: convert rtas_log_buf to linear allocation.
  powerpc/pseries: Define MCE error event section.
  powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
  powerpc/pseries: Display machine check error details.
  powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
 arch/powerpc/include/asm/rtas.h   |  109 ++
 arch/powerpc/kernel/rtasd.c   |2 
 arch/powerpc/mm/slb.c |   35 ++
 arch/powerpc/platforms/pseries/ras.c  |  155 +
 5 files changed, 299 insertions(+), 3 deletions(-)


[PATCH v3] powerpc: Add support for function error injection

2018-06-07 Thread Naveen N. Rao

We implement regs_set_return_value() and override_function_with_return()
for this purpose.

On powerpc, a return from a function (blr) just branches to the location
contained in the link register. So, we can just update pt_regs rather
than redirecting execution to a dummy function that returns.
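
The effect of the two helpers can be sketched in userspace with a mock
register file. This is an illustration under stated assumptions, not
kernel code: struct mock_pt_regs and the mock_* functions are
hypothetical stand-ins that model only the three fields touched here.

```c
#include <stdint.h>

/* Hypothetical stand-in for the powerpc struct pt_regs. */
struct mock_pt_regs {
	unsigned long gpr[32];	/* r3 carries the return value */
	unsigned long nip;	/* next instruction pointer */
	unsigned long link;	/* link register: the caller's return address */
};

/* Mirrors regs_set_return_value(): place the injected value in r3. */
static void mock_set_return_value(struct mock_pt_regs *regs, unsigned long rc)
{
	regs->gpr[3] = rc;
}

/*
 * Mirrors override_function_with_return(): emulate 'blr' by resuming at
 * the address held in the link register, so the probed function body
 * never runs.
 */
static void mock_override_with_return(struct mock_pt_regs *regs)
{
	regs->nip = regs->link;
}
```

With the probe firing on function entry, setting r3 and then copying
link into nip makes the caller observe an immediate return with the
injected value.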

Signed-off-by: Naveen N. Rao 
---
The only change is to add a comment in override_function_with_return() 
to clarify that we don't need to worry about 32-bit userspace while 
emulating 'blr'.

- Naveen

 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/error-injection.h | 13 +
 arch/powerpc/include/asm/ptrace.h  |  5 +
 arch/powerpc/lib/Makefile  |  2 ++
 arch/powerpc/lib/error-inject.c| 16 
 5 files changed, 37 insertions(+)
 create mode 100644 arch/powerpc/include/asm/error-injection.h
 create mode 100644 arch/powerpc/lib/error-inject.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 076fe3094856..00dad3c759a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -187,6 +187,7 @@ config PPC
select HAVE_EBPF_JITif PPC64
select HAVE_EFFICIENT_UNALIGNED_ACCESS  if !(CPU_LITTLE_ENDIAN && 
POWER7_CPU)
select HAVE_FTRACE_MCOUNT_RECORD
+   select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
select HAVE_GCC_PLUGINS
diff --git a/arch/powerpc/include/asm/error-injection.h 
b/arch/powerpc/include/asm/error-injection.h
new file mode 100644
index ..740c3075bdf4
--- /dev/null
+++ b/arch/powerpc/include/asm/error-injection.h
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#ifndef _ASM_ERROR_INJECTION_H
+#define _ASM_ERROR_INJECTION_H
+
+#include 
+#include 
+#include 
+#include 
+
+void override_function_with_return(struct pt_regs *regs);
+
+#endif /* _ASM_ERROR_INJECTION_H */
diff --git a/arch/powerpc/include/asm/ptrace.h 
b/arch/powerpc/include/asm/ptrace.h
index e4923686e43a..c0705296c2f0 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -101,6 +101,11 @@ static inline long regs_return_value(struct pt_regs *regs)
return -regs->gpr[3];
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->gpr[3] = rc;
+}
+
 #ifdef __powerpc64__
 #define user_mode(regs) ((((regs)->msr) >> MSR_PR_LG) & 0x1)
 #else
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d0ca13ad8231..dd43c5a53396 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -14,6 +14,8 @@ obj-y += string.o alloc.o code-patching.o feature-fixups.o
 
 obj-$(CONFIG_PPC32)+= div64.o copy_32.o crtsavres.o
 
+obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+
 # See corresponding test in arch/powerpc/Makefile
 # 64-bit linker creates .sfpr on demand for final link (vmlinux),
 # so it is only needed for modules, and only for older linkers which
diff --git a/arch/powerpc/lib/error-inject.c b/arch/powerpc/lib/error-inject.c
new file mode 100644
index ..407b992fb02f
--- /dev/null
+++ b/arch/powerpc/lib/error-inject.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include 
+#include 
+#include 
+
+void override_function_with_return(struct pt_regs *regs)
+{
+   /*
+* Emulate 'blr'. 'regs' represents the state on entry of a predefined
+* function in the kernel/module, captured on a kprobe. We don't need
+* to worry about 32-bit userspace on a 64-bit kernel.
+*/
+   regs->nip = regs->link;
+}
+NOKPROBE_SYMBOL(override_function_with_return);
-- 
2.17.0

[v3, 10/10] dpaa_eth: add the get_ts_info interface for ethtool

2018-06-07 Thread Yangbo Lu

Added the get_ts_info interface for ethtool to check
the timestamping capability.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Removed ifdef for hw timestamp.
Changes for v3:
- None.
---
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   39 
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
index 2f933b6..3184c8f 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
@@ -32,6 +32,9 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
+#include 
+#include 
+#include 
 
 #include "dpaa_eth.h"
 #include "mac.h"
@@ -515,6 +518,41 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct 
ethtool_rxnfc *cmd)
return ret;
 }
 
+static int dpaa_get_ts_info(struct net_device *net_dev,
+   struct ethtool_ts_info *info)
+{
+   struct device *dev = net_dev->dev.parent;
+   struct device_node *mac_node = dev->of_node;
+   struct device_node *fman_node = NULL, *ptp_node = NULL;
+   struct platform_device *ptp_dev = NULL;
+   struct qoriq_ptp *ptp = NULL;
+
+   info->phc_index = -1;
+
+   fman_node = of_get_parent(mac_node);
+   if (fman_node)
+   ptp_node = of_parse_phandle(fman_node, "ptimer-handle", 0);
+
+   if (ptp_node)
+   ptp_dev = of_find_device_by_node(ptp_node);
+
+   if (ptp_dev)
+   ptp = platform_get_drvdata(ptp_dev);
+
+   if (ptp)
+   info->phc_index = ptp->phc_index;
+
+   info->so_timestamping = SOF_TIMESTAMPING_TX_HARDWARE |
+   SOF_TIMESTAMPING_RX_HARDWARE |
+   SOF_TIMESTAMPING_RAW_HARDWARE;
+   info->tx_types = (1 << HWTSTAMP_TX_OFF) |
+(1 << HWTSTAMP_TX_ON);
+   info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
+  (1 << HWTSTAMP_FILTER_ALL);
+
+   return 0;
+}
+
 const struct ethtool_ops dpaa_ethtool_ops = {
.get_drvinfo = dpaa_get_drvinfo,
.get_msglevel = dpaa_get_msglevel,
@@ -530,4 +568,5 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct 
ethtool_rxnfc *cmd)
.set_link_ksettings = dpaa_set_link_ksettings,
.get_rxnfc = dpaa_get_rxnfc,
.set_rxnfc = dpaa_set_rxnfc,
+   .get_ts_info = dpaa_get_ts_info,
 };
-- 
1.7.1

[v3, 09/10] dpaa_eth: add support for hardware timestamping

2018-06-07 Thread Yangbo Lu

This patch adds hardware timestamping support
for dpaa_eth. On Rx, timestamping is enabled for
all frames. On Tx, we only instruct the hardware
to timestamp the frames marked accordingly by the
stack.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Removed ifdef for timestamp code.
- Minor fixes for code style.
Changes for v3:
- Moved tstamp endianness conversion to fman API.
- Fixed fm.cmd endianness.
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |   88 ++--
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 +
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index fd43f98..6a1c58a 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -1168,7 +1168,7 @@ static int dpaa_eth_init_tx_port(struct fman_port *port, 
struct dpaa_fq *errq,
buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
buf_prefix_content.pass_prs_result = true;
buf_prefix_content.pass_hash_result = true;
-   buf_prefix_content.pass_time_stamp = false;
+   buf_prefix_content.pass_time_stamp = true;
buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
 
params.specific_params.non_rx_params.err_fqid = errq->fqid;
@@ -1210,7 +1210,7 @@ static int dpaa_eth_init_rx_port(struct fman_port *port, 
struct dpaa_bp **bps,
buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
buf_prefix_content.pass_prs_result = true;
buf_prefix_content.pass_hash_result = true;
-   buf_prefix_content.pass_time_stamp = false;
+   buf_prefix_content.pass_time_stamp = true;
buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
 
rx_p = &params.specific_params.rx_params;
@@ -1607,14 +1607,28 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv 
*priv)
 {
const enum dma_data_direction dma_dir = DMA_TO_DEVICE;
struct device *dev = priv->net_dev->dev.parent;
+   struct skb_shared_hwtstamps shhwtstamps;
dma_addr_t addr = qm_fd_addr(fd);
const struct qm_sg_entry *sgt;
struct sk_buff **skbh, *skb;
int nr_frags, i;
+   u64 ns;
 
skbh = (struct sk_buff **)phys_to_virt(addr);
skb = *skbh;
 
+   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
+   memset(&shhwtstamps, 0, sizeof(shhwtstamps));
+
+   if (!fman_port_get_tstamp(priv->mac_dev->port[TX], (void *)skbh,
+ &ns)) {
+   shhwtstamps.hwtstamp = ns_to_ktime(ns);
+   skb_tstamp_tx(skb, &shhwtstamps);
+   } else {
+   dev_warn(dev, "fman_port_get_tstamp failed!\n");
+   }
+   }
+
if (unlikely(qm_fd_get_format(fd) == qm_fd_sg)) {
nr_frags = skb_shinfo(skb)->nr_frags;
dma_unmap_single(dev, addr, qm_fd_get_offset(fd) +
@@ -2086,6 +2100,11 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct 
net_device *net_dev)
if (unlikely(err < 0))
goto skb_to_fd_failed;
 
+   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
+   fd.cmd |= cpu_to_be32(FM_FD_CMD_UPD);
+   skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+   }
+
if (likely(dpaa_xmit(priv, percpu_stats, queue_mapping, &fd) == 0))
return NETDEV_TX_OK;
 
@@ -2227,6 +2246,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
struct qman_fq *fq,
const struct qm_dqrr_entry *dq)
 {
+   struct skb_shared_hwtstamps *shhwtstamps;
struct rtnl_link_stats64 *percpu_stats;
struct dpaa_percpu_priv *percpu_priv;
const struct qm_fd *fd = &dq->fd;
@@ -2240,6 +2260,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
struct sk_buff *skb;
int *count_ptr;
void *vaddr;
+   u64 ns;
 
fd_status = be32_to_cpu(fd->status);
fd_format = qm_fd_get_format(fd);
@@ -2304,6 +2325,16 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
if (!skb)
return qman_cb_dqrr_consume;
 
+   if (priv->rx_tstamp) {
+   shhwtstamps = skb_hwtstamps(skb);
+   memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+
+   if (!fman_port_get_tstamp(priv->mac_dev->port[RX], vaddr, &ns))
+   shhwtstamps->hwtstamp = ns_to_ktime(ns);
+   else
+   dev_warn(net_dev->dev.parent, "fman_port_get_tstamp 
failed!\n");
+   }
+
skb->protocol = eth_type_trans(skb, net_dev);
 
if (net_dev->features & NETIF_F_RXHASH && priv->keygen_in_use &&
@@ -2523,11 +2554,58 @@ static int

[v3, 08/10] fsl/fman: define frame description command UPD

2018-06-07 Thread Yangbo Lu

Defined frame description command FM_FD_CMD_UPD for
prepended data updating.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 drivers/net/ethernet/freescale/fman/fman.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.h 
b/drivers/net/ethernet/freescale/fman/fman.h
index bfa02e0..935c317 100644
--- a/drivers/net/ethernet/freescale/fman/fman.h
+++ b/drivers/net/ethernet/freescale/fman/fman.h
@@ -41,6 +41,7 @@
 /* Frame queue Context Override */
 #define FM_FD_CMD_FCO   0x8000
 #define FM_FD_CMD_RPD   0x4000  /* Read Prepended Data */
+#define FM_FD_CMD_UPD  0x2000  /* Update Prepended Data */
 #define FM_FD_CMD_DTC   0x1000  /* Do L4 Checksum */
 
 /* TX-Port: Unsupported Format */
-- 
1.7.1

[v3, 07/10] fsl/fman_port: support getting timestamp

2018-06-07 Thread Yangbo Lu

This patch is to add fman_port_get_tstamp() interface
to get timestamp.
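
A minimal userspace sketch of the idea follows: the timestamp is read as
a big-endian 64-bit value at a configured offset inside the frame buffer
and returned in CPU byte order. SKETCH_ILLEGAL_BASE and
sketch_get_tstamp() are hypothetical names, and the sentinel value is an
assumption rather than the driver's definition.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel meaning "no timestamp offset configured". */
#define SKETCH_ILLEGAL_BASE 0xffffffffu

/*
 * Read the 64-bit big-endian timestamp stored at 'offset' bytes into
 * 'data'. Returns 0 on success or -EINVAL if no offset is configured.
 */
static int sketch_get_tstamp(uint32_t offset, const void *data,
			     uint64_t *tstamp)
{
	const uint8_t *p;
	uint64_t v = 0;

	if (offset == SKETCH_ILLEGAL_BASE)
		return -EINVAL;

	p = (const uint8_t *)data + offset;
	/* Assemble byte by byte so the result is host-endian. */
	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	*tstamp = v;
	return 0;
}
```

Doing the conversion inside the accessor, as v3 of this patch does,
means callers such as dpaa_eth can pass the value straight to
ns_to_ktime() without their own byte swapping.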

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- Moved endianness conversion from dpaa to fman API.
---
 drivers/net/ethernet/freescale/fman/fman_port.c |   12 
 drivers/net/ethernet/freescale/fman/fman_port.h |2 ++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c 
b/drivers/net/ethernet/freescale/fman/fman_port.c
index ce6e24c..dce860d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1731,6 +1731,18 @@ int fman_port_get_hash_result_offset(struct fman_port 
*port, u32 *offset)
 }
 EXPORT_SYMBOL(fman_port_get_hash_result_offset);
 
+int fman_port_get_tstamp(struct fman_port *port, const void *data, u64 *tstamp)
+{
+   if (port->buffer_offsets.time_stamp_offset == ILLEGAL_BASE)
+   return -EINVAL;
+
+   *tstamp = be64_to_cpu(*(u64 *)(data +
+   port->buffer_offsets.time_stamp_offset));
+
+   return 0;
+}
+EXPORT_SYMBOL(fman_port_get_tstamp);
+
 static int fman_port_probe(struct platform_device *of_dev)
 {
struct fman_port *port;
diff --git a/drivers/net/ethernet/freescale/fman/fman_port.h 
b/drivers/net/ethernet/freescale/fman/fman_port.h
index e86ca6a..9dbb69f 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.h
+++ b/drivers/net/ethernet/freescale/fman/fman_port.h
@@ -153,6 +153,8 @@ int fman_port_cfg_buf_prefix_content(struct fman_port *port,
 
 int fman_port_get_hash_result_offset(struct fman_port *port, u32 *offset);
 
+int fman_port_get_tstamp(struct fman_port *port, const void *data, u64 
*tstamp);
+
 struct fman_port *fman_port_bind(struct device *dev);
 
 #endif /* __FMAN_PORT_H */
-- 
1.7.1

[v3, 06/10] fsl/fman: add set_tstamp interface

2018-06-07 Thread Yangbo Lu

This patch is to add set_tstamp interface for memac,
dtsec, and 10GEC controllers to configure HW timestamping.
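
For the dTSEC case the change amounts to a read-modify-write of one
enable bit in each control register. The sketch below models just that
bit manipulation on plain integers. SKETCH_TS_ENABLE and
sketch_update_tstamp_bit() are hypothetical, and the bit position is a
placeholder rather than the real register layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder enable bit; the real mask comes from the driver header. */
#define SKETCH_TS_ENABLE (1u << 6)

/*
 * Return the register value with the timestamp-enable bit set or
 * cleared, leaving every other bit untouched.
 */
static uint32_t sketch_update_tstamp_bit(uint32_t reg, bool enable)
{
	if (enable)
		return reg | SKETCH_TS_ENABLE;
	return reg & ~SKETCH_TS_ENABLE;
}
```

In the driver the old value would come from ioread32be() and the result
would be written back with iowrite32be(), once for the Rx control
register and once for the Tx control register.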

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 drivers/net/ethernet/freescale/fman/fman_dtsec.c |   27 ++
 drivers/net/ethernet/freescale/fman/fman_dtsec.h |1 +
 drivers/net/ethernet/freescale/fman/fman_memac.c |5 
 drivers/net/ethernet/freescale/fman/fman_memac.h |1 +
 drivers/net/ethernet/freescale/fman/fman_tgec.c  |   21 +
 drivers/net/ethernet/freescale/fman/fman_tgec.h  |1 +
 drivers/net/ethernet/freescale/fman/mac.c|3 ++
 drivers/net/ethernet/freescale/fman/mac.h|1 +
 8 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.c 
b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
index 57b1e2b..1ca543a 100644
--- a/drivers/net/ethernet/freescale/fman/fman_dtsec.c
+++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
@@ -123,11 +123,13 @@
 #define DTSEC_ECNTRL_R100M 0x0008
 #define DTSEC_ECNTRL_QSGMIIM   0x0001
 
+#define TCTRL_TTSE 0x0040
 #define TCTRL_GTS  0x0020
 
 #define RCTRL_PAL_MASK 0x001f
 #define RCTRL_PAL_SHIFT16
 #define RCTRL_GHTX 0x0400
+#define RCTRL_RTSE 0x0040
 #define RCTRL_GRS  0x0020
 #define RCTRL_MPROM0x0008
 #define RCTRL_RSF  0x0004
@@ -1136,6 +1138,31 @@ int dtsec_set_allmulti(struct fman_mac *dtsec, bool 
enable)
return 0;
 }
 
+int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable)
+{
+   struct dtsec_regs __iomem *regs = dtsec->regs;
+   u32 rctrl, tctrl;
+
+   if (!is_init_done(dtsec->dtsec_drv_param))
+   return -EINVAL;
+
+   rctrl = ioread32be(&regs->rctrl);
+   tctrl = ioread32be(&regs->tctrl);
+
+   if (enable) {
+   rctrl |= RCTRL_RTSE;
+   tctrl |= TCTRL_TTSE;
+   } else {
+   rctrl &= ~RCTRL_RTSE;
+   tctrl &= ~TCTRL_TTSE;
+   }
+
+   iowrite32be(rctrl, &regs->rctrl);
+   iowrite32be(tctrl, &regs->tctrl);
+
+   return 0;
+}
+
 int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr)
 {
struct dtsec_regs __iomem *regs = dtsec->regs;
diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.h 
b/drivers/net/ethernet/freescale/fman/fman_dtsec.h
index 1a689ad..5149d96 100644
--- a/drivers/net/ethernet/freescale/fman/fman_dtsec.h
+++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.h
@@ -56,5 +56,6 @@ int dtsec_set_exception(struct fman_mac *dtsec,
 int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr);
 int dtsec_get_version(struct fman_mac *dtsec, u32 *mac_version);
 int dtsec_set_allmulti(struct fman_mac *dtsec, bool enable);
+int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable);
 
 #endif /* __DTSEC_H */
diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.c 
b/drivers/net/ethernet/freescale/fman/fman_memac.c
index 446a97b..bc6eb30 100644
--- a/drivers/net/ethernet/freescale/fman/fman_memac.c
+++ b/drivers/net/ethernet/freescale/fman/fman_memac.c
@@ -964,6 +964,11 @@ int memac_set_allmulti(struct fman_mac *memac, bool enable)
return 0;
 }
 
+int memac_set_tstamp(struct fman_mac *memac, bool enable)
+{
+   return 0; /* Always enabled. */
+}
+
 int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr)
 {
struct memac_regs __iomem *regs = memac->regs;
diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.h 
b/drivers/net/ethernet/freescale/fman/fman_memac.h
index b5a5033..b2c671e 100644
--- a/drivers/net/ethernet/freescale/fman/fman_memac.h
+++ b/drivers/net/ethernet/freescale/fman/fman_memac.h
@@ -58,5 +58,6 @@ int memac_set_exception(struct fman_mac *memac,
 int memac_add_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr);
 int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr);
 int memac_set_allmulti(struct fman_mac *memac, bool enable);
+int memac_set_tstamp(struct fman_mac *memac, bool enable);
 
 #endif /* __MEMAC_H */
diff --git a/drivers/net/ethernet/freescale/fman/fman_tgec.c 
b/drivers/net/ethernet/freescale/fman/fman_tgec.c
index 284735d..4070593 100644
--- a/drivers/net/ethernet/freescale/fman/fman_tgec.c
+++ b/drivers/net/ethernet/freescale/fman/fman_tgec.c
@@ -44,6 +44,7 @@
 #define TGEC_TX_IPG_LENGTH_MASK0x03ff
 
 /* Command and Configuration Register (COMMAND_CONFIG) */
+#define CMD_CFG_EN_TIMESTAMP   0x0010
 #define CMD_CFG_NO_LEN_CHK 0x0002
 #define CMD_CFG_PAUSE_IGNORE   0x0100
 #define CMF_CFG_CRC_FWD0x0040
@@ -588,6 +589,26 @@ int tgec_set_allmulti(struct fman_mac *tgec, bool enable)
return 0;
 }

[v3, 05/10] arm64: dts: fsl: move ptp timer out of fman

2018-06-07 Thread Yangbo Lu

This patch moves the ptp timer node out of fman.
Because the ptp timer will be probed by the ptp_qoriq
driver, it should be an independent device to avoid
conflicting memory mappings.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Fixed address-cells for ptp-timer.
Changes for v3:
- None.
---
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi 
b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
index 4dd0676..a56a408 100644
--- a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
+++ b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
@@ -11,13 +11,14 @@ fman0: fman@1a0 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0x0 0x0 0x1a0 0x10>;
-   reg = <0x0 0x1a0 0x0 0x10>;
+   ranges = <0x0 0x0 0x1a0 0xfe000>;
+   reg = <0x0 0x1a0 0x0 0xfe000>;
interrupts = ,
 ;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x800 0x10>;
+   ptimer-handle = <&ptp_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -73,9 +74,10 @@ fman0: fman@1a0 {
compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
reg = <0xfd000 0x1000>;
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@1afe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x0 0x1afe000 0x0 0x1000>;
+   interrupts = ;
 };
-- 
1.7.1

[v3, 04/10] powerpc/mpc85xx: move ptp timer out of fman in dts

2018-06-07 Thread Yangbo Lu

This patch moves the ptp timer node out of fman.
Because the ptp timer will be probed by the ptp_qoriq
driver, it should be an independent device to avoid
conflicting memory mappings.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi   |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi   |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi  |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi  |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi |   14 --
 5 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
index abd01d4..6b124f7 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@40 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0 0x40 0x10>;
-   reg = <0x40 0x10>;
+   ranges = <0 0x40 0xfe000>;
+   reg = <0x40 0xfe000>;
interrupts = <96 2 0 0>, <16 2 1 1>;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x40 0xc>;
+   ptimer-handle = <&ptp_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman0: fman@40 {
reg = <0x87000 0x1000>;
status = "disabled";
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@4fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x4fe000 0x1000>;
+   interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
index debea75..b80aaf5 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@50 {
#size-cells = <1>;
cell-index = <1>;
compatible = "fsl,fman";
-   ranges = <0 0x50 0x10>;
-   reg = <0x50 0x10>;
+   ranges = <0 0x50 0xfe000>;
+   reg = <0x50 0xfe000>;
interrupts = <97 2 0 0>, <16 2 1 0>;
clocks = < 3 1>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x60 0xc>;
+   ptimer-handle = <&ptp_timer1>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman1: fman@50 {
reg = <0x87000 0x1000>;
status = "disabled";
};
+};
 
-   ptp_timer1: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer1: ptp-timer@5fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x5fe000 0x1000>;
+   interrupts = <97 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
index 3a20e0d..d3720fd 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@40 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0 0x40 0x10>;
-   reg = <0x40 0x10>;
+   ranges = <0 0x40 0xfe000>;
+   reg = <0x40 0xfe000>;
interrupts = <96 2 0 0>, <16 2 1 1>;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x800 0x10>;
+   ptimer-handle = <&ptp_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -98,9 +99,10 @@ fman0: fman@40 {
compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
reg = <0xfd000 0x1000>;
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@4fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x4fe000 0x1000>;
+   interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
index 82750ac..ae34c20 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@50 {
#size-cells = <1>;
cell-index = <1>;
compatible = "fsl,fman";
-   ranges = <0 0x50 0x10>;
-   reg = <0x50 0x10>;
+   ranges = <0 0x50 0xfe000>;
+   reg = <0x50 0xfe000>;
interrupts = <97 2 0 0>, <16 2 1 0>;
clocks = < 3 1>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x820 0x10>;
+   ptimer-handle = <_timer1>;
 
muram@0 {
compatible =

[v3, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support

2018-06-07 Thread Yangbo Lu

This patch adds a bindings description for the DPAA
FMan 1588 timer and removes its description from the
fsl-fman dt-bindings document.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +---
 .../devicetree/bindings/ptp/ptp-qoriq.txt  |   15 +--
 2 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt 
b/Documentation/devicetree/bindings/net/fsl-fman.txt
index df873d1..74603dd 100644
--- a/Documentation/devicetree/bindings/net/fsl-fman.txt
+++ b/Documentation/devicetree/bindings/net/fsl-fman.txt
@@ -356,30 +356,7 @@ ethernet@e {
 
 FMan IEEE 1588 Node
 
-DESCRIPTION
-
-The FMan interface to support IEEE 1588
-
-
-PROPERTIES
-
-- compatible
-   Usage: required
-   Value type: 
-   Definition: A standard property.
-   Must include "fsl,fman-ptp-timer".
-
-- reg
-   Usage: required
-   Value type: 
-   Definition: A standard property.
-
-EXAMPLE
-
-ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-};
+Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
 
 =
 FMan MDIO Node
diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt 
b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
index 0f569d8..c5d0e79 100644
--- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -2,7 +2,8 @@
 
 General Properties:
 
-  - compatible   Should be "fsl,etsec-ptp"
+  - compatible   Should be "fsl,etsec-ptp" for eTSEC
+ Should be "fsl,fman-ptp-timer" for DPAA FMan
   - reg  Offset and length of the register set for the device
   - interrupts   There should be at least two interrupts. Some devices
  have as many as four PTP related interrupts.
@@ -43,14 +44,22 @@ Clock Properties:
   value, which will be directly written in those bits, that is why,
   according to reference manual, the next clock sources can be used:
 
+  For eTSEC,
   <0> - external high precision timer reference clock (TSEC_TMR_CLK
 input is used for this purpose);
   <1> - eTSEC system clock;
   <2> - eTSEC1 transmit clock;
   <3> - RTC clock input.
 
-  When this attribute is not used, eTSEC system clock will serve as
-  IEEE 1588 timer reference clock.
+  For DPAA FMan,
+  <0> - external high precision timer reference clock (TMR_1588_CLK)
+  <1> - MAC system clock (1/2 FMan clock)
+  <2> - reserved
+  <3> - RTC clock oscillator
+
+  When this attribute is not used, the IEEE 1588 timer reference clock
+  will use the eTSEC system clock (for Gianfar) or the MAC system
+  clock (for DPAA).
 
 Example:
 
-- 
1.7.1

[v3, 02/10] ptp: support DPAA FMan 1588 timer in ptp_qoriq

2018-06-07 Thread Yangbo Lu

This patch adds support for the DPAA (Data Path Acceleration
Architecture) 1588 timer by adding the "fsl,fman-ptp-timer"
compatible, sharing the interrupt with FMan, adding the
FSL_DPAA_ETH dependency, and fixing up the register offsets.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 drivers/ptp/Kconfig   |2 +-
 drivers/ptp/ptp_qoriq.c   |  104 ++---
 include/linux/fsl/ptp_qoriq.h |   38 ---
 3 files changed, 98 insertions(+), 46 deletions(-)

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 474c988..d137c48 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -43,7 +43,7 @@ config PTP_1588_CLOCK_DTE
 
 config PTP_1588_CLOCK_QORIQ
tristate "Freescale QorIQ 1588 timer as PTP clock"
-   depends on GIANFAR
+   depends on GIANFAR || FSL_DPAA_ETH
depends on PTP_1588_CLOCK
default y
help
diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 1468a16..c4e3545 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -39,11 +39,12 @@
 /* Caller must hold qoriq_ptp->lock. */
 static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u64 ns;
u32 lo, hi;
 
-   lo = qoriq_read(_ptp->regs->tmr_cnt_l);
-   hi = qoriq_read(_ptp->regs->tmr_cnt_h);
+   lo = qoriq_read(>ctrl_regs->tmr_cnt_l);
+   hi = qoriq_read(>ctrl_regs->tmr_cnt_h);
ns = ((u64) hi) << 32;
ns |= lo;
return ns;
@@ -52,16 +53,18 @@ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 /* Caller must hold qoriq_ptp->lock. */
 static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u32 hi = ns >> 32;
u32 lo = ns & 0x;
 
-   qoriq_write(_ptp->regs->tmr_cnt_l, lo);
-   qoriq_write(_ptp->regs->tmr_cnt_h, hi);
+   qoriq_write(>ctrl_regs->tmr_cnt_l, lo);
+   qoriq_write(>ctrl_regs->tmr_cnt_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_alarm(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u64 ns;
u32 lo, hi;
 
@@ -70,16 +73,18 @@ static void set_alarm(struct qoriq_ptp *qoriq_ptp)
ns -= qoriq_ptp->tclk_period;
hi = ns >> 32;
lo = ns & 0x;
-   qoriq_write(_ptp->regs->tmr_alarm1_l, lo);
-   qoriq_write(_ptp->regs->tmr_alarm1_h, hi);
+   qoriq_write(>alarm_regs->tmr_alarm1_l, lo);
+   qoriq_write(>alarm_regs->tmr_alarm1_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
+
set_alarm(qoriq_ptp);
-   qoriq_write(_ptp->regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-   qoriq_write(_ptp->regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
+   qoriq_write(>fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
+   qoriq_write(>fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
 }
 
 /*
@@ -89,16 +94,17 @@ static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 static irqreturn_t isr(int irq, void *priv)
 {
struct qoriq_ptp *qoriq_ptp = priv;
+   struct qoriq_ptp_registers *regs = _ptp->regs;
struct ptp_clock_event event;
u64 ns;
u32 ack = 0, lo, hi, mask, val;
 
-   val = qoriq_read(_ptp->regs->tmr_tevent);
+   val = qoriq_read(>ctrl_regs->tmr_tevent);
 
if (val & ETS1) {
ack |= ETS1;
-   hi = qoriq_read(_ptp->regs->tmr_etts1_h);
-   lo = qoriq_read(_ptp->regs->tmr_etts1_l);
+   hi = qoriq_read(>etts_regs->tmr_etts1_h);
+   lo = qoriq_read(>etts_regs->tmr_etts1_l);
event.type = PTP_CLOCK_EXTTS;
event.index = 0;
event.timestamp = ((u64) hi) << 32;
@@ -108,8 +114,8 @@ static irqreturn_t isr(int irq, void *priv)
 
if (val & ETS2) {
ack |= ETS2;
-   hi = qoriq_read(_ptp->regs->tmr_etts2_h);
-   lo = qoriq_read(_ptp->regs->tmr_etts2_l);
+   hi = qoriq_read(>etts_regs->tmr_etts2_h);
+   lo = qoriq_read(>etts_regs->tmr_etts2_l);
event.type = PTP_CLOCK_EXTTS;
event.index = 1;
event.timestamp = ((u64) hi) << 32;
@@ -130,16 +136,16 @@ static irqreturn_t isr(int irq, void *priv)
hi = ns >> 32;
lo = ns & 0x;
spin_lock(_ptp->lock);
-   qoriq_write(_ptp->regs->tmr_alarm2_l, lo);
-   qoriq_write(_ptp->regs->tmr_alarm2_h, hi);
+   qoriq_write(>alarm_regs->tmr_alarm2_l, lo);
+   qoriq_write(>alarm_regs->tmr_alarm2_h, hi);
spin_unlock(_ptp->lock);
qoriq_ptp->alarm_value = ns;

[v3, 01/10] fsl/fman: share the event interrupt

2018-06-07 Thread Yangbo Lu

This patch makes the fman event interrupt shared because
the 1588 timer driver will also use this interrupt.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
Changes for v3:
- None.
---
 drivers/net/ethernet/freescale/fman/fman.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index 9530405..c415ac6 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -2801,7 +2801,8 @@ static irqreturn_t fman_irq(int irq, void *handle)
of_node_put(muram_node);
of_node_put(fm_node);
 
-   err = devm_request_irq(_dev->dev, irq, fman_irq, 0, "fman", fman);
+   err = devm_request_irq(_dev->dev, irq, fman_irq, IRQF_SHARED,
+  "fman", fman);
if (err < 0) {
dev_err(_dev->dev, "%s: irq %d allocation failed (error = 
%d)\n",
__func__, irq, err);
-- 
1.7.1

[v3, 00/10] Support DPAA PTP clock and timestamping

2018-06-07 Thread Yangbo Lu

This patchset adds support for the DPAA FMAN PTP clock and HW timestamping.
It has been verified on both ARM and PPC platforms.
- Patches #1 to #5 add support for the DPAA FMAN 1588 timer in the
  ptp_qoriq driver.
- Patches #6 to #10 add HW timestamping support to the
  DPAA ethernet driver.

Yangbo Lu (10):
  fsl/fman: share the event interrupt
  ptp: support DPAA FMan 1588 timer in ptp_qoriq
  dt-binding: ptp_qoriq: add DPAA FMan support
  powerpc/mpc85xx: move ptp timer out of fman in dts
  arm64: dts: fsl: move ptp timer out of fman
  fsl/fman: add set_tstamp interface
  fsl/fman_port: support getting timestamp
  fsl/fman: define frame description command UPD
  dpaa_eth: add support for hardware timestamping
  dpaa_eth: add the get_ts_info interface for ethtool

 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +-
 .../devicetree/bindings/ptp/ptp-qoriq.txt  |   15 +++-
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi|   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi|   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi  |   14 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |   88 -
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 +
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   39 
 drivers/net/ethernet/freescale/fman/fman.c |3 +-
 drivers/net/ethernet/freescale/fman/fman.h |1 +
 drivers/net/ethernet/freescale/fman/fman_dtsec.c   |   27 +
 drivers/net/ethernet/freescale/fman/fman_dtsec.h   |1 +
 drivers/net/ethernet/freescale/fman/fman_memac.c   |5 +
 drivers/net/ethernet/freescale/fman/fman_memac.h   |1 +
 drivers/net/ethernet/freescale/fman/fman_port.c|   12 +++
 drivers/net/ethernet/freescale/fman/fman_port.h|2 +
 drivers/net/ethernet/freescale/fman/fman_tgec.c|   21 
 drivers/net/ethernet/freescale/fman/fman_tgec.h|1 +
 drivers/net/ethernet/freescale/fman/mac.c  |3 +
 drivers/net/ethernet/freescale/fman/mac.h  |1 +
 drivers/ptp/Kconfig|2 +-
 drivers/ptp/ptp_qoriq.c|  104 ---
 include/linux/fsl/ptp_qoriq.h  |   38 ++--
 26 files changed, 361 insertions(+), 115 deletions(-)

RE: [v2, 09/10] dpaa_eth: add support for hardware timestamping

2018-06-07 Thread Y.b. Lu

Hi Madalin,

> -Original Message-
> From: Madalin-cristian Bucur
> Sent: Thursday, June 7, 2018 4:24 PM
> To: Y.b. Lu ; net...@vger.kernel.org; Richard Cochran
> ; Rob Herring ; Shawn
> Guo ; David S . Miller 
> Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Y.b. Lu
> 
> Subject: RE: [v2, 09/10] dpaa_eth: add support for hardware timestamping
> 
> > -Original Message-
> > From: Yangbo Lu [mailto:yangbo...@nxp.com]
> > Sent: Thursday, June 7, 2018 6:23 AM
> > Subject: [v2, 09/10] dpaa_eth: add support for hardware timestamping
> >
> > This patch is to add hardware timestamping support for dpaa_eth. On
> > Rx, timestamping is enabled for all frames. On Tx, we only instruct
> > the hardware to timestamp the frames marked accordingly by the stack.
> >
> > Signed-off-by: Yangbo Lu 
> > ---
> > Changes for v2:
> > - Removed ifdef for timestamp code.
> > - Minor fixes for code style.
> > ---
> >  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |  101
> > ++-
> >  drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 +
> >  2 files changed, 99 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> > b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> > index fd43f98..bd589ac 100644
> > --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> > +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> > @@ -1168,7 +1168,7 @@ static int dpaa_eth_init_tx_port(struct
> > fman_port *port, struct dpaa_fq *errq,
> > buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
> > buf_prefix_content.pass_prs_result = true;
> > buf_prefix_content.pass_hash_result = true;
> > -   buf_prefix_content.pass_time_stamp = false;
> > +   buf_prefix_content.pass_time_stamp = true;
> > buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
> >
> > params.specific_params.non_rx_params.err_fqid = errq->fqid; @@
> > -1210,7 +1210,7 @@ static int dpaa_eth_init_rx_port(struct fman_port
> > *port, struct dpaa_bp **bps,
> > buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
> > buf_prefix_content.pass_prs_result = true;
> > buf_prefix_content.pass_hash_result = true;
> > -   buf_prefix_content.pass_time_stamp = false;
> > +   buf_prefix_content.pass_time_stamp = true;
> > buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
> >
> > rx_p = _params.rx_params;
> > @@ -1592,6 +1592,16 @@ static int dpaa_eth_refill_bpools(struct
> > dpaa_priv
> > *priv)
> > return 0;
> >  }
> >
> > +static int dpaa_get_tstamp_ns(struct net_device *net_dev, u64 *ns,
> > + struct fman_port *port, const void *data) {
> > +   if (!fman_port_get_tstamp_field(port, data, ns)) {
> > +   be64_to_cpus(ns);
> 
> Please move this endianness conversion in the fman API.

[Y.b. Lu] Ok. Will move to fman API in next version.

> 
> > +   return 0;
> > +   }
> > +   return -EINVAL;
> > +}
> > +
> >  /* Cleanup function for outgoing frame descriptors that were built on
> > Tx path,
> >   * either contiguous frames or scatter/gather ones.
> >   * Skb freeing is not handled here.
> > @@ -1607,14 +1617,29 @@ static int dpaa_eth_refill_bpools(struct
> > dpaa_priv
> > *priv)
> >  {
> > const enum dma_data_direction dma_dir = DMA_TO_DEVICE;
> > struct device *dev = priv->net_dev->dev.parent;
> > +   struct skb_shared_hwtstamps shhwtstamps;
> > dma_addr_t addr = qm_fd_addr(fd);
> > const struct qm_sg_entry *sgt;
> > struct sk_buff **skbh, *skb;
> > int nr_frags, i;
> > +   u64 ns;
> >
> > skbh = (struct sk_buff **)phys_to_virt(addr);
> > skb = *skbh;
> >
> > +   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags &
> > SKBTX_HW_TSTAMP) {
> > +   memset(, 0, sizeof(shhwtstamps));
> > +
> > +   if (!dpaa_get_tstamp_ns(priv->net_dev, ,
> > +   priv->mac_dev->port[TX],
> > +   (void *)skbh)) {
> > +   shhwtstamps.hwtstamp = ns_to_ktime(ns);
> > +   skb_tstamp_tx(skb, );
> > +   } else {
> > +   dev_warn(dev, "dpaa_get_tstamp_ns failed!\n");
> > +   }
> > +   }
> > +
> > if (unlikely(qm_fd_get_format(fd) == qm_fd_sg)) {
> > nr_frags = skb_shinfo(skb)->nr_frags;
> > dma_unmap_single(dev, addr, qm_fd_get_offset(fd) + @@ -2086,6
> > +2111,11 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct
> > net_device *net_dev)
> > if (unlikely(err < 0))
> > goto skb_to_fd_failed;
> >
> > +   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags &
> > SKBTX_HW_TSTAMP) {
> > +   fd.cmd |= FM_FD_CMD_UPD;
> 
> The fd.cmd field is big endian, please use this:
> 
> + fd.cmd |= cpu_to_be32(FM_FD_CMD_UPD);
> 

[Y.b. Lu] Thanks a lot for pointing out this issue. This fixes the TX
timestamp issue on the ARM platform.
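
The endianness point raised in the review can be illustrated outside the
kernel. The following is a minimal user-space sketch, not the driver code:
example_cpu_to_be32() is a hypothetical stand-in for the kernel's
cpu_to_be32(), and EXAMPLE_FLAG is an illustrative value rather than the real
FM_FD_CMD_UPD definition. It shows that a flag OR-ed into a big-endian field
lands in the byte the hardware reads first only if the constant is converted
beforehand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative flag value; an assumption, not the driver's FM_FD_CMD_UPD. */
#define EXAMPLE_FLAG 0x20000000u

/* Hypothetical user-space stand-in for the kernel's cpu_to_be32(). */
static uint32_t example_cpu_to_be32(uint32_t v)
{
	const uint32_t one = 1;
	uint8_t first;

	memcpy(&first, &one, 1);
	if (first == 0)		/* big-endian host: already in wire order */
		return v;
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Return the byte at the lowest address, i.e. what the device reads first. */
static uint8_t example_first_byte(uint32_t field)
{
	uint8_t b[sizeof(field)];

	memcpy(b, &field, sizeof(b));
	return b[0];
}

/* Set the flag in a field that is stored big-endian in memory. */
static uint32_t example_set_flag_be(uint32_t cmd_be)
{
	return cmd_be | example_cpu_to_be32(EXAMPLE_FLAG);
}

/* The flag byte ends up first in memory regardless of host endianness. */
static void example_check(void)
{
	assert(example_first_byte(example_set_flag_be(0)) == 0x20);
}
```

On a little-endian CPU, OR-ing EXAMPLE_FLAG in without the conversion would
set the byte at the highest address instead. That would match the report
above that the change fixed TX timestamping on ARM, assuming that platform
runs little-endian.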

[RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test

2018-06-07 Thread Alexey Kardashevskiy

The test function takes a page struct pointer which neither of its two
callers uses in any other way, so simplify it and just pass a physical
address instead.

This should cause no behavioral change now, but later we may start
supporting host addresses for memory devices which are not backed
by page structs.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2c4a048 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct 
iommu_table *tbl,
decrement_locked_vm(mm, cb >> PAGE_SHIFT);
 }
 
-static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
+   struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
/*
 * Check that the TCE table granularity is not bigger than the size of
 * a page we just found. Otherwise the hardware can get access to
@@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
enum dma_data_direction direction)
 {
long i, ret = 0;
-   struct page *page;
unsigned long hpa;
enum dma_data_direction dirtmp;
 
@@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
if (ret)
break;
 
-   page = pfn_to_page(hpa >> PAGE_SHIFT);
-   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+   if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
ret = -EPERM;
break;
}
@@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container 
*container,
enum dma_data_direction direction)
 {
long i, ret = 0;
-   struct page *page;
unsigned long hpa;
enum dma_data_direction dirtmp;
 
@@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container 
*container,
if (ret)
break;
 
-   page = pfn_to_page(hpa >> PAGE_SHIFT);
-   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+   if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
ret = -EPERM;
break;
}
-- 
2.11.0

[RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alexey Kardashevskiy

Here is an RFC of some patches adding pass-through support
for the NVIDIA V100 GPU found in some POWER9 boxes.

The example P9 system has 6 GPUs, each accompanied by 2 bridges
representing the hardware links (aka NVLink2):

 4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
 4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)

^^ the number in the first column is the IOMMU group ID.

Each bridge represents an additional hardware interface called "NVLink2";
it is not a PCI link but a separate bus. The design is inherited from the
original NVLink on POWER8.

The new feature of the V100 is 16GB of cache coherent memory on the GPU board.
This memory is presented to the host via the device tree and remains offline
until the NVIDIA driver loads and trains NVLink2 (via the config space of the
bridges above), and the nvidia-persistenced daemon then onlines it.
The memory remains online as long as nvidia-persistenced is running; when
the daemon stops, it offlines the memory.

The number of GPUs suggests passing them through to a guest. However,
in order to do so we cannot use the NVIDIA driver, so we have a host with
a 128GB window (bigger than or equal to the actual GPU RAM size) in system
memory with no page structs backing this window, and we cannot touch this
memory before the NVIDIA driver configures it in a host or a guest, as
otherwise an HMI (hardware management interrupt?) occurs.

On the example system the GPU RAM windows are located at:
0x0400  
0x0420  
0x0440  
0x2400  
0x2420  
0x2440  

So the complications are:

1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
to VFIO-to-userspace or guest-to-host-physical translations till
the driver trains it (i.e. nvidia-persistenced has started), otherwise
prefetching happens and HMI occurs; I am trying to get this changed
somehow;

2. since it appears as normal cache coherent memory, it will be used
for DMA, which means it has to be pinned and mapped in the host. Having
no page structs makes it different from the usual case: we only need to
translate user addresses to host physical addresses and map the GPU RAM;
pinning is not required.

This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
register this memory as a KVM memory slot and present memory nodes to
the guest. Unless NVIDIA provides a userspace driver, this is of no use
for things like DPDK.


There is another problem which the series does not address but which is
worth mentioning: although it is not strictly necessary to map GPU RAM to
the guest exactly where it is in the host (I tested this to some extent),
we still might want to represent the memory at the same offset as on the
host, which increases the size of a TCE table needed to cover such a huge
window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
I am addressing this in a separate patchset by allocating indirect TCE
levels on demand and using 16MB IOMMU pages in the guest as we can now
back emulated pages with the smaller hardware ones.
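
The table-size arithmetic above can be sketched as a small helper. This is
a minimal user-space illustration, assuming a flat single-level table with
one 8-byte TCE per IOMMU page; the function name and the window size used in
the checks are hypothetical and are not the addresses of the example system.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of one TCE in bytes (the "*8" in the expression above). */
#define EXAMPLE_TCE_ENTRY_SIZE 8u

/*
 * Bytes needed for a flat TCE table covering [0, window_end) with IOMMU
 * pages of (1 << page_shift) bytes. Hypothetical helper, not a kernel API.
 */
static uint64_t example_tce_table_bytes(uint64_t window_end,
					unsigned int page_shift)
{
	return (window_end >> page_shift) * EXAMPLE_TCE_ENTRY_SIZE;
}

static void example_tce_check(void)
{
	/* A 1TiB window with 64KiB pages needs 128MiB of table. */
	assert(example_tce_table_bytes(1ULL << 40, 16) == (128ULL << 20));
	/* The same window with 16MiB pages needs only 512KiB. */
	assert(example_tce_table_bytes(1ULL << 40, 24) == (512ULL << 10));
}
```

Under these assumptions the flat table shrinks linearly with the IOMMU page
size, while allocating indirect levels on demand avoids paying for the parts
of a sparse window that are never mapped.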


This is an RFC. Please comment. Thanks.



Alexey Kardashevskiy (5):
  vfio/spapr_tce: Simplify page contained test
  powerpc/iommu_context: Change referencing in API
  powerpc/iommu: Do not pin memory of a memory device
  vfio_pci: Allow mapping extra regions
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

 drivers/vfio/pci/Makefile  |   1 +
 arch/powerpc/include/asm/mmu_context.h |   5 +-
 drivers/vfio/pci/vfio_pci_private.h|  11 ++
 include/uapi/linux/vfio.h  |   3 +
 arch/powerpc/kernel/iommu.c|   8 +-
 arch/powerpc/mm/mmu_context_iommu.c|  70 +---
 drivers/vfio/pci/vfio_pci.c|  19 +++-
 drivers/vfio/pci/vfio_pci_nvlink2.c| 190 +
 drivers/vfio/vfio_iommu_spapr_tce.c|  42 +---
 drivers/vfio/pci/Kconfig   |   4 +
 10 files changed, 319 insertions(+), 34 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.11.0
