Re: [PATCH v5 09/11] hugetlb: Introduce generic version of huge_ptep_set_wrprotect

2018-08-02 Thread Alex Ghiti

Ok, I tried every defconfig available:

- for nohash/32, I found that I could use mpc885_ads_defconfig and I
activated HUGETLBFS. I removed the definition of huge_ptep_set_wrprotect
from nohash/32/pgtable.h, added an #error in include/asm-generic/hugetlb.h
right before the generic definition of huge_ptep_set_wrprotect (sketched
below the list), and fell onto it at compile time:
=> I'm pretty confident then that removing the definition of
huge_ptep_set_wrprotect does not break anything in this case.

- regarding book3s/32, I did not find any defconfig with
CONFIG_PPC_BOOK3S_32 and CONFIG_PPC32 that allows enabling huge page
support (ie CONFIG_SYS_SUPPORTS_HUGETLBFS)
=> Do you have a defconfig that would allow me to try the same as above?
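
For reference, roughly the check I used - a minimal sketch from memory,
not the exact hunk - in include/asm-generic/hugetlb.h:

	#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
	#error "generic huge_ptep_set_wrprotect compiled in"
	static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
						   unsigned long addr, pte_t *ptep)
	{
		ptep_set_wrprotect(mm, addr, ptep);
	}
	#endif

Hitting the #error proves the configuration really does fall through to
the generic implementation once the arch-private copy is removed.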

Thanks,

Alex


On 07/31/2018 11:17 AM, Alexandre Ghiti wrote:


On 07/31/2018 12:06 PM, Michael Ellerman wrote:

Alexandre Ghiti  writes:


arm, ia64, mips, sh, x86 architectures use the same version
of huge_ptep_set_wrprotect, so move this generic implementation into
asm-generic/hugetlb.h.
Note: powerpc duplicates the same version as the above architectures,
for book3s/32 and nohash/32, but the modification was not straightforward
and hence has not been done.

Do you remember what the problem was there?

It looks like you should just be able to drop them like the others. I
assume there's some header spaghetti that causes problems though?


Yes, the header spaghetti frightened me a bit. Maybe I should have
tried harder: I can try to remove them and find the right defconfigs
to compile both, to begin with. And to guarantee the functionality is
preserved, can I use the libhugetlbfs testsuite with qemu?


Alex



cheers



Signed-off-by: Alexandre Ghiti 
Reviewed-by: Mike Kravetz 
---
 arch/arm/include/asm/hugetlb-3level.h        | 6 ------
 arch/arm64/include/asm/hugetlb.h             | 1 +
 arch/ia64/include/asm/hugetlb.h              | 6 ------
 arch/mips/include/asm/hugetlb.h              | 6 ------
 arch/parisc/include/asm/hugetlb.h            | 1 +
 arch/powerpc/include/asm/book3s/32/pgtable.h | 2 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 +
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 ++
 arch/powerpc/include/asm/nohash/64/pgtable.h | 1 +
 arch/sh/include/asm/hugetlb.h                | 6 ------
 arch/sparc/include/asm/hugetlb.h             | 1 +
 arch/x86/include/asm/hugetlb.h               | 6 ------
 include/asm-generic/hugetlb.h                | 8 ++++++++
 13 files changed, 17 insertions(+), 30 deletions(-)

diff --git a/arch/arm/include/asm/hugetlb-3level.h b/arch/arm/include/asm/hugetlb-3level.h
index b897541520ef..8247cd6a2ac6 100644
--- a/arch/arm/include/asm/hugetlb-3level.h
+++ b/arch/arm/include/asm/hugetlb-3level.h
@@ -37,12 +37,6 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
 	return retval;
 }
 
-static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
-					   unsigned long addr, pte_t *ptep)
-{
-	ptep_set_wrprotect(mm, addr, ptep);
-}
-
 static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 					     unsigned long addr, pte_t *ptep,
 					     pte_t pte, int dirty)
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 3e7f6e69b28d..f4f69ae5466e 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -48,6 +48,7 @@ extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 				     unsigned long addr, pte_t *ptep);
+#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
 extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 				    unsigned long addr, pte_t *ptep);
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
diff --git a/arch/ia64/include/asm/hugetlb.h b/arch/ia64/include/asm/hugetlb.h
index cbe296271030..49d1f7949f3a 100644
--- a/arch/ia64/include/asm/hugetlb.h
+++ b/arch/ia64/include/asm/hugetlb.h
@@ -27,12 +27,6 @@ static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 {
 }
 
-static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
-					   unsigned long addr, pte_t *ptep)
-{
-	ptep_set_wrprotect(mm, addr, ptep);
-}
-
 static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 					     unsigned long addr, pte_t *ptep,
 					     pte_t pte, int dirty)
diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
index 6ff2531cfb1d..3dcf5debf8c4 100644
--- a/arch/mips/include/asm/hugetlb.h
+++ b/arch/mips/include/asm/hugetlb.h
@@ -63,12 +63,6 @@ static inline int huge_pte_none(pte_t pte)
 	return !val || (val == (unsigned long)invalid_pte_table);
 }
 
-static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
-					   unsigned long addr, pte_t *ptep)
-{
-	ptep_set_wrprotect(mm, addr, ptep);
-}
-
  

Re: [PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage

2018-08-02 Thread Alistair Popple
> There may be a long-term way to fix this at a larger scale, but for now
> resolve the immediate problem by gating our call to
> test_and_set_bit_lock() with one to test_bit(), which is obviously
> implemented without using a store.

I am less sure of this now but am continuing to investigate. However this patch
looks good.

Acked-by: Alistair Popple 

> Signed-off-by: Reza Arbab 
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8cdf91f..c773465 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
>   int i;
>  
>   for (i = 0; i < npu->mmio_atsd_count; i++) {
> -		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> -			return i;
> +		if (!test_bit(i, &npu->mmio_atsd_usage))
> +			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> +				return i;
>   }
>  
>   return -ENOSPC;
> 




[PATCH 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-02 Thread Nicholas Piggin
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
specific HV idle code to the powernv platform code. Generic book3s
assembly stubs are kept in common code and used only to save and
restore the stack frame and non-volatile GPRs just before going to
idle and just after waking up, which can return into C code.

Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
way this stuff will cope with non-trivial new CPU implementation
details, firmware changes, etc., without becoming unmaintainable.

This is not a strict translation to C code, there are some
differences.

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them all explicitly.

- The optimisation where EC=ESL=0 idle modes did not have to save
  GPRs or change MSR is restored, because it's now simple to do.
  State loss modes that did not actually lose GPRs can use this
  optimization too.

- KVM secondary entry code is now more of a call/return style
  rather than spaghetti. nap_state_lost is not required because
  KVM always returns via the NVGPR restoring path.

This seems pretty solid, so needs more review and testing now. The
KVM changes are pretty significant and complicated. POWER7 needs to
be tested too.

Open question:
- Why does P9 restore some of the PMU SPRs (but not others), and
  P8 only zeroes them?

Since RFC v1:
- Now tested and working with POWER9 hash and radix.
- KVM support added. This took a bit of work to untangle and might
  still have some issues, but POWER9 seems to work including hash on
  radix with dependent threads mode.
- This snowballed a bit because of KVM and other details making it
  not feasible to leave POWER7/8 code alone. That's only half done
  at the moment.
- So far this trades about 800 lines of asm for 500 of C. With POWER7/8
  support done it might be another hundred or so lines of C.

Since RFC v2:
- Fixed deep state SLB reloading
- Now tested and working with POWER8.
- Accounted for most feedback.

Since RFC v3:
- Rebased to powerpc merge + idle state bugfix
- Split SLB flush/restore code out and shared with MCE code (pseries
  MCE patches can also use).
- More testing on POWER8 including KVM with secondaries.
- Performance testing looks good. EC=ESL=0 is about 5% faster, other
  stop states look a little faster too.
- Adjusted SPR saving to handle POWER7, haven't tested it.
---
 arch/powerpc/include/asm/cpuidle.h   |  19 +-
 arch/powerpc/include/asm/paca.h  |  40 +-
 arch/powerpc/include/asm/processor.h |   9 +-
 arch/powerpc/include/asm/reg.h   |   7 +-
 arch/powerpc/kernel/asm-offsets.c|  18 -
 arch/powerpc/kernel/dt_cpu_ftrs.c|  21 +-
 arch/powerpc/kernel/exceptions-64s.S |  17 +-
 arch/powerpc/kernel/idle_book3s.S| 998 ++-
 arch/powerpc/kernel/setup-common.c   |   4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  90 +-
 arch/powerpc/platforms/powernv/idle.c| 816 ++
 arch/powerpc/platforms/powernv/subcore.c |   2 +-
 arch/powerpc/xmon/xmon.c |  25 +-
 13 files changed, 859 insertions(+), 1207 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..9844b3ded187 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -27,10 +27,11 @@
  * the THREAD_WINKLE_BITS are set, which indicate which threads have not
  * yet woken from the winkle state.
  */
-#define PNV_CORE_IDLE_LOCK_BIT			0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT		28
+#define PNV_CORE_IDLE_LOCK_BIT			(1ULL << NR_PNV_CORE_IDLE_LOCK_BIT)
 
+#define PNV_CORE_IDLE_WINKLE_COUNT_SHIFT	16
 #define PNV_CORE_IDLE_WINKLE_COUNT		0x00010000
-#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000
 #define PNV_CORE_IDLE_WINKLE_COUNT_BITS		0x000F0000
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT	8
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS	0x0000FF00
@@ -68,16 +69,6 @@
 #define ERR_DEEP_STATE_ESL_MISMATCH	-2
 
 #ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
-   u64 pid;
-   u64 ldbar;
-   u64 fscr;
-   u64 hfscr;
-   u64 mmcr1;
-   u64 mmcr2;
-   u64 mmcra;
-};
 
 #define PNV_IDLE_NAME_LEN	16
 struct pnv_idle_states_t {
@@ -92,10 +83,6 @@ struct pnv_idle_states_t {
 
 extern struct pnv_idle_states_t *pnv_idle_states;
 extern int nr_pnv_idle_states;
-extern u32 pnv_fastsleep_workaround_at_entry[];
-extern u32 pnv_fastsleep_workaround_at_exit[];
-
-extern u64 pnv_first_deep_stop_state;
 
 unsigned long pnv_cpu_offline(unsigned int cpu);
 int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4e9cede5a7e7..d2cee5ebaaa1 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -168,7 

[PATCH 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-02 Thread Nicholas Piggin
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will be used by pseries machine
check and idle code.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
 arch/powerpc/kernel/mce_power.c   | 21 ++
 arch/powerpc/mm/slb.c | 38 +++
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2f74bdc805e0..d4e398185b3a 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -497,6 +497,9 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_all_realmode(void);
+extern void __slb_restore_bolted_realmode(void);
+extern void slb_restore_bolted_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index d6756af6ec78..50f7b9817246 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 #ifdef CONFIG_PPC_BOOK3S_64
 static void flush_and_reload_slb(void)
 {
-   struct slb_shadow *slb;
-   unsigned long i, n;
-
/* Invalidate all SLBs */
-   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+   slb_flush_all_realmode();
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
/*
@@ -76,22 +73,10 @@ static void flush_and_reload_slb(void)
if (get_paca()->kvm_hstate.in_guest)
return;
 #endif
-
-   /* For host kernel, reload the SLBs from shadow SLB buffer. */
-   slb = get_slb_shadow();
-   if (!slb)
+   if (early_radix_enabled())
return;
 
-   n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
-
-   /* Load up the SLB entries from shadow SLB */
-   for (i = 0; i < n; i++) {
-   unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
-   unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
-
-   rb = (rb & ~0xFFFul) | i;
-   asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
-   }
+   slb_restore_bolted_realmode();
 }
 #endif
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..136db8652577 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,6 +90,44 @@ static inline void create_shadowed_slbe(unsigned long ea, int ssize,
 : "memory" );
 }
 
+/*
+ * Insert bolted entries into SLB (which may not be empty).
+ */
+void __slb_restore_bolted_realmode(void)
+{
+   struct slb_shadow *p = get_slb_shadow();
+   enum slb_index index;
+
+	/* No isync needed because realmode. */
+   for (index = 0; index < SLB_NUM_BOLTED; index++) {
+   asm volatile("slbmte  %0,%1" :
+: "r" (be64_to_cpu(p->save_area[index].vsid)),
+  "r" (be64_to_cpu(p->save_area[index].esid)));
+   }
+}
+
+/*
+ * Insert bolted entries into an empty SLB.
+ * This is not the same as rebolt because the bolted segments
+ * (e.g., kstack) are not changed (rebolted).
+ */
+void slb_restore_bolted_realmode(void)
+{
+   __slb_restore_bolted_realmode();
+   get_paca()->slb_cache_ptr = 0;
+}
+
+/*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+void slb_flush_all_realmode(void)
+{
+   /*
+* This flushes all SLB entries including 0, so it must be realmode.
+*/
+   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+}
+
 static void __slb_flush_and_rebolt(void)
 {
/* If you change this make sure you change SLB_NUM_BOLTED
-- 
2.17.0



[PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage

2018-08-02 Thread Reza Arbab
We've encountered a performance issue when multiple processors stress
{get,put}_mmio_atsd_reg(). These functions contend for mmio_atsd_usage,
an unsigned long used as a bitmask.

The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
and clear_bit_unlock(). As implemented, both of these will require a
(successful) stwcx to that same cache line.

What we end up with is thread A, attempting to unlock, being slowed by
other threads repeatedly attempting to lock. A's stwcx instructions fail
and retry because the memory reservation is lost every time a different
thread beats it to the punch.

There may be a long-term way to fix this at a larger scale, but for now
resolve the immediate problem by gating our call to
test_and_set_bit_lock() with one to test_bit(), which is obviously
implemented without using a store.
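
For reference, this is the classic "test and test-and-set" idea: poll
with a plain load and only attempt the atomic read-modify-write once the
bit looks free. A condensed sketch of the pattern, equivalent to the
nested form in the diff below:

	/* The plain read issues no stwcx., so spinning waiters no longer
	 * destroy the reservation of a thread doing clear_bit_unlock().
	 */
	if (!test_bit(i, &npu->mmio_atsd_usage) &&
	    !test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
		return i;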

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8cdf91f..c773465 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
int i;
 
for (i = 0; i < npu->mmio_atsd_count; i++) {
-		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
-			return i;
+		if (!test_bit(i, &npu->mmio_atsd_usage))
+			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+				return i;
}
 
return -ENOSPC;
-- 
1.8.3.1



Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()

2018-08-02 Thread Joe Perches
On Thu, 2018-08-02 at 21:42 -0300, Murilo Opsfelder Araujo wrote:
> > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
[]
> > > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> > >   pr_cont("\n");
> > >   }
> > > +void show_user_instructions(struct pt_regs *regs)
> > > +{
> > > + int i;
> > > + const char *prefix = KERN_INFO "%s[%d]: code: ";
> > > + unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > > + sizeof(int));
> > > +
> > > + printk(prefix, current->comm, current->pid);
> > 
> > Why not use pr_info() and remove KERN_INFO from *prefix ?
> 
> Because it doesn't compile:
> 
>   arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
> pr_info(prefix, current->comm, current->pid);
> ^
>   ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
>#define pr_fmt(fmt) fmt
>  ^

What's being suggested is using:

pr_info("%s[%d]: code: ", current->comm, current->pid);



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Jason Wang




On 2018-08-03 04:55, Michael S. Tsirkin wrote:

On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:

This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virtio devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.

First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks to do the required (as discussed previously
in the above mentioned thread how host is allowed to access only parts of
the guest GPA range) bounce buffering into the shared memory for all I/O
scatter gather buffers to be consumed on the host side.

Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.

Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.


The test is rather simple: just run pktgen (pktgen_sample01_simple.sh)
in the guest and measure PPS on the tap device on the host.


Thanks



I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.



Anshuman Khandual (4):
   virtio: Define virtio_direct_dma_ops structure
   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
   virtio: Force virtio core to use DMA API callbacks for all virtio devices
   virtio: Add platform specific DMA API translation for virtio devices

  arch/powerpc/include/asm/dma-mapping.h |  6 +++
  arch/powerpc/platforms/pseries/iommu.c |  6 +++
  drivers/virtio/virtio.c| 72 ++
  drivers/virtio/virtio_pci_common.h |  3 ++
  drivers/virtio/virtio_ring.c   | 65 +-
  5 files changed, 89 insertions(+), 63 deletions(-)

--
2.9.3




Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()

2018-08-02 Thread Murilo Opsfelder Araujo
Hi, Christophe.

On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>
>
> On 01/08/2018 at 23:33, Murilo Opsfelder Araujo wrote:
> > show_user_instructions() is a slightly modified version of
> > show_instructions() that allows userspace instruction dump.
> >
> > This will be useful within show_signal_msg() to dump userspace
> > instructions of the faulty location.
> >
> > Here is a sample of what show_user_instructions() outputs:
> >
> >pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 
> > f821ffc1 7c3f0b78 3d22fffe
> >pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <9949> 
> > 3920 7d234b78 383f0040
> >
> > The current->comm and current->pid printed can serve as a glue that
> > links the instructions dump to its originator, allowing messages to be
> > interleaved in the logs.
> >
> > Signed-off-by: Murilo Opsfelder Araujo 
> > ---
> >   arch/powerpc/include/asm/stacktrace.h | 13 +
> >   arch/powerpc/kernel/process.c | 40 +++
> >   2 files changed, 53 insertions(+)
> >   create mode 100644 arch/powerpc/include/asm/stacktrace.h
> >
> > diff --git a/arch/powerpc/include/asm/stacktrace.h b/arch/powerpc/include/asm/stacktrace.h
> > new file mode 100644
> > index ..6149b53b3bc8
> > --- /dev/null
> > +++ b/arch/powerpc/include/asm/stacktrace.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Stack trace functions.
> > + *
> > + * Copyright 2018, Murilo Opsfelder Araujo, IBM Corporation.
> > + */
> > +
> > +#ifndef _ASM_POWERPC_STACKTRACE_H
> > +#define _ASM_POWERPC_STACKTRACE_H
> > +
> > +void show_user_instructions(struct pt_regs *regs);
> > +
> > +#endif /* _ASM_POWERPC_STACKTRACE_H */
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index e9533b4d2f08..364645ac732c 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> > pr_cont("\n");
> >   }
> > +void show_user_instructions(struct pt_regs *regs)
> > +{
> > +   int i;
> > +   const char *prefix = KERN_INFO "%s[%d]: code: ";
> > +   unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > +   sizeof(int));
> > +
> > +   printk(prefix, current->comm, current->pid);
>
> Why not use pr_info() and remove KERN_INFO from *prefix ?

Because it doesn't compile:

  arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
pr_info(prefix, current->comm, current->pid);
^
  ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
   #define pr_fmt(fmt) fmt
 ^

`pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
which is an invalid string concatenation.

`pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
valid.
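
A minimal standalone illustration (the pr_info() body here is condensed
from include/linux/printk.h, not quoted verbatim):

	#define pr_fmt(fmt) fmt
	#define pr_info(fmt, ...) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

	const char *prefix = "%s[%d]: code: ";

	pr_info(prefix, comm, pid);            /* expands to
	                                          printk(KERN_INFO prefix, ...);
	                                          a string literal cannot
	                                          concatenate with a variable,
	                                          so this fails to compile */
	pr_info("%s[%d]: code: ", comm, pid);  /* all literals: concatenates fine */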

> > +
> > +   for (i = 0; i < instructions_to_print; i++) {
> > +   int instr;
> > +
> > +   if (!(i % 8) && (i > 0)) {
> > +   pr_cont("\n");
> > +   printk(prefix, current->comm, current->pid);
> > +   }
> > +
> > +#if !defined(CONFIG_BOOKE)
> > +   /* If executing with the IMMU off, adjust pc rather
> > +	 * than print XXXXXXXX.
> > +*/
> > +   if (!(regs->msr & MSR_IR))
> > +   pc = (unsigned long)phys_to_virt(pc);
>
> Shouldn't this be done outside of the loop, only once ?

I don't think so.

pc gets incremented at the bottom of the loop:

  pc += sizeof(int);

Adjusting pc is necessary at each iteration.  Leaving this block inside
the loop seems correct.

Cheers
Murilo



Re: [resend] [PATCH 0/3] powerpc/pseries: use H_BLOCK_REMOVE

2018-08-02 Thread Michael Ellerman
Nicholas Piggin  writes:
> On Fri, 27 Jul 2018 15:51:30 +0200
> Laurent Dufour  wrote:
>> [Resending so everyone is getting the cover letter]
>> On very large system we could see soft lockup fired when a process is exiting
>> 
>> watchdog: BUG: soft lockup - CPU#851 stuck for 21s! [forkoff:215523]
>> Modules linked in: pseries_rng rng_core xfs raid10 vmx_crypto btrfs 
>> libcrc32c xor zstd_decompress zstd_compress xxhash lzo_compress raid6_pq 
>> crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common 
>> dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
>> CPU: 851 PID: 215523 Comm: forkoff Not tainted 4.17.0 #1
>> NIP:  c00b995c LR: c00b8f64 CTR: aa18
>> REGS: c6b0645b7610 TRAP: 0901   Not tainted  (4.17.0)
>> MSR:  80010280b033   CR: 22042082 
>>  XER: 
>> CFAR: 006cf8f0 SOFTE: 0 
>> GPR00: 0010 c6b0645b7890 c0f99200  
>> GPR04: 8e01a5a4de58 400249cf1bfd5480 8e01a5a4de50 400249cf1bfd5480 
>> GPR08: 8e01a5a4de48 400249cf1bfd5480 8e01a5a4de40 400249cf1bfd5480 
>> GPR12:  c0001e690800 
>> NIP [c00b995c] plpar_hcall9+0x44/0x7c
>> LR [c00b8f64] pSeries_lpar_flush_hash_range+0x324/0x3d0
>> Call Trace:
>> [c6b0645b7890] [8e01a5a4dd20] 0x8e01a5a4dd20 (unreliable)
>> [c6b0645b7a00] [c006d5b0] flush_hash_range+0x60/0x110
>> [c6b0645b7a50] [c0072a2c] __flush_tlb_pending+0x4c/0xd0
>> [c6b0645b7a80] [c02eaf44] unmap_page_range+0x984/0xbd0
>> [c6b0645b7bc0] [c02eb594] unmap_vmas+0x84/0x100
>> [c6b0645b7c10] [c02f8afc] exit_mmap+0xac/0x1f0
>> [c6b0645b7cd0] [c00f2638] mmput+0x98/0x1b0
>> [c6b0645b7d00] [c00fc9d0] do_exit+0x330/0xc00
>> [c6b0645b7dc0] [c00fd384] do_group_exit+0x64/0x100
>> [c6b0645b7e00] [c00fd44c] sys_exit_group+0x2c/0x30
>> [c6b0645b7e30] [c000b960] system_call+0x58/0x6c
>> Instruction dump:
>> 6000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378 
>> e9410060 e9610068 e9810070 4422 <7d806378> e9810028 f88c f8ac0008
>> 
>> This happens when removing the PTE by calling the hypervisor using the
>> H_BULK_REMOVE call. This call is processing up to 4 PTEs but is doing a
>> tlbie for each PTE it is processing. This could lead to long time spent in
>> the hypervisor (sometimes up to 4s) and soft lockup being raised because
>> the scheduler is not called in zap_pte_range().
>> 
>> Since the POWER7 era, the hypervisor has provided a new hcall,
>> H_BLOCK_REMOVE, allowing it to process up to 8 PTEs with a single
>> tlbie. By limiting the number of tlbies generated, this reduces the
>> time spent invalidating the PTEs.
>
> Oh that's a nice feature. I must have an ancient PAPR because I don't
> have it.

The only public document is LoPAPR, which is a stripped down version of
the 2012 PAPR.

cheers


RE: [PATCH] QE GPIO: Add qe_gpio_set_multiple

2018-08-02 Thread Leo Li


> -Original Message-
> From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> Sent: Thursday, August 2, 2018 5:37 PM
> To: Leo Li ; York Sun 
> Cc: linuxppc-dev@lists.ozlabs.org; m...@ellerman.id.au; Qiang Zhao
> 
> Subject: Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
> 
> Leo, did this go anywhere ?

It has been included in my soc pull request for 4.19.  
http://lkml.iu.edu/hypermail/linux/kernel/1807.3/02295.html

Leo

> 
> Jocke
> 
> On Tue, 2018-07-03 at 16:35 +, Leo Li wrote:
> >
> > > -Original Message-
> > > From: York Sun
> > > Sent: Tuesday, July 3, 2018 10:27 AM
> > > To: jo...@infinera.com ; Leo Li
> > > 
> > > Cc: linuxppc-dev@lists.ozlabs.org; m...@ellerman.id.au; Qiang Zhao
> > > 
> > > Subject: Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
> > >
> > > +Leo
> > >
> > > On 07/03/2018 03:30 AM, Joakim Tjernlund wrote:
> > > > On Tue, 2018-06-26 at 23:41 +1000, Michael Ellerman wrote:
> > > > >
> > > > > Joakim Tjernlund  writes:
> > > > > > On Thu, 2018-06-21 at 02:38 +, Qiang Zhao wrote:
> > > > > > > On 06/19/2018 09:22 AM, Joakim Tjernlund wrote:
> > > > > > > -Original Message-
> > > > > > > From: Linuxppc-dev [mailto:linuxppc-dev-
> > >
> > > bounces+qiang.zhao=nxp@lists.ozlabs.org] On Behalf Of Joakim
> > > Tjernlund
> > > > > > > Sent: 2018-06-20 0:22
> > > > > > > To: York Sun ; linuxppc-dev  > >
> > > d...@lists.ozlabs.org>
> > > > > > > Subject: [PATCH] QE GPIO: Add qe_gpio_set_multiple
> > > > > > >
> > > > > > > This cousin to gpio-mpc8xxx was lacking a multiple pins
> > > > > > > method, add
> > >
> > > one.
> > > > > > >
> > > > > > > Signed-off-by: Joakim Tjernlund
> > > > > > > 
> > > > > > > ---
> > > > > > >  drivers/soc/fsl/qe/gpio.c | 28 
> > > > > > >  1 file changed, 28 insertions(+)
> > > > >
> > > > > ...
> > > > > > >  static int qe_gpio_dir_in(struct gpio_chip *gc, unsigned int 
> > > > > > > gpio)  {
> > > > > > > struct of_mm_gpio_chip *mm_gc =
> > > > > > > to_of_mm_gpio_chip(gc); @@ -
> > >
> > > 298,6 +325,7 @@ static int __init qe_add_gpiochips(void)
> > > > > > > gc->direction_output = qe_gpio_dir_out;
> > > > > > > gc->get = qe_gpio_get;
> > > > > > > gc->set = qe_gpio_set;
> > > > > > > +   gc->set_multiple = qe_gpio_set_multiple;
> > > > > > >
> > > > > > > ret = of_mm_gpiochip_add_data(np, mm_gc, qe_gc);
> > > > > > > if (ret)
> > > > > > >
> > > > > > > Reviewed-by: Qiang Zhao 
> > > > > > >
> > > > > >
> > > > > > Who picks up this patch ? Is it queued somewhere already?
> > > > >
> > > > > Not me.
> > > >
> > > > York? You seem to be the only one left.
> > > >
> > >
> > > I am not a Linux maintainer. Even I want to, I can't merge this patch.
> > >
> > > Leo, who can merge this patch and request a pull?
> >
> > Since it falls under the driver/soc/fsl/ folder.  I can take it.
> >
> > Regards,
> > Leo


Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple

2018-08-02 Thread Joakim Tjernlund
Leo, did this go anywhere ?

Jocke

On Tue, 2018-07-03 at 16:35 +, Leo Li wrote:
> 
> > -Original Message-
> > From: York Sun
> > Sent: Tuesday, July 3, 2018 10:27 AM
> > To: jo...@infinera.com ; Leo Li
> > 
> > Cc: linuxppc-dev@lists.ozlabs.org; m...@ellerman.id.au; Qiang Zhao
> > 
> > Subject: Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
> > 
> > +Leo
> > 
> > On 07/03/2018 03:30 AM, Joakim Tjernlund wrote:
> > > On Tue, 2018-06-26 at 23:41 +1000, Michael Ellerman wrote:
> > > > 
> > > > Joakim Tjernlund  writes:
> > > > > On Thu, 2018-06-21 at 02:38 +, Qiang Zhao wrote:
> > > > > > On 06/19/2018 09:22 AM, Joakim Tjernlund wrote:
> > > > > > -Original Message-
> > > > > > From: Linuxppc-dev [mailto:linuxppc-dev-
> > 
> > bounces+qiang.zhao=nxp@lists.ozlabs.org] On Behalf Of Joakim
> > Tjernlund
> > > > > > Sent: 2018-06-20 0:22
> > > > > > To: York Sun ; linuxppc-dev  > 
> > d...@lists.ozlabs.org>
> > > > > > Subject: [PATCH] QE GPIO: Add qe_gpio_set_multiple
> > > > > > 
> > > > > > This cousin to gpio-mpc8xxx was lacking a multiple pins method, add
> > 
> > one.
> > > > > > 
> > > > > > Signed-off-by: Joakim Tjernlund 
> > > > > > ---
> > > > > >  drivers/soc/fsl/qe/gpio.c | 28 
> > > > > >  1 file changed, 28 insertions(+)
> > > > 
> > > > ...
> > > > > >  static int qe_gpio_dir_in(struct gpio_chip *gc, unsigned int gpio) 
> > > > > >  {
> > > > > > struct of_mm_gpio_chip *mm_gc = to_of_mm_gpio_chip(gc); @@ -
> > 
> > 298,6 +325,7 @@ static int __init qe_add_gpiochips(void)
> > > > > > gc->direction_output = qe_gpio_dir_out;
> > > > > > gc->get = qe_gpio_get;
> > > > > > gc->set = qe_gpio_set;
> > > > > > +   gc->set_multiple = qe_gpio_set_multiple;
> > > > > > 
> > > > > > ret = of_mm_gpiochip_add_data(np, mm_gc, qe_gc);
> > > > > > if (ret)
> > > > > > 
> > > > > > Reviewed-by: Qiang Zhao 
> > > > > > 
> > > > > 
> > > > > Who picks up this patch ? Is it queued somewhere already?
> > > > 
> > > > Not me.
> > > 
> > > York? You seem to be the only one left.
> > > 
> > 
> > I am not a Linux maintainer. Even I want to, I can't merge this patch.
> > 
> > Leo, who can merge this patch and request a pull?
> 
> Since it falls under the driver/soc/fsl/ folder.  I can take it.
> 
> Regards,
> Leo


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > > at the details of the patch in a while but that's what I told him to
> > > implement ;-) :
> > > 
> > >   - Make virtio always use DMA ops to simplify the code path (with a set
> > > of "transparent" ops for legacy)
> > > 
> > >   and
> > > 
> > >   -  Provide an arch hook allowing us to "override" those "transparent"
> > > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > 
> > Right but as I tried to say doing that brings us to a bunch of issues
> > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> > guest to hypervisor communication.
> 
> I'm not sure I see the problem, see below
> 
> > When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> > bunch of overhead which we need to get rid of if we are to switch to
> > PLATFORM_IOMMU by default.  We need to fix that.
> 
> So let's differentiate the two problems of having an IOMMU (real or
> emulated) which indeed adds overhead etc... and using the DMA API.

Well actually it's the other way around. An iommu in theory doesn't need
to bring overhead if you set it in bypass mode.  Which does imply the
iommu supports bypass mode. Is that universally the case?  The DMA API
does, though: see Christoph's list of things it does, some of which add
overhead.

> At the moment, virtio does this all over the place:
> 
>   if (use_dma_api)
>   dma_map/alloc_something(...)
>   else
>   use_pa
> 
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
> 
>  1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
> 
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?

Seems to be :( Jason reports about 4%. I wonder whether we can support
map_sg and friends being NULL, then use that when mapping is an
identity. A conditional branch there is likely very cheap.

Would this cover all platforms with kvm (which is where we care
most about performance)?
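
As a sketch (the helper name and placement are made up; the real
dispatch would live in virtio_ring.c):

	static dma_addr_t vring_map_one(const struct vring_virtqueue *vq,
					struct page *page, unsigned long offset,
					size_t size, enum dma_data_direction dir)
	{
		struct device *dev = vring_dma_dev(vq);
		const struct dma_map_ops *ops = get_dma_ops(dev);

		/* NULL ops (or a NULL ->map_page) means identity mapping:
		 * one predictable branch instead of an indirect call.
		 */
		if (!ops || !ops->map_page)
			return page_to_phys(page) + offset;

		return dma_map_page(dev, page, offset, size, dir);
	}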

> This change stands alone, and imho "cleans" up virtio by avoiding all
> that if/else "2 path" and unless it adds a measurable overhead, should
> probably be done.
> 
>  2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
> 
> This benefits from -1- by making us just plumb in a different set of
> DMA ops we would have cooked up specifically for virtio in our arch
> code (or in virtio itself but build arch-conditionally in a separate
> file). But it doesn't strictly need -1-:
> 
> Now, -2- doesn't strictly need -1-. We could have just done another
> xen-like hack that forces the DMA API "ON" for virtio when running in a
> secure VM.
> 
> The problem if we do that however is that we also then need the arch
> PCI code to make sure it hooks up the virtio PCI devices with the
> special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
> not the same as other PCI devices. So it will have to play games such
> as checking vendor/device IDs for virtio, checking the IOMMU flag,
> etc... from the arch code which really bloody sucks when assigning PCI
> DMA ops.
> 
> However, if we do it the way we plan here, on top of -1-, with a hook
> called from virtio into the arch to "override" the virtio DMA ops, then
> we avoid the problem completely: The arch hook would only be called by
> virtio if the IOMMU flag is *not* set. IE only when using that special
> "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
> PCI dma ops as usual.
> 
> That way, we have a very clear semantic: This hook is purely about
> replacing those "null" DMA ops that just return PA introduced in -1-
> with some arch provided specially cooked up DMA ops for non-IOMMU
> virtio that know about the arch special requirements. For us bounce
> buffering.
> 
> Is there something I'm missing ?
> 
> Cheers,
> Ben.

Right so I was trying to write it up in a systematic way, but just to
give you one example, if there is a system where DMA API handles
coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU
flag causes that to happen.

And we kinda worked around this without the IOMMU by basically saying
"ok we do not really need DMA API so let's just bypass it" and
it was kind of ok except now everyone is switching to vIOMMU
just in case. So now people do want some parts of what DMA API does,
such as the bounce 

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > at the details of the patch in a while but that's what I told him to
> > implement ;-) :
> > 
> >   - Make virtio always use DMA ops to simplify the code path (with a set
> > of "transparent" ops for legacy)
> > 
> >   and
> > 
> >   -  Provide an arch hook allowing us to "override" those "transparent"
> > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > 
> > Cheers,
> > Ben.
> > 
> 
> Right but as I tried to say doing that brings us to a bunch of issues
> with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> guest to hypervisor communication.

I'm not sure I see the problem, see below

> When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> bunch of overhead which we need to get rid of if we are to switch to
> PLATFORM_IOMMU by default.  We need to fix that.

So let's differentiate the two problems of having an IOMMU (real or
emulated) which indeed adds overhead etc... and using the DMA API.

At the moment, virtio does this all over the place:

if (use_dma_api)
dma_map/alloc_something(...)
else
use_pa

The idea of the patch set is to do two, somewhat orthogonal, changes
that together achieve what we want. Let me know where you think there
is "a bunch of issues" because I'm missing it:

 1- Replace the above if/else constructs with just calling the DMA API,
and have virtio, at initialization, hookup its own dma_ops that just
"return pa" (roughly) when the IOMMU stuff isn't used.

This adds an indirect function call to the path that previously didn't
have one (the else case above). Is that a significant/measurable
overhead ?

This change stands alone, and imho "cleans" up virtio by avoiding all
that if/else "2 path" and unless it adds a measurable overhead, should
probably be done.
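
As a sketch of -1-, those "return pa" ops could look like this (the
structure name is from Anshuman's cover letter; the body here is
illustrative, not the posted patch):

	static dma_addr_t virtio_direct_map_page(struct device *dev,
						 struct page *page,
						 unsigned long offset, size_t size,
						 enum dma_data_direction dir,
						 unsigned long attrs)
	{
		/* Legacy behaviour: hand the guest PA straight through */
		return page_to_phys(page) + offset;
	}

	static const struct dma_map_ops virtio_direct_dma_ops = {
		.map_page	= virtio_direct_map_page,
		/* .unmap_page, .map_sg, .alloc, ... equally direct */
	};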

 2- Make virtio use the DMA API with our custom platform-provided
swiotlb callbacks when needed, that is when not using IOMMU *and*
running on a secure VM in our case.

This benefits from -1- by making us just plumb in a different set of
DMA ops we would have cooked up specifically for virtio in our arch
code (or in virtio itself but build arch-conditionally in a separate
file). But it doesn't strictly need -1-:

Now, -2- doesn't strictly need -1-. We could have just done another
xen-like hack that forces the DMA API "ON" for virtio when running in a
secure VM.

The problem if we do that however is that we also then need the arch
PCI code to make sure it hooks up the virtio PCI devices with the
special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
not the same as other PCI devices. So it will have to play games such
as checking vendor/device IDs for virtio, checking the IOMMU flag,
etc... from the arch code which really bloody sucks when assigning PCI
DMA ops.

However, if we do it the way we plan here, on top of -1-, with a hook
called from virtio into the arch to "override" the virtio DMA ops, then
we avoid the problem completely: The arch hook would only be called by
virtio if the IOMMU flag is *not* set. IE only when using that special
"hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
PCI dma ops as usual.

That way, we have a very clear semantic: This hook is purely about
replacing those "null" DMA ops that just return PA introduced in -1-
with some arch provided specially cooked up DMA ops for non-IOMMU
virtio that know about the arch special requirements. For us bounce
buffering.

Is there something I'm missing ?

Cheers,
Ben.






Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> This patch series is the follow up on the discussions we had before about
> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> for virtio devices (https://patchwork.kernel.org/patch/10417371/). There
> were suggestions about doing away with two different paths of transactions
> with the host/QEMU, first being the direct GPA and the other being the DMA
> API based translations.
> 
> First patch attempts to create a direct GPA mapping based DMA operations
> structure called 'virtio_direct_dma_ops' with exact same implementation
> of the direct GPA path which virtio core currently has but just wrapped in
> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> existing semantics. The second patch does exactly that inside the function
> virtio_finalize_features(). The third patch removes the default direct GPA
> path from virtio core forcing it to use DMA API callbacks for all devices.
> Now with that change, every device must have a DMA operations structure
> associated with it. The fourth patch adds an additional hook which gives
> the platform an opportunity to do yet another override if required. This
> platform hook can be used on POWER Ultravisor based protected guests to
> load up SWIOTLB DMA callbacks to do the required (as discussed previously
> in the above mentioned thread how host is allowed to access only parts of
> the guest GPA range) bounce buffering into the shared memory for all I/O
> scatter gather buffers to be consumed on the host side.
> 
> Please go through these patches and review whether this approach broadly
> makes sense. I will appreciate suggestions, inputs, comments regarding
> the patches or the approach in general. Thank you.

Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.

I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.


> Anshuman Khandual (4):
>   virtio: Define virtio_direct_dma_ops structure
>   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>   virtio: Force virtio core to use DMA API callbacks for all virtio devices
>   virtio: Add platform specific DMA API translation for virtio devices
> 
>  arch/powerpc/include/asm/dma-mapping.h |  6 +++
>  arch/powerpc/platforms/pseries/iommu.c |  6 +++
>  drivers/virtio/virtio.c| 72 
> ++
>  drivers/virtio/virtio_pci_common.h |  3 ++
>  drivers/virtio/virtio_ring.c   | 65 +-
>  5 files changed, 89 insertions(+), 63 deletions(-)
> 
> -- 
> 2.9.3


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > > but it's not, VMs are
> > > created in "legacy" mode all the times and we don't know at VM creation
> > > time whether it will become a secure VM or not. The way our secure VMs
> > > work is that they start as a normal VM, load a secure "payload" and
> > > call the Ultravisor to "become" secure.
> > > 
> > > So we're in a bit of a bind here. We need that one-liner optional arch
> > > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > > 
> > > Ben.
> > 
> > And just to make sure I understand, on your platform DMA APIs do include
> > some of the cache flushing tricks and this is why you don't want to
> > declare iommu support in the hypervisor?
> 
> I'm not sure I parse what you mean.
> 
> We don't need cache flushing tricks.

You don't but do real devices on same platform need them?

> The problem we have with our
> "secure" VMs is that:
> 
>  - At VM creation time we have no idea it's going to become a secure
> VM, qemu doesn't know anything about it, and thus qemu (or other
> management tools, libvirt etc...) are going to create "legacy" (ie
> iommu bypass) virtio devices.
> 
>  - Once the VM goes secure (early during boot but too late for qemu),
> it will need to make virtio do bounce-buffering via swiotlb because
> qemu cannot physically access most VM pages (blocked by HW security
> features), we need to bounce buffer using some unsecure pages that are
> accessible to qemu.
> 
> That said, I wouldn't object for us to more generally switch long run
> to changing qemu so that virtio on powerpc starts using the IOMMU as a
> default provided we fix our guest firmware to understand it (it
> currently doesn't), and provided we verify that the performance impact
> on things like vhost is negligible.
> 
> Cheers,
> Ben.
> 


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 12:53:41PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
> > 
> > I see. So yes, given that device does not know or care, using
> > virtio features is an awkward fit.
> > 
> > So let's say as a quick fix for you maybe we could generalize the
> > xen_domain hack, instead of just checking xen_domain check some static
> > branch.  Then teach xen and others to enable that.
> 
> > OK but problem then becomes this: if you do this and virtio device appears
> > behind a vIOMMU and it does not advertize the IOMMU flag, the
> > code will try to use the vIOMMU mappings and fail.
> >
> > It does look like even with trick above, you need a special version of
> > DMA ops that does just swiotlb but not any of the other things DMA API
> > might do.
> > 
> > Thoughts?
> 
> Yes, this is the purpose of Anshuman original patch (I haven't looked
> at the details of the patch in a while but that's what I told him to
> implement ;-) :
> 
>  - Make virtio always use DMA ops to simplify the code path (with a set
> of "transparent" ops for legacy)
> 
>  and
> 
>  -  Provide an arch hook allowing us to "override" those "transparent"
> DMA ops with some custom ones that do the appropriate swiotlb gunk.
> 
> Cheers,
> Ben.
> 

Right but as I tried to say doing that brings us to a bunch of issues
with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
guest to hypervisor communication.

When we do (as is the case with PLATFORM_IOMMU right now) this adds a
bunch of overhead which we need to get rid of if we are to switch to
PLATFORM_IOMMU by default.  We need to fix that.

-- 
MST


vio.c:__vio_register_driver && LPAR Migration issue

2018-08-02 Thread Michael Bringmann
Hello:
I have been observing an anomaly during LPAR migrations between
a couple of P8 systems.

This is the problem.  After migrating an LPAR, the PPC mobility code
receives RTAS requests to delete nodes with platform-/hardware-specific
attributes when restarting the kernel after a migration.  My example is
for migration between a P8 Alpine and a P8 Brazos.  Among the nodes
that I see being deleted are 'ibm,random-v1', 'ibm,compression-v1',
'ibm,platform-facilities', and 'ibm,sym-encryption-v1'.  Of these
nodes, the following are created during initial boot by calls to
vio_register_driver:

drivers/char/hw_random/pseries-rng.c
ibm,random-v1

drivers/crypto/nx/nx-842-pseries.c
ibm,compression-v1

drivers/crypto/nx/nx.c
ibm,sym-encryption-v1

After the migration, these nodes are deleted, but nothing recreates
them.  If I boot the LPAR on the target system, the nodes are added
again.

My question is how do we recreate these nodes after migration?

Thanks.

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
> 
> I see. So yes, given that device does not know or care, using
> virtio features is an awkward fit.
> 
> So let's say as a quick fix for you maybe we could generalize the
> xen_domain hack, instead of just checking xen_domain check some static
> branch.  Then teach xen and others to enable that.

> OK but problem then becomes this: if you do this and virtio device appears
> behind a vIOMMU and it does not advertise the IOMMU flag, the
> code will try to use the vIOMMU mappings and fail.
>
> It does look like even with trick above, you need a special version of
> DMA ops that does just swiotlb but not any of the other things DMA API
> might do.
> 
> Thoughts?

Yes, this is the purpose of Anshuman original patch (I haven't looked
at the details of the patch in a while but that's what I told him to
implement ;-) :

 - Make virtio always use DMA ops to simplify the code path (with a set
of "transparent" ops for legacy)

 and

 -  Provide an arch hook allowing us to "override" those "transparent"
DMA ops with some custom ones that do the appropriate swiotlb gunk.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 11:01:26AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
> > 
> > > I don't completely agree:
> > > 
> > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > > for example. It indicates that the peer bypasses the normal platform
> > > iommu. The platform code in the guest has no real way to know that this
> > > is the case, this is a specific "feature" of the qemu implementation.
> > > 
> > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > > property of the guest platform itself (not qemu), there's no way the
> > > "peer" can advertize it via the virtio negociated flags. At least for
> > > us. I don't know for sure whether that would be workable for the ARM
> > > case. In our case, qemu has no idea at VM creation time that the VM
> > > will turn itself into a secure VM and thus will require bounce
> > > buffering for IOs (including virtio).
> > > 
> > > So unless we have another hook for the arch code to set
> > > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > > guest itself, I don't see that as a way to deal with it.
> > > 
> > > >  The other issue is VIRTIO_F_IO_BARRIER
> > > > which is very vaguely defined, and which needs a better definition.
> > > > And last but not least we'll need some text explaining the challenges
> > > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + 
> > > > VIRTIO_F_IO_BARRIER
> > > > is what would basically cover them, but a good description including
> > > > an explanation of why these matter.
> > > 
> > > Ben.
> > > 
> > 
> > So is it true that from qemu point of view there is nothing special
> > going on?  You pass in a PA, host writes there.
> 
> Yes, qemu doesn't see a difference. It's the guest that will bounce the
> pages via a pool of "insecure" pages that qemu can access. Normal pages
> in a secure VM come from PAs that qemu cannot physically access.
> 
> Cheers,
> Ben.
> 

I see. So yes, given that device does not know or care, using
virtio features is an awkward fit.

So let's say as a quick fix for you maybe we could generalize the
xen_domain hack, instead of just checking xen_domain check some static
branch.  Then teach xen and others to enable that.
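
A sketch of what that quick fix could look like - the static key name is
made up, the rest follows how vring_use_dma_api() is structured today:

	DEFINE_STATIC_KEY_FALSE(virtio_force_dma_api);

	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		if (!virtio_has_iommu_quirk(vdev))
			return true;

		/* Generalized form of the xen_domain() hack: any platform
		 * (Xen, secure guests, ...) can flip this early in boot.
		 */
		if (static_branch_unlikely(&virtio_force_dma_api))
			return true;

		if (xen_domain())
			return true;

		return false;
	}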
OK but problem then becomes this: if you do this and virtio device appears
behind a vIOMMU and it does not advertise the IOMMU flag, the
code will try to use the vIOMMU mappings and fail.

It does look like even with trick above, you need a special version of
DMA ops that does just swiotlb but not any of the other things DMA API
might do.

Thoughts?

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
> 
> > I don't completely agree:
> > 
> > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > for example. It indicates that the peer bypasses the normal platform
> > iommu. The platform code in the guest has no real way to know that this
> > is the case, this is a specific "feature" of the qemu implementation.
> > 
> > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > property of the guest platform itself (not qemu), there's no way the
> > "peer" can advertize it via the virtio negociated flags. At least for
> > us. I don't know for sure whether that would be workable for the ARM
> > case. In our case, qemu has no idea at VM creation time that the VM
> > will turn itself into a secure VM and thus will require bounce
> > buffering for IOs (including virtio).
> > 
> > So unless we have another hook for the arch code to set
> > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > guest itself, I don't see that as a way to deal with it.
> > 
> > >  The other issue is VIRTIO_F_IO_BARRIER
> > > which is very vaguely defined, and which needs a better definition.
> > > And last but not least we'll need some text explaining the challenges
> > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > > is what would basically cover them, but a good description including
> > > an explanation of why these matter.
> > 
> > Ben.
> > 
> 
> So is it true that from qemu point of view there is nothing special
> going on?  You pass in a PA, host writes there.

Yes, qemu doesn't see a difference. It's the guest that will bounce the
pages via a pool of "insecure" pages that qemu can access. Normal pages
in a secure VM come from PAs that qemu cannot physically access.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 10:24:57AM -0500, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> > We just need to figure out how to deal with devices that deviate
> > from the default.  One things is that VIRTIO_F_IOMMU_PLATFORM really
> > should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> > dma tweaks (offsets, cache flushing), which seems well in spirit of
> > the original design. 
> 
> I don't completely agree:
> 
> 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> for example. It indicates that the peer bypasses the normal platform
> iommu. The platform code in the guest has no real way to know that this
> is the case, this is a specific "feature" of the qemu implementation.
> 
> 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> property of the guest platform itself (not qemu), there's no way the
> "peer" can advertize it via the virtio negociated flags. At least for
> us. I don't know for sure whether that would be workable for the ARM
> case. In our case, qemu has no idea at VM creation time that the VM
> will turn itself into a secure VM and thus will require bounce
> buffering for IOs (including virtio).
> 
> So unless we have another hook for the arch code to set
> VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> guest itself, I don't see that as a way to deal with it.
> 
> >  The other issue is VIRTIO_F_IO_BARRIER
> > which is very vaguely defined, and which needs a better definition.
> > And last but not least we'll need some text explaining the challenges
> > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > is what would basically cover them, but a good description including
> > an explanation of why these matter.
> 
> Ben.
> 

So is it true that from qemu point of view there is nothing special
going on?  You pass in a PA, host writes there.


-- 
MST


[PATCH] powernv/cpuidle: Fix idle states all being marked invalid

2018-08-02 Thread Nicholas Piggin
Commit 9c7b185ab2 ("powernv/cpuidle: Parse dt idle properties into
global structure") parses dt idle states into structs, but never
marks them valid. This results in all idle states being lost.

Cc: Akshay Adiga 
Cc: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/idle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 3116bab10aa3..ecb002c5db83 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -651,11 +651,12 @@ static int __init pnv_power9_idle_init(void)
 		err = validate_psscr_val_mask(&state->psscr_val,
 					      &state->psscr_mask,
 					      state->flags);
if (err) {
-   state->valid = false;
report_invalid_psscr_val(state->psscr_val, err);
continue;
}
 
+   state->valid = true;
+
if (max_residency_ns < state->residency_ns) {
max_residency_ns = state->residency_ns;
pnv_deepest_stop_psscr_val = state->psscr_val;
-- 
2.17.0



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > but it's not, VMs are
> > created in "legacy" mode all the times and we don't know at VM creation
> > time whether it will become a secure VM or not. The way our secure VMs
> > work is that they start as a normal VM, load a secure "payload" and
> > call the Ultravisor to "become" secure.
> > 
> > So we're in a bit of a bind here. We need that one-liner optional arch
> > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > 
> > Ben.
> 
> And just to make sure I understand, on your platform DMA APIs do include
> some of the cache flushing tricks and this is why you don't want to
> declare iommu support in the hypervisor?

I'm not sure I parse what you mean.

We don't need cache flushing tricks. The problem we have with our
"secure" VMs is that:

 - At VM creation time we have no idea it's going to become a secure
VM, qemu doesn't know anything about it, and thus qemu (or other
management tools, libvirt etc...) are going to create "legacy" (ie
iommu bypass) virtio devices.

 - Once the VM goes secure (early during boot but too late for qemu),
it will need to make virtio do bounce-buffering via swiotlb, because
qemu cannot physically access most VM pages (blocked by HW security
features); we need to bounce-buffer through some insecure pages that
are accessible to qemu.

That said, I wouldn't object to us more generally switching, in the
long run, to having qemu make virtio on powerpc use the IOMMU by
default, provided we fix our guest firmware to understand it (it
currently doesn't), and provided we verify that the performance impact
on things like vhost is negligible.
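
As a rough sketch of the "one-liner optional arch hook" idea discussed
in this thread (the hook name arch_virtio_force_dma_api() is an
illustrative assumption, not an existing kernel interface), the
decision point in the virtio ring code could look like this:

#include <linux/virtio_config.h>

/* Hypothetical hook; the default says "trust the negotiated flags". */
bool __weak arch_virtio_force_dma_api(struct virtio_device *vdev)
{
	return false;
}

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	/* Device claims to sit behind the platform IOMMU/DMA machinery. */
	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
		return true;

	/*
	 * Secure-VM case sketched above: qemu negotiated "IOMMU bypass"
	 * at VM creation, but the guest later turned secure, so the arch
	 * forces the DMA API (and hence swiotlb bounce buffering).
	 */
	return arch_virtio_force_dma_api(vdev);
}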

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> We just need to figure out how to deal with devices that deviate
> from the default.  One thing is that VIRTIO_F_IOMMU_PLATFORM really
> should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> dma tweaks (offsets, cache flushing), which seems well in spirit of
> the original design. 

I don't completely agree:

1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
for example. It indicates that the peer bypasses the normal platform
iommu. The platform code in the guest has no real way to know that this
is the case; this is a specific "feature" of the qemu implementation.

2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
property of the guest platform itself (not qemu), there's no way the
"peer" can advertize it via the virtio negociated flags. At least for
us. I don't know for sure whether that would be workable for the ARM
case. In our case, qemu has no idea at VM creation time that the VM
will turn itself into a secure VM and thus will require bounce
buffering for IOs (including virtio).

So unless we have another hook for the arch code to set
VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
guest itself, I don't see that as a way to deal with it.

>  The other issue is VIRTIO_F_IO_BARRIER
> which is very vaguely defined, and which needs a better definition.
> And last but not least we'll need some text explaining the challenges
> of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> is what would basically cover them, but a good description including
> an explanation of why these matter.

Ben.




Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig

2018-08-02 Thread Christoph Hellwig
On Thu, Aug 02, 2018 at 09:45:44AM -0400, Alexei Colin wrote:
> If I move RapidIO option to drivers/Kconfig, then it won't appear under
> the Bus Options/System Support menu, along with other choices for the
> system bus (PCI, PCMCIA, ISA, TC, ...).

It's not like anyone cares.  And FYI, moving all those to
drivers/Kconfig is something I will submit for the next merge window.

> Alex explains above that RapidIO
> may be selected as a system bus on some architectures, and users expect
> it to be in the menu in which it has been for some time now.  What
> problem is the current organization causing?

A "system bus" seems like a rather outdated concept to start with.


Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig

2018-08-02 Thread Alex Bounine
I will experiment with this idea.
Please keep me in CC as well.

Sent from my iPad

> On Aug 2, 2018, at 10:00 AM, Alexei Colin  wrote:
> 
>> On Thu, Aug 02, 2018 at 06:54:21AM -0700, Christoph Hellwig wrote:
>>> On Thu, Aug 02, 2018 at 09:45:44AM -0400, Alexei Colin wrote:
>>> If I move RapidIO option to drivers/Kconfig, then it won't appear under
>>> the Bus Options/System Support menu, along with other choices for the
>>> system bus (PCI, PCMCIA, ISA, TC, ...).
>> 
>> It's not like anyone cares.  And FYI, moving all those to
>> drivers/Kconfig is something I will submit for the next merge window.
> 
> Ok, I would appreciate a CC on that patch. After it lands, if
> there is time, I'll resubmit a rebased version of this patch.


Re: [RFC PATCH 3/3] cpuidle/powernv: Conditionally save-restore sprs using opal

2018-08-02 Thread Nicholas Piggin
On Thu,  2 Aug 2018 10:21:32 +0530
Akshay Adiga  wrote:

> From: Abhishek Goel 
> 
> If a state has "opal-supported" compat flag in device-tree, an opal call
> needs to be made during the entry and exit of the stop state. This patch
> passes a hint to the power9_idle_stop and power9_offline_stop.
> 
> This patch moves the saving and restoring of sprs for P9 cpuidle
> from kernel to opal. This patch still uses existing code to detect
> first thread in core.
> In an attempt to make the powernv idle code backward compatible,
> and to some extent forward compatible, add support for pre-stop entry
> and post-stop exit actions in OPAL. If a kernel knows about this
> opal call, then just a firmware supporting newer hardware is required,
> instead of waiting for kernel updates.

Still think we should make these do-everything calls. Including
executing nap/stop instructions, restoring timebase, possibly even
saving and restoring SLB (although a return code could be used to
tell the kernel to do that maybe if performance advantage is enough).

I haven't had a lot of time to go through it; I'm working on moving
~all of idle_book3s.S to C code, and I'd like to do that before this
OPAL idle driver if possible.

A minor thing I just noticed: you don't have to allocate the OPAL
SPR save space in Linux, just do it all in OPAL.

Thanks,
Nick


Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig

2018-08-02 Thread Alexei Colin
On Thu, Aug 02, 2018 at 06:54:21AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 02, 2018 at 09:45:44AM -0400, Alexei Colin wrote:
> > If I move RapidIO option to drivers/Kconfig, then it won't appear under
> > the Bus Options/System Support menu, along with other choices for the
> > system bus (PCI, PCMCIA, ISA, TC, ...).
> 
> It's not like anyone cares.  And FYI, moving all those to
> drivers/Kconfig is something I will submit for the next merge window.

Ok, I would appreciate a CC on that patch. After it lands, if
there is time, I'll resubmit a rebased version of this patch.


Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig

2018-08-02 Thread Alexei Colin
On Thu, Aug 02, 2018 at 10:57:00AM +0200, Geert Uytterhoeven wrote:
> On Wed, Aug 1, 2018 at 3:16 PM Alex Bounine  wrote:
> > On 2018-08-01 05:54 AM, Christoph Hellwig wrote:
> > > On Tue, Jul 31, 2018 at 10:29:54AM -0400, Alexei Colin wrote:
> > >> Platforms with a PCI bus will be offered the RapidIO menu since they may
> > >> want support for a RapidIO PCI device. Platforms without a PCI bus
> > >> that might include a RapidIO IP block will need to "select HAS_RAPIDIO"
> > >> in the platform-/machine-specific "config ARCH_*" Kconfig entry.
> > >>
> > >> Tested that kernel builds for arm64 with RapidIO subsystem and
> > >> switch drivers enabled, also that the modules load successfully
> > > on a custom AArch64 QEMU model.
> > >
> > > As said before, please include it from drivers/Kconfig so that _all_
> > > architectures supporting PCI (or other Rapidio attachements) get it
> > > and not some arbitrary selection of architectures.
> 
> +1
> 
> > As was replied earlier, this is not a random selection of
> > architectures, but only the ones that implement support for RapidIO
> > as a system bus. If other architectures choose to adopt RapidIO, we
> > will include them as well.
> >
> > On some platforms RapidIO can be the only system bus available replacing
> > PCI/PCIe or RapidIO can coexist with PCIe.
> >
> > As it is done now, RapidIO is configured in the "Bus Options" (x86/PPC)
> > or "Bus Support" (ARM) sub-menus, and as a system configuration option
> > it should be kept this way.
> >
> > The current location of the RAPIDIO configuration option is familiar
> > to users of PowerPC and x86 platforms, and is similarly available in
> > some ARM manufacturers' kernel code trees.
> >
> > drivers/Kconfig will be used for configuring drivers for peripheral
> > RapidIO devices if/when such device drivers are published.
> 
> Everything in drivers/rapidio/Kconfig depends on RAPIDIO (probably it should
> use a big if RAPIDIO/endif instead), so it can just be included from
> drivers/Kconfig now.
> 
> The sooner you do that, the less treewide changes are needed (currently
> limited to mips, powerpc, and x86; your patch adds arm64).

If I move RapidIO option to drivers/Kconfig, then it won't appear under
the Bus Options/System Support menu, along with other choices for the
system bus (PCI, PCMCIA, ISA, TC, ...). Alex explains above that RapidIO
may be selected as a system bus on some architectures, and users expect
it to be in the menu in which it has been for some time now.  What
problem is the current organization causing?

This patch does not intend to propose any changes to the current
organization of the menus, if such changes are needed, another patch can
be made with that purpose. What problem is this patch introducing?


[PATCH 2/2] powerpc/cpu: post the event cpux add/remove instead of online/offline during hotplug

2018-08-02 Thread Pingfan Liu
Technically speaking, echo 1/0 > cpuX/online is only a subset of cpu
hotplug/unplug, i.e. add/remove. The latter includes the physical
adding/removing of a cpu device. Some user-space tools such as
kexec-tools rely on the add/remove events to automatically rebuild the
dtb. If the dtb is not rebuilt correctly, the 2nd kernel may hang due
to the missing boot-cpu-hwid info in the dtb.
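
For reference, here is a minimal userspace sketch of how a tool in
kexec-tools' position can watch these add/remove uevents, using libudev
(build with -ludev). It is illustrative only, not kexec-tools' actual
code:

#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon =
		udev_monitor_new_from_netlink(udev, "udev");

	udev_monitor_filter_add_match_subsystem_devtype(mon, "cpu", NULL);
	udev_monitor_enable_receiving(mon);

	for (;;) {
		struct pollfd pfd = {
			.fd = udev_monitor_get_fd(mon),
			.events = POLLIN,
		};
		struct udev_device *dev;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		/* "add"/"remove" only fire if the kernel actually
		 * registers/unregisters the cpu device; a bare
		 * echo 0/1 > online yields "online"/"offline". */
		printf("cpu event: %s %s\n",
		       udev_device_get_action(dev),
		       udev_device_get_sysname(dev));
		udev_device_unref(dev);
	}
}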

The steps to trigger the bug: (suppose 8 threads/core)
drmgr -c cpu -r -q 1
systemctl restart kdump.service
drmgr -c cpu -a -q 1
taskset -c 11 sh -c "echo c > /proc/sysrq-trigger"

Then, failure info:
[  205.299528] SysRq : Trigger a crash
[  205.299551] Unable to handle kernel paging request for data at address 
0x0000000000000000
[  205.299558] Faulting instruction address: 0xc06001a0
[  205.299564] Oops: Kernel access of bad area, sig: 11 [#1]
[  205.299569] SMP NR_CPUS=2048 NUMA pSeries
[  205.299575] Modules linked in: macsec sctp_diag sctp tcp_diag udp_diag 
inet_diag unix_diag af_packet_diag netlink_diag ip6t_rpfilter ipt_REJECT 
nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_security ip6table_raw iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_mangle iptable_security iptable_raw ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter xfs libcrc32c sg
pseries_rng binfmt_misc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif 
crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt 
dm_mirror dm_region_hash dm_log dm_mod
[  205.299658] CPU: 11 PID: 2521 Comm: bash Not tainted 
3.10.0-799.el7.ppc64le #1
[  205.299664] task: c0017bcd15e0 ti: c0014f41 task.ti: 
c0014f41
[  205.299670] NIP: c06001a0 LR: c0600ddc CTR: 
c0600180
[  205.299676] REGS: c0014f413a70 TRAP: 0300   Not tainted  
(3.10.0-799.el7.ppc64le)
[  205.299681] MSR: 80009033   CR: 28222822  
XER: 0001
[  205.299696] CFAR: c0009368 DAR:  DSISR: 4200 
SOFTE: 1
GPR00: c0600dbc c0014f413cf0 c1263200 0063
GPR04: c19ca818 c19db5f8 00c2 c140aa30
GPR08: 0007 0001  c140fc60
GPR12: c0600180 c7b36300 10139e58 4000
GPR16: 1013b5d0  101306fc 10139de4
GPR20: 10139de8 10093150  
GPR24: 1013b5e0 100fa0e8 0007 c11af1c8
GPR28: 0063 c11af588 c1179ba8 0002
[  205.299770] NIP [c06001a0] sysrq_handle_crash+0x20/0x30
[  205.299776] LR [c0600ddc] write_sysrq_trigger+0x10c/0x230
[  205.299781] Call Trace:
[  205.299786] [c0014f413cf0] [c0600dbc] 
write_sysrq_trigger+0xec/0x230 (unreliable)
[  205.299794] [c0014f413d90] [c03eb2c4] 
proc_reg_write+0x84/0x120
[  205.299801] [c0014f413dd0] [c0330a80] SyS_write+0x150/0x400
[  205.299808] [c0014f413e30] [c000a184] system_call+0x38/0xb4
[  205.299813] Instruction dump:
[  205.299816] 409effb8 7fc3f378 4bfff381 4bac 3c4c00c6 38423080 
3d42fff1 394a6930
[  205.299827] 3921 912a 7c0004ac 3940 <992a> 4e800020 
6000 6042
[  205.299838] ---[ end trace f590a5dbd3f63aab ]---
[  205.301812]
[  205.301829] Sending IPI to other CPUs
[  205.302846] IPI complete
I'm in purgatory
  -- > hang up here

This patch uses the register_cpu()/unregister_cpu() interface to fix the problem.

Tested for compatibility with ppc64_cpu on a PowerKVM guest, with the
following topology:
  Thread(s) per core:  8
  Core(s) per socket:  2
  Socket(s):   2
  NUMA node(s):1
and the following instructions:
  ppc64_cpu --smt=off
  drmgr -c cpu -r 1
  drmgr -c cpu -a 1
  ppc64_cpu --smt=on
or
  drmgr -c cpu -r 1
  ppc64_cpu --smt=off
  drmgr -c cpu -a 1
  ppc64_cpu --smt=on

Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Tyrel Datwyler 
Cc: linux...@vger.kernel.org
Signed-off-by: Pingfan Liu 
---
v2 -> v3
  create sysfs file "dev_attr_physical_id" before cpu online callbacks,
  hence it can work with ppc64_cpu
---
 arch/powerpc/include/asm/setup.h |  4 +++
 arch/powerpc/include/asm/smp.h   |  1 +
 arch/powerpc/kernel/sysfs.c  | 38 +++-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 16 +++-
 4 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 8721fd0..e0597f2 100644
--- a/arch/powerpc/include/asm/setup.h
+++ 

[PATCH 1/2] powerpc/cpuidle: dynamically register/unregister cpuidle_device during hotplug

2018-08-02 Thread Pingfan Liu
cpuidle_device is touched during cpu hotplug. At present, ppc64 just
onlines/offlines cpus during hotplug/unplug. But when using the
register_cpu()/unregister_cpu() API to implement hotplug, the directory
/sys/../cpuX is created/destroyed during hotplug, hence we also need to
create the cpuX/cpuidle file dynamically.

Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Tyrel Datwyler 
Cc: linux...@vger.kernel.org
Signed-off-by: Pingfan Liu 
---
 drivers/cpuidle/cpuidle-powernv.c | 2 ++
 drivers/cpuidle/cpuidle-pseries.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index d29e4f0..94d0def 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -166,6 +166,7 @@ static int powernv_cpuidle_cpu_online(unsigned int cpu)
struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
 
if (dev && cpuidle_get_driver()) {
+   cpuidle_register_device(dev);
cpuidle_pause_and_lock();
cpuidle_enable_device(dev);
cpuidle_resume_and_unlock();
@@ -181,6 +182,7 @@ static int powernv_cpuidle_cpu_dead(unsigned int cpu)
cpuidle_pause_and_lock();
cpuidle_disable_device(dev);
cpuidle_resume_and_unlock();
+   cpuidle_unregister_device(dev);
}
return 0;
 }
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 9e56bc4..a53be8a 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -193,6 +193,7 @@ static int pseries_cpuidle_cpu_online(unsigned int cpu)
struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
 
if (dev && cpuidle_get_driver()) {
+   cpuidle_register_device(dev);
cpuidle_pause_and_lock();
cpuidle_enable_device(dev);
cpuidle_resume_and_unlock();
@@ -208,6 +209,7 @@ static int pseries_cpuidle_cpu_dead(unsigned int cpu)
cpuidle_pause_and_lock();
cpuidle_disable_device(dev);
cpuidle_resume_and_unlock();
+   cpuidle_unregister_device(dev);
}
return 0;
 }
-- 
2.7.4



[PATCH] ppc64_cpu: utilize cpu/present info to cope with dynamic sysfs

2018-08-02 Thread Pingfan Liu
At present, ppc64_cpu assumes statically contiguous cpu ids, i.e.
from 0 to threads_in_system. This has not been a problem so far, since
the kernel code ensures the continuity. But because kexec-tools needs
the CPU_ADD/_REMOVE udev event messages instead of CPU_ONLINE/_OFFLINE,
the kernel will resort to the register_cpu/unregister_cpu API to achieve
this. Then, unplugging a core will make a hole in cpu_present_mask,
which breaks the continuity. To address this, this patch uses the
cpu/present info to build a bitmap and iterates over that bitmap to
cope with the discontinuity. This way, ppc64_cpu can work with both
old and new kernels.

Notes about the kexec-tools issue: (tested with Fedora28)
Some user-space tools such as kexec-tools rely on the add/remove events
to automatically rebuild the dtb. If the dtb is not rebuilt correctly,
the 2nd kernel may hang due to the missing boot-cpu-hwid info in the dtb.

The steps to trigger the bug: (suppose 8 threads/core)
drmgr -c cpu -r -q 1
systemctl restart kdump.service
drmgr -c cpu -a -q 1
taskset -c 11 sh -c "echo c > /proc/sysrq-trigger"

Then, failure info:
[  205.299528] SysRq : Trigger a crash
[  205.299551] Unable to handle kernel paging request for data at address 
0x0000000000000000
[  205.299558] Faulting instruction address: 0xc06001a0
[  205.299564] Oops: Kernel access of bad area, sig: 11 [#1]
[  205.299569] SMP NR_CPUS=2048 NUMA pSeries
[-- cut --]
[  205.301829] Sending IPI to other CPUs
[  205.302846] IPI complete
I'm in purgatory
  -- > hang up here

Cc: Tyrel Datwyler 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu 
---
 src/ppc64_cpu.c | 205 
 1 file changed, 176 insertions(+), 29 deletions(-)

diff --git a/src/ppc64_cpu.c b/src/ppc64_cpu.c
index 34654b4..cd5997d 100644
--- a/src/ppc64_cpu.c
+++ b/src/ppc64_cpu.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -49,7 +50,8 @@
 
 #define PPC64_CPU_VERSION  "1.2"
 
-#define SYSFS_CPUDIR   "/sys/devices/system/cpu/cpu%d"
+#define SYSFS_CPUDIR "/sys/devices/system/cpu"
+#define SYSFS_PERCPUDIR "/sys/devices/system/cpu/cpu%d"
 #define SYSFS_SUBCORES "/sys/devices/system/cpu/subcores_per_core"
 #define DSCR_DEFAULT_PATH "/sys/devices/system/cpu/dscr_default"
 #define INTSERV_PATH   "/proc/device-tree/cpus/%s/ibm,ppc-interrupt-server#s"
@@ -75,17 +77,161 @@ struct cpu_freq {
 
 static int threads_per_cpu = 0;
 static int cpus_in_system = 0;
-static int threads_in_system = 0;
 
 static int do_info(void);
 
+/* 64 bits system */
+#define BITS_PER_LONG  64
+#define BIT_MASK(nr)   (1UL << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr)   ((nr) / BITS_PER_LONG)
+
+static unsigned long *cpu_present_mask;
+static unsigned int max_cpu_id = (unsigned int)-1;
+
+/* @nr: the position prior to the place to search */
+static unsigned int cpumask_next(int nr, unsigned long *addr)
+{
+   unsigned int bit_num, i, j;
+   unsigned long *p;
+
+   p = addr + BIT_WORD(nr);
+   for (i = nr+1; i < max_cpu_id; ) {
+   for (j = i % BITS_PER_LONG; j < BITS_PER_LONG; j++) {
+   if ((*p >> j) & 0x1) {
+   bit_num = BIT_WORD(i)*BITS_PER_LONG + j;
+   return bit_num;
+   }
+   }
+   p++;
+   i = ((i >> 6) + 1) << 6;
+   }
+   return -1;
+}
+
+#define for_each_cpu(cpu, mask)\
+   for ((cpu) = -1;\
+   (cpu) = cpumask_next((cpu), (mask)),\
+   (cpu) < max_cpu_id;)
+
+static inline int test_bit(int nr, const unsigned long *addr)
+{
+   return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
+}
+
+static inline void set_bit(int nr, const unsigned long *addr)
+{
+   unsigned long mask = BIT_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+
+   *p  |= mask;
+}
+
+static void set_bitmap(int start, int end, const unsigned long *addr)
+{
+   int i;
+
+   for ( i = start; i <= end; i++)
+   set_bit(i, addr);
+}
+
+/* @nr: the position prior to the place to search */
+static unsigned int cpumask_next_hthread(int nr, const unsigned long *mask)
+{
+   int i, start;
+
+   start = (nr/threads_per_cpu +1)*threads_per_cpu;
+   for (i = start; i < max_cpu_id; i += threads_per_cpu) {
+   if (test_bit(i, mask))
+   return i;
+   }
+   return -1;
+}
+
+/* @bitmap: allocated internally
+ * max_idx: the max cpu logical id
+ * return the num of bits in bitmap
+ */
+static int parse_cpu_mask(char *buf, int bz, unsigned long **bitmap,
+   unsigned int *max_idx)
+{
+   int a, b, i, bm_sz;
+   bool range = false;
+   char *s, *p;
+#define TMP_BUF_SIZE 32
+   char 
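
As a standalone sketch of the approach described above, i.e. turning
the "N-M,K" cpulist from /sys/devices/system/cpu/present into a bitmap
(the helper names here are illustrative, not the patch's own code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static void set_bit_range(unsigned long *map, int a, int b)
{
	int i;

	for (i = a; i <= b; i++)
		map[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
}

/* Fills *out with a calloc'd bitmap; returns the max cpu id + 1. */
static int parse_cpu_list(const char *buf, unsigned long **out)
{
	int a, b, n, max = -1;
	char *dup, *tok, *save;

	/* First pass: find the largest id so the bitmap can be sized. */
	dup = strdup(buf);
	for (tok = strtok_r(dup, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		n = sscanf(tok, "%d-%d", &a, &b);
		if (n == 0)
			continue;
		if (n == 1)
			b = a;
		if (b > max)
			max = b;
	}
	free(dup);
	if (max < 0)
		return 0;

	*out = calloc(max / BITS_PER_LONG + 1, sizeof(unsigned long));

	/* Second pass: set one bit per present cpu. */
	dup = strdup(buf);
	for (tok = strtok_r(dup, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		n = sscanf(tok, "%d-%d", &a, &b);
		if (n == 0)
			continue;
		if (n == 1)
			b = a;
		set_bit_range(*out, a, b);
	}
	free(dup);
	return max + 1;
}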

Re: powerpc/64s/radix: Fix missing global invalidations when removing copro

2018-08-02 Thread Michael Ellerman
On Tue, 2018-07-31 at 13:24:52 UTC, Frederic Barrat wrote:
> With the optimizations for TLB invalidation from commit 0cef77c7798a
> ("powerpc/64s/radix: flush remote CPUs out of single-threaded
> mm_cpumask"), the scope of a TLBI (global vs. local) can now be
> influenced by the value of the 'copros' counter of the memory context.
> 
> When calling mm_context_remove_copro(), the 'copros' counter is
> decremented first before flushing. It may have the unintended side
> effect of sending local TLBIs when we explicitly need global
> invalidations in this case. Thus breaking any nMMU user in a bad and
> unpredictable way.
> 
> Fix it by flushing first, before updating the 'copros' counter, so
> that invalidations will be global.
> 
> Fixes: 0cef77c7798a ("powerpc/64s/radix: flush remote CPUs out of 
> single-threaded mm_cpumask")
> Signed-off-by: Frederic Barrat 
> Reviewed-by: Nicholas Piggin 
> Tested-by: Vaibhav Jain 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/cca19f0b684f4ed6aabf6ad07ae3e1

cheers


Re: [PATCH] powerpc/mm: fix always true/false warning in slice.c

2018-08-02 Thread Christophe Leroy




On 06/29/2018 02:55 AM, Michael Ellerman wrote:

Christophe Leroy  writes:


This patch fixes the following warnings (obtained with make W=1).

arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
   if (start < SLICE_LOW_TOP) {


Presumably only on 32-bit ?


Sure




diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 9530c6db406a..17c57760e06c 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -79,7 +86,7 @@ static void slice_range_to_mask(unsigned long start, unsigned 
long len,
- (1u << GET_LOW_SLICE_INDEX(start));
}
  
-	if ((start + len) > SLICE_LOW_TOP) {

+   if (!slice_addr_is_low(end)) {
unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT));
unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - 
start_index;


This worries me.


I'll change it in v2 to:
if (SLICE_NUM_HIGH && !slice_addr_is_low(end)) {




By casting before the comparison in the helper you squash the compiler
warning, but the code is still broken if (start + len) overflows.

Presumably that "never happens", but it just seems fishy.

The other similar check in that file does:

   if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {

Where SLICE_NUM_HIGH == 0 on 32-bit.


Could we fix the less than comparisons with SLICE_LOW_TOP with something
similar, eg:

if (!SLICE_NUM_HIGH || start < SLICE_LOW_TOP) {

ie. limit them to the 64-bit code?


That's not enough to make GCC happy:

arch/powerpc/mm/slice.c: In function ‘slice_range_to_mask’:
arch/powerpc/mm/slice.c:74:31: error: comparison is always true due to 
limited range of data type [-Werror=type-limits]

  if (!SLICE_NUM_HIGH || start < SLICE_LOW_TOP) {

Christophe




cheers
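
To make the overflow concern concrete, here is a tiny standalone
example; the values are illustrative and not the kernel's actual slice
layout:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* On 32-bit, SLICE_LOW_TOP (0x100000000) exceeds unsigned long. */
	uint32_t start = 0xfffff000u;
	uint32_t len   = 0x00002000u;
	uint64_t slice_low_top = 0x100000000ull;

	/* The 32-bit addition wraps before the comparison, so the
	 * "high slice" branch would be skipped incorrectly. */
	if ((uint64_t)(start + len) > slice_low_top)
		printf("range reaches the high slices\n");
	else
		printf("wrapped: start + len = 0x%x\n", start + len);
	return 0;
}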



[PATCH v2] powerpc/mm: fix always true/false warning in slice.c

2018-08-02 Thread Christophe Leroy
This patch fixes the following warnings (obtained with make W=1).

arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
  if (start < SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited 
range of data type [-Werror=type-limits]
  if ((start + len) > SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c: In function 'slice_mask_for_free':
arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
  if (high_limit <= SLICE_LOW_TOP)
 ^
arch/powerpc/mm/slice.c: In function 'slice_check_range_fits':
arch/powerpc/mm/slice.c:185:12: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
  if (start < SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to 
limited range of data type [-Werror=type-limits]
  if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {
   ^
arch/powerpc/mm/slice.c: In function 'slice_scan_available':
arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
  if (addr < SLICE_LOW_TOP) {
   ^
arch/powerpc/mm/slice.c: In function 'get_slice_psize':
arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited 
range of data type [-Werror=type-limits]
  if (addr < SLICE_LOW_TOP) {
   ^

Signed-off-by: Christophe Leroy 
---
 v2: fixed issue reported by Michael

 arch/powerpc/mm/slice.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 9530c6db406a..c774d43e038c 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -62,6 +62,13 @@ static void slice_print_mask(const char *label, const struct 
slice_mask *mask) {
 
 #endif
 
+static inline bool slice_addr_is_low(unsigned long addr)
+{
+   u64 tmp = (u64)addr;
+
+   return tmp < SLICE_LOW_TOP;
+}
+
 static void slice_range_to_mask(unsigned long start, unsigned long len,
struct slice_mask *ret)
 {
@@ -71,7 +78,7 @@ static void slice_range_to_mask(unsigned long start, unsigned 
long len,
if (SLICE_NUM_HIGH)
bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
-   if (start < SLICE_LOW_TOP) {
+   if (slice_addr_is_low(start)) {
unsigned long mend = min(end,
 (unsigned long)(SLICE_LOW_TOP - 1));
 
@@ -79,7 +86,7 @@ static void slice_range_to_mask(unsigned long start, unsigned 
long len,
- (1u << GET_LOW_SLICE_INDEX(start));
}
 
-   if ((start + len) > SLICE_LOW_TOP) {
+   if (SLICE_NUM_HIGH && !slice_addr_is_low(end)) {
unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT));
unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - 
start_index;
@@ -134,7 +141,7 @@ static void slice_mask_for_free(struct mm_struct *mm, 
struct slice_mask *ret,
if (!slice_low_has_vma(mm, i))
ret->low_slices |= 1u << i;
 
-   if (high_limit <= SLICE_LOW_TOP)
+   if (slice_addr_is_low(high_limit - 1))
return;
 
for (i = 0; i < GET_HIGH_SLICE_INDEX(high_limit); i++)
@@ -183,7 +190,7 @@ static bool slice_check_range_fits(struct mm_struct *mm,
unsigned long end = start + len - 1;
u64 low_slices = 0;
 
-   if (start < SLICE_LOW_TOP) {
+   if (slice_addr_is_low(start)) {
unsigned long mend = min(end,
 (unsigned long)(SLICE_LOW_TOP - 1));
 
@@ -193,7 +200,7 @@ static bool slice_check_range_fits(struct mm_struct *mm,
if ((low_slices & available->low_slices) != low_slices)
return false;
 
-   if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {
+   if (SLICE_NUM_HIGH && !slice_addr_is_low(end)) {
unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT));
unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - 
start_index;
@@ -304,7 +311,7 @@ static bool slice_scan_available(unsigned long addr,
 int end, unsigned long *boundary_addr)
 {
unsigned long slice;
-   if (addr < SLICE_LOW_TOP) {
+   if (slice_addr_is_low(addr)) {
slice = GET_LOW_SLICE_INDEX(addr);
*boundary_addr = (slice + end) << SLICE_LOW_SHIFT;
return !!(available->low_slices & (1u << slice));
@@ -707,7 +714,7 @@ unsigned int get_slice_psize(struct mm_struct 

Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig

2018-08-02 Thread Geert Uytterhoeven
Hi Alex,

On Wed, Aug 1, 2018 at 3:16 PM Alex Bounine  wrote:
> On 2018-08-01 05:54 AM, Christoph Hellwig wrote:
> > On Tue, Jul 31, 2018 at 10:29:54AM -0400, Alexei Colin wrote:
> >> Platforms with a PCI bus will be offered the RapidIO menu since they may
> >> want support for a RapidIO PCI device. Platforms without a PCI bus
> >> that might include a RapidIO IP block will need to "select HAS_RAPIDIO"
> >> in the platform-/machine-specific "config ARCH_*" Kconfig entry.
> >>
> >> Tested that kernel builds for arm64 with RapidIO subsystem and
> >> switch drivers enabled, also that the modules load successfully
> >> on a custom AArch64 QEMU model.
> >
> > As said before, please include it from drivers/Kconfig so that _all_
> > architectures supporting PCI (or other Rapidio attachements) get it
> > and not some arbitrary selection of architectures.

+1

> As was replied earlier, this is not a random selection of
> architectures, but only the ones that implement support for RapidIO
> as a system bus. If other architectures choose to adopt RapidIO, we
> will include them as well.
>
> On some platforms RapidIO can be the only system bus available replacing
> PCI/PCIe or RapidIO can coexist with PCIe.
>
> As it is done now, RapidIO is configured in the "Bus Options" (x86/PPC)
> or "Bus Support" (ARM) sub-menus, and as a system configuration option
> it should be kept this way.
>
> The current location of the RAPIDIO configuration option is familiar
> to users of PowerPC and x86 platforms, and is similarly available in
> some ARM manufacturers' kernel code trees.
>
> drivers/Kconfig will be used for configuring drivers for peripheral
> RapidIO devices if/when such device drivers are published.

Everything in drivers/rapidio/Kconfig depends on RAPIDIO (probably it should
use a big if RAPIDIO/endif instead), so it can just be included from
drivers/Kconfig now.

The sooner you do that, the less treewide changes are needed (currently
limited to mips, powerpc, and x86; your patch adds arm64).

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v9 4/4] selftests/powerpc: update strlen() test to test the new assembly function for PPC32

2018-08-02 Thread Christophe Leroy
This patch adds a test for the new assembly strlen() for PPC32.

Signed-off-by: Christophe Leroy 
---
 v9: fixed relevant checkpatch warnings
 v8: removed defines in ppc_asm.h that were added in v6 (not used anymore since 
v7) ; added missing link to strlen_32.S
 v7: reduced the scope to PPC32
 v6: added additional necessary defines in ppc_asm.h
 v5: no change
 v4: new

 tools/testing/selftests/powerpc/stringloops/Makefile| 5 -
 tools/testing/selftests/powerpc/stringloops/asm/cache.h | 3 +++
 tools/testing/selftests/powerpc/stringloops/strlen_32.S | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/cache.h
 create mode 120000 tools/testing/selftests/powerpc/stringloops/strlen_32.S

diff --git a/tools/testing/selftests/powerpc/stringloops/Makefile 
b/tools/testing/selftests/powerpc/stringloops/Makefile
index 779b644461c4..9e510de2c07d 100644
--- a/tools/testing/selftests/powerpc/stringloops/Makefile
+++ b/tools/testing/selftests/powerpc/stringloops/Makefile
@@ -13,9 +13,12 @@ $(OUTPUT)/memcmp_32: CFLAGS += -m32
 $(OUTPUT)/strlen: strlen.c string.o
 $(OUTPUT)/string.o: string.c
 
+$(OUTPUT)/strlen_32: strlen.c
+$(OUTPUT)/strlen_32: CFLAGS += -m32
+
 ASFLAGS = $(CFLAGS)
 
-TEST_GEN_PROGS := memcmp_32 memcmp_64 strlen
+TEST_GEN_PROGS := memcmp_32 memcmp_64 strlen strlen_32
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/cache.h 
b/tools/testing/selftests/powerpc/stringloops/asm/cache.h
new file mode 100644
index 000000000000..38685554b6c1
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/asm/cache.h
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define IFETCH_ALIGN_BYTES 4
diff --git a/tools/testing/selftests/powerpc/stringloops/strlen_32.S 
b/tools/testing/selftests/powerpc/stringloops/strlen_32.S
new file mode 120000
index 000000000000..72b13731b24c
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/strlen_32.S
@@ -0,0 +1 @@
+../../../../../arch/powerpc/lib/strlen_32.S
\ No newline at end of file
-- 
2.13.3



[PATCH v9 3/4] powerpc/lib: implement strlen() in assembly for PPC32

2018-08-02 Thread Christophe Leroy
The generic implementation of strlen() reads strings byte by byte.

This patch implements strlen() in assembly, based on reads of entire
words, in the same spirit as what some other arches and glibc do.

On an 8xx, the time spent in strlen() is reduced by 3/4 for long strings.

strlen() selftest on an 8xx provides the following values:

Before the patch (ie with the generic strlen() in lib/string.c):

len 256 : time = 1.195055
len 016 : time = 0.083745
len 008 : time = 0.046828
len 004 : time = 0.028390

After the patch:

len 256 : time = 0.272185 ==> 78% improvement
len 016 : time = 0.040632 ==> 51% improvement
len 008 : time = 0.033060 ==> 29% improvement
len 004 : time = 0.029149 ==> 2% degradation

On a 832x:

Before the patch:

len 256 : time = 0.236125
len 016 : time = 0.018136
len 008 : time = 0.011000
len 004 : time = 0.007229

After the patch:

len 256 : time = 0.094950 ==> 60% improvement
len 016 : time = 0.013357 ==> 26% improvement
len 008 : time = 0.010586 ==> 4% improvement
len 004 : time = 0.008784

Signed-off-by: Christophe Leroy 
---
Changes in v9 and v8:
 - No change

Changes in v7:
 - Reduced the scope to PPC32
 - Modified the misalignment handling to be branchless and loopless

Changes in v6:
 - Reworked for having branchless conclusion

Changes in v5:
 - Fixed for PPC64 LITTLE ENDIAN

Changes in v4:
 - Added alignment of the loop
 - doing the andc only if still not 0, as it happens only for bytes above 0x7f
which is pretty rare in a string

Changes in v3:
 - Made it common to PPC32 and PPC64

Changes in v2:
 - Moved handling of unaligned strings outside of the main path as it is very 
unlikely.
 - Removed the verification of the fourth byte in case none of the three first 
ones are NUL.

 arch/powerpc/include/asm/string.h |  2 +
 arch/powerpc/lib/Makefile |  2 +-
 arch/powerpc/lib/strlen_32.S  | 78 +++
 3 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/strlen_32.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9b8cedf618f4..1647de15a31e 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -50,6 +50,8 @@ static inline void *memset64(uint64_t *p, uint64_t v, 
__kernel_size_t n)
return __memset64(p, v, n * 8);
 }
 #else
+#define __HAVE_ARCH_STRLEN
+
 extern void *memset16(uint16_t *, uint16_t, __kernel_size_t);
 #endif
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d0ca13ad8231..670286808928 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -12,7 +12,7 @@ CFLAGS_REMOVE_feature-fixups.o = $(CC_FLAGS_FTRACE)
 
 obj-y += string.o alloc.o code-patching.o feature-fixups.o
 
-obj-$(CONFIG_PPC32)+= div64.o copy_32.o crtsavres.o
+obj-$(CONFIG_PPC32)+= div64.o copy_32.o crtsavres.o strlen_32.o
 
 # See corresponding test in arch/powerpc/Makefile
 # 64-bit linker creates .sfpr on demand for final link (vmlinux),
diff --git a/arch/powerpc/lib/strlen_32.S b/arch/powerpc/lib/strlen_32.S
new file mode 100644
index 000000000000..0a8d3f64d493
--- /dev/null
+++ b/arch/powerpc/lib/strlen_32.S
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * strlen() for PPC32
+ *
+ * Copyright (C) 2018 Christophe Leroy CS Systemes d'Information.
+ *
+ * Inspired from glibc implementation
+ */
+#include 
+#include 
+#include 
+
+   .text
+
+/*
+ * Algorithm:
+ *
+ * 1) Given a word 'x', we can test to see if it contains any 0 bytes
+ *by subtracting 0x01010101, and seeing if any of the high bits of each
+ *byte changed from 0 to 1. This works because the least significant
+ *0 byte must have had no incoming carry (otherwise it's not the least
+ *significant), so it is 0x00 - 0x01 == 0xff. For all other
+ *byte values, either they have the high bit set initially, or when
+ *1 is subtracted you get a value in the range 0x00-0x7f, none of which
+ *have their high bit set. The expression here is
+ *    ((x - 0x01010101) & ~x & 0x80808080), which gives 0x00000000 when
+ *there were no 0x00 bytes in the word.  You get 0x80 in bytes that
+ *match, but possibly false 0x80 matches in the next more significant
+ *byte to a true match due to carries.  For little-endian this is
+ *of no consequence since the least significant match is the one
+ *we're interested in, but big-endian needs method 2 to find which
+ *byte matches.
+ * 2) Given a word 'x', we can test to see _which_ byte was zero by
+ *calculating ~(((x & ~0x80808080) - 0x80808080 - 1) | x | ~0x80808080).
+ *This produces 0x80 in each byte that was zero, and 0x00 in all
+ *the other bytes. The '| ~0x80808080' clears the low 7 bits in each
+ *byte, and the '| x' part ensures that bytes with the high bit set
+ *produce 0x00. The addition will carry into the high bit of each byte
+ *iff that byte had one of its low 7 bits set. We can then 
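
Method 1 above can be checked with a few lines of ordinary C; this is
a sketch for verification, not the kernel code itself:

#include <assert.h>
#include <stdint.h>

/* Non-zero iff the 32-bit word x contains at least one 0x00 byte. */
static int word_has_zero_byte(uint32_t x)
{
	return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}

int main(void)
{
	assert(word_has_zero_byte(0x41420043u));  /* contains a 0x00 byte */
	assert(!word_has_zero_byte(0x41424344u)); /* "ABCD": no zero byte */
	assert(!word_has_zero_byte(0xff7f80ffu)); /* high bytes, no zero */
	return 0;
}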

[PATCH v9 2/4] selftests/powerpc: Add test for strlen()

2018-08-02 Thread Christophe Leroy
This patch adds a test for strlen()

string.c contains a copy of strlen() from lib/string.c

The test first tests the correctness of strlen() by comparing
the result with libc strlen(). It tests all cases of alignment.

It then tests the duration of an aligned strlen() on a 4-byte string,
on a 16-byte string and on a 256-byte string.

Signed-off-by: Christophe Leroy 
---
 v9: fixed relevant checkpatch warnings
 v8: no change
 v7: no change
 v6: refactorised the benchmark test
 v5: no change
 v4: new

 .../testing/selftests/powerpc/stringloops/Makefile |   5 +-
 .../testing/selftests/powerpc/stringloops/string.c |  37 ++
 .../testing/selftests/powerpc/stringloops/strlen.c | 131 +
 3 files changed, 172 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/string.c
 create mode 100644 tools/testing/selftests/powerpc/stringloops/strlen.c

diff --git a/tools/testing/selftests/powerpc/stringloops/Makefile 
b/tools/testing/selftests/powerpc/stringloops/Makefile
index 3862256c2b7d..779b644461c4 100644
--- a/tools/testing/selftests/powerpc/stringloops/Makefile
+++ b/tools/testing/selftests/powerpc/stringloops/Makefile
@@ -10,9 +10,12 @@ $(OUTPUT)/memcmp_64: CFLAGS += -m64 -maltivec
 $(OUTPUT)/memcmp_32: memcmp.c
 $(OUTPUT)/memcmp_32: CFLAGS += -m32
 
+$(OUTPUT)/strlen: strlen.c string.o
+$(OUTPUT)/string.o: string.c
+
 ASFLAGS = $(CFLAGS)
 
-TEST_GEN_PROGS := memcmp_32 memcmp_64
+TEST_GEN_PROGS := memcmp_32 memcmp_64 strlen
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/stringloops/string.c 
b/tools/testing/selftests/powerpc/stringloops/string.c
new file mode 100644
index 000000000000..9dac5ef4f52c
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/string.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  linux/lib/string.c
+ *
+ *  Copyright (C) 1991, 1992  Linus Torvalds
+ */
+
+/*
+ * stupid library routines.. The optimized versions should generally be found
+ * as inline code in 
+ *
+ * These are buggy as well..
+ *
+ * * Fri Jun 25 1999, Ingo Oeser 
+ * -  Added strsep() which will replace strtok() soon (because strsep() is
+ *reentrant and should be faster). Use only strsep() in new code, please.
+ *
+ * * Sat Feb 09 2002, Jason Thomas ,
+ *Matthew Hawkins 
+ * -  Kissed strtok() goodbye
+ */
+
+#include 
+
+/**
+ * strlen - Find the length of a string
+ * @s: The string to be sized
+ */
+size_t test_strlen(const char *s)
+{
+   const char *sc;
+
+   for (sc = s; *sc != '\0'; ++sc)
+   ; /* nothing */
+
+   return sc - s;
+}
diff --git a/tools/testing/selftests/powerpc/stringloops/strlen.c 
b/tools/testing/selftests/powerpc/stringloops/strlen.c
new file mode 100644
index 000000000000..7ea6122ecacb
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/strlen.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include "utils.h"
+
+#define SIZE 256
+#define ITERATIONS 1000
+#define ITERATIONS_BENCH 10
+
+int test_strlen(const void *s);
+
+/* test all offsets and lengths */
+static void test_one(char *s)
+{
+   unsigned long offset;
+
+   for (offset = 0; offset < SIZE; offset++) {
+   int x, y;
+   unsigned long i;
+
+   y = strlen(s + offset);
+   x = test_strlen(s + offset);
+
+   if (x == y)
+   continue;
+
+   printf("strlen() returned %d, should have returned %d (%p 
offset %ld)\n",
+  x, y, s, offset);
+
+   for (i = offset; i < SIZE; i++)
+   printf("%02x ", s[i]);
+   printf("\n");
+   }
+}
+
+static void bench_test(char *s)
+{
+   struct timespec ts_start, ts_end;
+   int i;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+   for (i = 0; i < ITERATIONS_BENCH; i++)
+   test_strlen(s);
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_end);
+
+   printf("len %3.3d : time = %.6f\n", test_strlen(s),
+  ts_end.tv_sec - ts_start.tv_sec +
+  (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9);
+}
+
+static int testcase(void)
+{
+   char *s;
+   unsigned long i;
+
+   s = memalign(128, SIZE);
+   if (!s) {
+   perror("memalign");
+   exit(1);
+   }
+
+   srandom(1);
+
+   memset(s, 0, SIZE);
+   for (i = 0; i < SIZE; i++) {
+   char c;
+
+   do {
+   c = random() & 0x7f;
+   } while (!c);
+   s[i] = c;
+   test_one(s);
+   }
+
+   for (i = 0; i < ITERATIONS; i++) {
+   unsigned long j;
+
+   for (j = 0; j < SIZE; j++) {
+   char c;
+
+   do {
+   c = random() & 0x7f;
+   } while (!c);
+   s[j] = c;
+   

[PATCH v9 1/4] selftests/powerpc: add test for 32 bits memcmp

2018-08-02 Thread Christophe Leroy
This patch renames the memcmp test to memcmp_64 and adds
a memcmp_32 test for the 32-bit version of memcmp().

Signed-off-by: Christophe Leroy 
---
 v9: no change
 v8: rebased on latest powerpc/merge
 v7: no change
 v6: no change
 v5: no change
 v4: new

 tools/testing/selftests/powerpc/stringloops/Makefile| 14 +++---
 tools/testing/selftests/powerpc/stringloops/memcmp_32.S |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)
 create mode 120000 tools/testing/selftests/powerpc/stringloops/memcmp_32.S

diff --git a/tools/testing/selftests/powerpc/stringloops/Makefile 
b/tools/testing/selftests/powerpc/stringloops/Makefile
index c60c6172dd3c..3862256c2b7d 100644
--- a/tools/testing/selftests/powerpc/stringloops/Makefile
+++ b/tools/testing/selftests/powerpc/stringloops/Makefile
@@ -1,10 +1,18 @@
 # SPDX-License-Identifier: GPL-2.0
 # The loops are all 64-bit code
-CFLAGS += -m64 -maltivec
 CFLAGS += -I$(CURDIR)
 
-TEST_GEN_PROGS := memcmp
-EXTRA_SOURCES := memcmp_64.S ../harness.c
+EXTRA_SOURCES := ../harness.c
+
+$(OUTPUT)/memcmp_64: memcmp.c
+$(OUTPUT)/memcmp_64: CFLAGS += -m64 -maltivec
+
+$(OUTPUT)/memcmp_32: memcmp.c
+$(OUTPUT)/memcmp_32: CFLAGS += -m32
+
+ASFLAGS = $(CFLAGS)
+
+TEST_GEN_PROGS := memcmp_32 memcmp_64
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp_32.S 
b/tools/testing/selftests/powerpc/stringloops/memcmp_32.S
new file mode 120000
index 000000000000..056f2b3af789
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp_32.S
@@ -0,0 +1 @@
+../../../../../arch/powerpc/lib/memcmp_32.S
\ No newline at end of file
-- 
2.13.3



Re: [GIT PULL 00/21] perf/core improvements and fixes

2018-08-02 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, contains a recently merged
> tip/perf/urgent,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit c2586cfbb905939b79b49a9121fb0a59a5668fd6:
> 
>   Merge remote-tracking branch 'tip/perf/urgent' into perf/core (2018-07-31 
> 09:55:45 -0300)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.19-20180801
> 
> for you to fetch changes up to b912885ab75c7c8aa841c615108afd755d0b97f8:
> 
>   perf trace: Do not require --no-syscalls to suppress strace like output 
> (2018-08-01 16:20:28 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> perf trace: (Arnaldo Carvalho de Melo)
> 
> - Do not require --no-syscalls to suppress strace like output, i.e.
> 
>  # perf trace -e sched:*switch
> 
>   will show just sched:sched_switch events, not strace-like formatted
>   syscall events, use --syscalls to get the previous behaviour.
> 
>   If instead:
> 
>  # perf trace
> 
>   is used, i.e. no events specified, then --syscalls is implied and
>   system wide strace like formatting will be applied to all syscalls.
> 
>   The behaviour when just a syscall subset is used with '-e' is unchanged:
> 
>  # perf trace -e *sleep,sched:*switch
> 
>   will work as before: just the 'nanosleep' syscall will be strace-like
>   formatted plus the sched:sched_switch tracepoint event, system wide.
> 
> - Allow string table generators to use a default header dir, allowing
>   use of them without parameters to see the table it generates on
>   stdout, e.g.:
> 
> $ tools/perf/trace/beauty/kvm_ioctl.sh
> static const char *kvm_ioctl_cmds[] = {
> [0x00] = "GET_API_VERSION",
> [0x01] = "CREATE_VM",
> [0x02] = "GET_MSR_INDEX_LIST",
> [0x03] = "CHECK_EXTENSION",
> 
> [0xe0] = "CREATE_DEVICE",
> [0xe1] = "SET_DEVICE_ATTR",
> [0xe2] = "GET_DEVICE_ATTR",
> [0xe3] = "HAS_DEVICE_ATTR",
> };
> $
> 
>   See 'ls tools/perf/trace/beauty/*.sh' to see the available string
>   table generators.
> 
> - Add a generator for IPPROTO_ socket's protocol constants.
> 
> perf record: (Kan Liang)
> 
> - Fix erroring out while applying initial delay and using LBR, due to
>   the use of a PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY event to track
>   PERF_RECORD_MMAP events while waiting for the initial delay. Such
>   events fail when configured asking PERF_SAMPLE_BRANCH_STACK in
>   perf_event_attr.sample_type.
> 
> perf c2c: (Jiri Olsa)
> 
> - Fix report crash for empty browser, when processing a perf.data file
>   without events of interest, either because not asked for in
>   'perf record' or because the workload didn't triggered such events.
> 
> perf list: (Michael Petlan)
> 
> - Align metric group description format with PMU event description.
> 
> perf tests: (Sandipan Das)
> 
> - Fix indexing when invoking subtests, which caused BPF tests to
>   get results for the next test in the list, with the last one
>   reporting a failure.
> 
> eBPF:
> 
> - Fix installation directory for header files included from eBPF proggies,
>   avoiding clashing with relative paths used to build other software projects
>   such as glibc. (Thomas Richter)
> 
> - Show better message when failing to load an object. (Arnaldo Carvalho de 
> Melo)
> 
> General: (Christophe Leroy)
> 
> - Allow overriding MAX_NR_CPUS at compile time, to make the tooling
>   usable in systems with less memory, in time this has to be changed
>   to properly allocate based on _NPROCESSORS_ONLN.
> 
> Architecture specific:
> 
> - Update arm64's ThunderX2 implementation defined pmu core events (Ganapatrao 
> Kulkarni)
> 
> - Fix complex event name parsing in 'perf test' for PowerPC, where the 
> 'umask' event
>   modifier isn't present. (Sandipan Das)
> 
> CoreSight ARM hardware tracing: (Leo Yan)
> 
> - Fix start tracing packet handling.
> 
> - Support dummy address value for CS_ETM_TRACE_ON packet.
> 
> - Generate branch sample when receiving a CS_ETM_TRACE_ON packet.
> 
> - Generate branch sample for CS_ETM_TRACE_ON packet.
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (9):
>   perf trace beauty: Default header_dir to cwd to work without parms
>   tools include uapi: Grab a copy of linux/in.h
>   perf beauty: Add a generator for IPPROTO_ socket's protocol constants
>   perf trace beauty: Do not print NULL strarray entries
>   perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
>   perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' 
> args
>   perf bpf: Show better message when failing to load an object
>   perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's 

[PATCH v5 3/3] powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected

2018-08-02 Thread Christophe Leroy
If CONFIG_PPC_SPLPAR is not selected, steal_time will always
be zero, so accounting it is pointless.

Signed-off-by: Christophe Leroy 
---
 v5: moved the change to the original location
 v4: removed the check in vtime_account_system(), the compiler removes the code 
regardless.
 v3: new

 arch/powerpc/kernel/time.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3b03731221c1..b0f5cc491f8b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -412,8 +412,10 @@ void vtime_flush(struct task_struct *tsk)
if (acct->gtime)
account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
 
-   if (acct->steal_time)
+   if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
account_steal_time(cputime_to_nsecs(acct->steal_time));
+   acct->steal_time = 0;
+   }
 
if (acct->idle_time)
account_idle_time(cputime_to_nsecs(acct->idle_time));
@@ -433,7 +435,6 @@ void vtime_flush(struct task_struct *tsk)
 
acct->utime = 0;
acct->gtime = 0;
-   acct->steal_time = 0;
acct->idle_time = 0;
acct->stime = 0;
acct->hardirq_time = 0;
-- 
2.13.3



[PATCH v5 2/3] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64

2018-08-02 Thread Christophe Leroy
scaled cputime is only meaningful when the processor has
SPURR and/or PURR, which means only on PPC64.

Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:

Before:
 l F .text  00a8 vtime_delta
0280 g F .text  010c vtime_account_system
038c g F .text  0048 vtime_account_idle

After:
(vtime_delta gets inlined inside the two functions)
01d8 g F .text  00a0 vtime_account_system
0278 g F .text  0038 vtime_account_idle

In terms of performance, we also get approximately 7% improvement on
task switch. The following small benchmark app is run with perf stat:

void *thread(void *arg)
{
int i;

for (i = 0; i < atoi((char*)arg); i++)
pthread_yield();
}

int main(int argc, char **argv)
{
pthread_t th1, th2;

	pthread_create(&th1, NULL, thread, argv[1]);
	pthread_create(&th2, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);

return 0;
}

Before the patch:

 Performance counter stats for 'chrt -f 98 ./sched 10' (50 runs):

   8228.476465  task-clock (msec) #0.954 CPUs utilized  
  ( +-  0.23% )
24  context-switches  #0.024 M/sec  
  ( +-  0.00% )

After the patch:

 Performance counter stats for 'chrt -f 98 ./sched 10' (50 runs):

   7649.070444  task-clock (msec) #0.955 CPUs utilized  
  ( +-  0.27% )
24  context-switches  #0.026 M/sec  
  ( +-  0.00% )

Signed-off-by: Christophe Leroy 
---
 v5: New (previous patch split in two parts)

 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/accounting.h |  4 
 arch/powerpc/include/asm/cputime.h|  1 -
 arch/powerpc/kernel/time.c| 12 ++--
 arch/powerpc/xmon/xmon.c  |  4 
 5 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5eb4d969afbf..1f5ce708a650 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -137,7 +137,7 @@ config PPC
select ARCH_HAS_PMEM_APIif PPC64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
-   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
+   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/accounting.h 
b/arch/powerpc/include/asm/accounting.h
index 3abcf98ed2e0..c607c5d835cc 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -15,8 +15,10 @@ struct cpu_accounting_data {
/* Accumulated cputime values to flush on ticks*/
unsigned long utime;
unsigned long stime;
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long utime_scaled;
unsigned long stime_scaled;
+#endif
unsigned long gtime;
unsigned long hardirq_time;
unsigned long softirq_time;
@@ -25,8 +27,10 @@ struct cpu_accounting_data {
/* Internal counters */
unsigned long starttime;/* TB value snapshot */
unsigned long starttime_user;   /* TB value on exit to usermode */
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long startspurr;   /* SPURR value snapshot */
unsigned long utime_sspurr; /* ->user_time when ->startspurr set */
+#endif
 };
 
 #endif
diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index 133672744b2e..ae73dc8da2d4 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -61,7 +61,6 @@ static inline void arch_vtime_task_switch(struct task_struct 
*prev)
struct cpu_accounting_data *acct0 = get_accounting(prev);
 
acct->starttime = acct0->starttime;
-   acct->startspurr = acct0->startspurr;
 }
 #endif
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 6196bd7393a3..3b03731221c1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -175,7 +175,7 @@ static void calc_cputime_factors(void)
  * Read the SPURR on systems that have it, otherwise the PURR,
  * or if that doesn't exist return the timebase value passed in.
  */
-static unsigned long read_spurr(unsigned long tb)
+static inline unsigned long read_spurr(unsigned long tb)
 {
if (cpu_has_feature(CPU_FTR_SPURR))
return mfspr(SPRN_SPURR);
@@ -284,7 +284,8 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
 static unsigned long vtime_delta_scaled(struct cpu_accounting_data *acct,

[PATCH v5 1/3] powerpc/time: isolate scaled cputime accounting in dedicated functions.

2018-08-02 Thread Christophe Leroy
scaled cputime is only meaningful when the processor has
SPURR and/or PURR, which means only on PPC64.

In preparation for the following patch, which will remove
CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC32, this patch moves
all scaled cputime accounting logic into dedicated functions.

This patch doesn't change any functionality. It's only code
reorganisation.

Signed-off-by: Christophe Leroy 
---
 v5:
  - v4 patch split in two patches. First part isolates scaled cputime accounting
in dedicated functions, second part activates it only on PPC64.
  - read_spurr() kept as a separate function.
 v4:
  - Using the correct symbol CONFIG_ARCH_HAS_SCALED_CPUTIME instead of 
ARCH_HAS_SCALED_CPUTIME
  - Grouped CONFIG_ARCH_HAS_SCALED_CPUTIME related code in dedicated functions 
to reduce the number of #ifdefs
  - Integrated read_spurr() directly into the related function.
 v3: Rebased following modifications in xmon.c
 v2: added ifdefs in xmon to fix compilation error

 arch/powerpc/kernel/time.c | 69 +-
 1 file changed, 43 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 70f145e02487..6196bd7393a3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -281,26 +281,16 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
  * Account time for a transition between system, hard irq
  * or soft irq state.
  */
-static unsigned long vtime_delta(struct task_struct *tsk,
-unsigned long *stime_scaled,
-unsigned long *steal_time)
+static unsigned long vtime_delta_scaled(struct cpu_accounting_data *acct,
+   unsigned long now, unsigned long stime)
 {
-   unsigned long now, nowscaled, deltascaled;
-   unsigned long stime;
+   unsigned long stime_scaled;
+   unsigned long nowscaled, deltascaled;
unsigned long utime, utime_scaled;
-   struct cpu_accounting_data *acct = get_accounting(tsk);
 
-   WARN_ON_ONCE(!irqs_disabled());
-
-   now = mftb();
nowscaled = read_spurr(now);
-   stime = now - acct->starttime;
-   acct->starttime = now;
deltascaled = nowscaled - acct->startspurr;
acct->startspurr = nowscaled;
-
-   *steal_time = calculate_stolen_time(now);
-
utime = acct->utime - acct->utime_sspurr;
acct->utime_sspurr = acct->utime;
 
@@ -314,18 +304,38 @@ static unsigned long vtime_delta(struct task_struct *tsk,
 * the user ticks get saved up in paca->user_time_scaled to be
 * used by account_process_tick.
 */
-   *stime_scaled = stime;
+   stime_scaled = stime;
utime_scaled = utime;
if (deltascaled != stime + utime) {
if (utime) {
-   *stime_scaled = deltascaled * stime / (stime + utime);
-   utime_scaled = deltascaled - *stime_scaled;
+   stime_scaled = deltascaled * stime / (stime + utime);
+   utime_scaled = deltascaled - stime_scaled;
} else {
-   *stime_scaled = deltascaled;
+   stime_scaled = deltascaled;
}
}
acct->utime_scaled += utime_scaled;
 
+   return stime_scaled;
+}
+
+static unsigned long vtime_delta(struct task_struct *tsk,
+unsigned long *stime_scaled,
+unsigned long *steal_time)
+{
+   unsigned long now, stime;
+   struct cpu_accounting_data *acct = get_accounting(tsk);
+
+   WARN_ON_ONCE(!irqs_disabled());
+
+   now = mftb();
+   stime = now - acct->starttime;
+   acct->starttime = now;
+
+   *stime_scaled = vtime_delta_scaled(acct, now, stime);
+
+   *steal_time = calculate_stolen_time(now);
+
return stime;
 }
 
@@ -364,6 +374,19 @@ void vtime_account_idle(struct task_struct *tsk)
acct->idle_time += stime + steal_time;
 }
 
+static void vtime_flush_scaled(struct task_struct *tsk,
+  struct cpu_accounting_data *acct)
+{
+   if (acct->utime_scaled)
+   tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
+   if (acct->stime_scaled)
+   tsk->stimescaled += cputime_to_nsecs(acct->stime_scaled);
+
+   acct->utime_scaled = 0;
+   acct->utime_sspurr = 0;
+   acct->stime_scaled = 0;
+}
+
 /*
  * Account the whole cputime accumulated in the paca
  * Must be called with interrupts disabled.
@@ -378,9 +401,6 @@ void vtime_flush(struct task_struct *tsk)
if (acct->utime)
account_user_time(tsk, cputime_to_nsecs(acct->utime));
 
-   if (acct->utime_scaled)
-   tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
-
if (acct->gtime)
account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
 
@@ -393,8 +413,6 @@ void vtime_flush(struct 

Re: [PATCH v6 5/8] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-02 Thread Nicholas Piggin
On Thu, 2 Aug 2018 10:30:08 +0530
Mahesh Jagannath Salgaonkar  wrote:

> On 08/01/2018 11:28 AM, Nicholas Piggin wrote:
> > On Wed, 04 Jul 2018 23:28:21 +0530
> > Mahesh J Salgaonkar  wrote:
> >   
> >> From: Mahesh Salgaonkar 
> >>
> >> On pseries, as of today the system crashes if we get a machine check
> >> exception due to SLB errors. These are soft errors and can be fixed by
> >> flushing the SLBs so the kernel can continue to function instead of
> >> crashing. We do this in real mode before turning on the MMU; otherwise
> >> we would run into nested machine checks. This patch now fetches the
> >> rtas error log in real mode and flushes the SLBs on SLB errors.
> >>
> >> Signed-off-by: Mahesh Salgaonkar 
> >> ---
> >>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
> >>  arch/powerpc/include/asm/machdep.h|1 
> >>  arch/powerpc/kernel/exceptions-64s.S  |   42 +
> >>  arch/powerpc/kernel/mce.c |   16 +++-
> >>  arch/powerpc/mm/slb.c |6 +++
> >>  arch/powerpc/platforms/pseries/pseries.h  |1 
> >>  arch/powerpc/platforms/pseries/ras.c  |   51 +
> >>  arch/powerpc/platforms/pseries/setup.c|1 
> >>  8 files changed, 116 insertions(+), 3 deletions(-)
> >>
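The key point above is that the early handler runs in real mode,
before the MMU is turned back on, so handling the error cannot itself
trigger a nested machine check. In rough C terms the recovery path
has this shape (a sketch with illustrative names, not the patch's
actual code, except slb_flush_and_rebolt_realmode() which the patch
adds):

/* Sketch: recover from an SLB machine check while still in real mode. */
static int sketch_mce_realmode(struct pt_regs *regs)
{
	struct rtas_error_log *log;

	log = sketch_fetch_rtas_log();		/* hypothetical helper */
	if (log && sketch_is_slb_error(log)) {	/* hypothetical check */
		slb_flush_and_rebolt_realmode();
		return 1;	/* recovered */
	}
	return 0;
}
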
> >> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> index 50ed64fba4ae..cc00a7088cf3 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
> >>  
> >>  extern void slb_initialize(void);
> >>  extern void slb_flush_and_rebolt(void);
> >> +extern void slb_flush_and_rebolt_realmode(void);
> >>  
> >>  extern void slb_vmalloc_update(void);
> >>  extern void slb_set_size(u16 size);
> >> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> >> index ffe7c71e1132..fe447e0d4140 100644
> >> --- a/arch/powerpc/include/asm/machdep.h
> >> +++ b/arch/powerpc/include/asm/machdep.h
> >> @@ -108,6 +108,7 @@ struct machdep_calls {
> >>  
> >>/* Early exception handlers called in realmode */
> >>int (*hmi_exception_early)(struct pt_regs *regs);
> >> +  int (*machine_check_early)(struct pt_regs *regs);
> >>  
> >>/* Called during machine check exception to retrive fixup address. */
> >>bool(*mce_check_early_recovery)(struct pt_regs *regs);
> >> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> >> index f283958129f2..0038596b7906 100644
> >> --- a/arch/powerpc/kernel/exceptions-64s.S
> >> +++ b/arch/powerpc/kernel/exceptions-64s.S
> >> @@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
> >>  machine_check_fwnmi:
> >>SET_SCRATCH0(r13)   /* save r13 */
> >>EXCEPTION_PROLOG_0(PACA_EXMC)
> >> +BEGIN_FTR_SECTION
> >> +  b   machine_check_pSeries_early
> >> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
> >>  machine_check_pSeries_0:
> >>EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
> >>/*
> >> @@ -343,6 +346,45 @@ machine_check_pSeries_0:
> >>  
> >>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
> >>  
> >> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
> >> +BEGIN_FTR_SECTION
> >> +  EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> >> +  mr  r10,r1  /* Save r1 */
> >> +  ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
> >> +  subi r1,r1,INT_FRAME_SIZE   /* alloc stack frame */
> >> +  mfspr   r11,SPRN_SRR0   /* Save SRR0 */
> >> +  mfspr   r12,SPRN_SRR1   /* Save SRR1 */
> >> +  EXCEPTION_PROLOG_COMMON_1()
> >> +  EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
> >> +  EXCEPTION_PROLOG_COMMON_3(0x200)
> >> +  addi r3,r1,STACK_FRAME_OVERHEAD
> >> +  BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
> >> +
> >> +  /* Move original SRR0 and SRR1 into the respective regs */
> >> +  ld  r9,_MSR(r1)
> >> +  mtspr   SPRN_SRR1,r9
> >> +  ld  r3,_NIP(r1)
> >> +  mtspr   SPRN_SRR0,r3
> >> +  ld  r9,_CTR(r1)
> >> +  mtctr   r9
> >> +  ld  r9,_XER(r1)
> >> +  mtxer   r9
> >> +  ld  r9,_LINK(r1)
> >> +  mtlr r9
> >> +  REST_GPR(0, r1)
> >> +  REST_8GPRS(2, r1)
> >> +  REST_GPR(10, r1)
> >> +  ld  r11,_CCR(r1)
> >> +  mtcr r11
> >> +  REST_GPR(11, r1)
> >> +  REST_2GPRS(12, r1)
> >> +  /* restore original r1. */
> >> +  ld  r1,GPR1(r1)
> >> +  SET_SCRATCH0(r13)   /* save r13 */
> >> +  EXCEPTION_PROLOG_0(PACA_EXMC)
> >> +  b   machine_check_pSeries_0
> >> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
> >> +
> >>  EXC_COMMON_BEGIN(machine_check_common)
> >>/*
> >> * Machine check is different because we use a different
> >> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> >> index efdd16a79075..221271c96a57 100644
> >> --- a/arch/powerpc/kernel/mce.c

Re: linux-next: manual merge of the powerpc tree with the m68k tree

2018-08-02 Thread Geert Uytterhoeven
Hi Stephen,

On Thu, Aug 2, 2018 at 1:42 AM Stephen Rothwell  wrote:
> [forgot the conflict resolution ...]
>
> On Thu, 2 Aug 2018 09:27:20 +1000 Stephen Rothwell wrote:
> >
> > Today's linux-next merge of the powerpc tree got a conflict in:
> >
> >   arch/m68k/mac/misc.c
> >
> > between commit:
> >
> >   5b9bfb8ec467 ("m68k: mac: Use time64_t in RTC handling")
> >
> > from the m68k tree and commit:
> >
> >   ebd722275f9c ("macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver")
> >
> > from the powerpc tree.
> >
> > I fixed it up (see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging.  You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.

Ah, now I remember Finn said he was going to rebase his series once the time64_t
patch has entered my tree...

> --- a/arch/m68k/mac/misc.c
> +++ b/arch/m68k/mac/misc.c
> @@@ -90,11 -85,11 +90,11 @@@ static void cuda_write_pram(int offset
>   }
>   #endif /* CONFIG_ADB_CUDA */
>
> - #ifdef CONFIG_ADB_PMU68K
> + #ifdef CONFIG_ADB_PMU
>  -static long pmu_read_time(void)
>  +static time64_t pmu_read_time(void)
>   {
> struct adb_request req;
>  -  long time;
>  +  time64_t time;
>
> if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0)
> return 0;
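
The two marker columns in the hunk above are combined-diff notation:
the resolution keeps both sides, i.e. the CONFIG_ADB_PMU symbol from
the powerpc tree and the time64_t type from the m68k tree.
Schematically, the resolved function starts out like this (a sketch
of the merge result under those two changes, not the verbatim file):

#ifdef CONFIG_ADB_PMU	/* was CONFIG_ADB_PMU68K before the powerpc tree change */
static time64_t pmu_read_time(void)	/* long -> time64_t from the m68k tree */
{
	struct adb_request req;
	time64_t time;	/* 64-bit, so the RTC value cannot wrap in 2038 */

	if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0)
		return 0;
	/* ... remainder of the function is untouched by the conflict ... */
}
#endif /* CONFIG_ADB_PMU */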

Thanks, looks good to me!

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds