Re: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-02 Thread Nathan Chancellor
On Thu, Aug 29, 2019 at 09:59:48AM +, David Laight wrote:
> From: Nathan Chancellor
> > Sent: 28 August 2019 19:45
> ...
> > However, I think that -fno-builtin-* would be appropriate here because
> > we are providing our own setjmp implementation, meaning clang should not
> > be trying to do anything with the builtin implementation like building a
> > declaration for it.
> 
> Isn't implementing setjmp impossible unless you tell the compiler that
> you function is 'setjmp-like' ?

No idea, PowerPC is the only architecture that does such a thing.

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/tree/arch/powerpc/kernel/misc.S#n43

Goes back all the way to before git history (all the way to ppc64's
addition actually):

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=61542216fa90397a2e70c46583edf26bc81994df

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/arch/ppc64/xmon/setjmp.c?id=5f12b0bff93831620218e8ed3970903ecb7861ce

I would just like this warning fixed given that PowerPC builds with
-Werror by default so it is causing a build failure in our CI.
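For context, if the -fno-builtin route is taken, the fix would presumably be a flag tweak in the Makefile that builds the setjmp users. A hypothetical sketch (the file placement and object name here are assumptions, not the merged fix):

```make
# arch/powerpc/xmon/Makefile (hypothetical placement)
# setjmp/longjmp here are the kernel's own implementations, not the C
# library builtins, so ask clang not to apply its builtin handling
# (and the associated declaration warnings) to them.
CFLAGS_xmon.o += -fno-builtin-setjmp -fno-builtin-longjmp
```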

Cheers,
Nathan


[PATCH v2 6/6] powerpc: Don't flush caches when adding memory

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

This operation takes a significant amount of time when hotplugging
large amounts of memory (~50 seconds with 890GB of persistent memory).

This was originally added in commit fb5924fddf9e
("powerpc/mm: Flush cache on memory hot(un)plug") to support memtrace,
but the flush on add is not needed as it is flushed on remove.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/mm/mem.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 854aaea2c6ae..2a14b5b93e19 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -111,7 +111,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   u64 i;
int rc;
 
resize_hpt_for_hotplug(memblock_phys_mem_size());
@@ -124,12 +123,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
return -EFAULT;
}
 
-   for (i = 0; i < size; i += FLUSH_CHUNK_SIZE) {
-   flush_dcache_range(start + i,
-  min(start + size, start + i + FLUSH_CHUNK_SIZE));
-   cond_resched();
-   }
-
return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
-- 
2.21.0



[PATCH v2 5/6] powerpc: Remove 'extern' from func prototypes in cache headers

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

The 'extern' keyword adds no value to function prototypes.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/cache.h  | 8 
 arch/powerpc/include/asm/cacheflush.h | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index 91c808c6738b..54fffdf5a6ec 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -102,10 +102,10 @@ static inline u32 l1_icache_bytes(void)
 #define __read_mostly __attribute__((__section__(".data..read_mostly")))
 
 #ifdef CONFIG_PPC_BOOK3S_32
-extern long _get_L2CR(void);
-extern long _get_L3CR(void);
-extern void _set_L2CR(unsigned long);
-extern void _set_L3CR(unsigned long);
+long _get_L2CR(void);
+long _get_L3CR(void);
+void _set_L2CR(unsigned long val);
+void _set_L3CR(unsigned long val);
 #else
 #define _get_L2CR()0L
 #define _get_L3CR()0L
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index 4a1c9f0200e1..fa10dc19206c 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -38,15 +38,15 @@ static inline void flush_cache_vmap(unsigned long start, unsigned long end) { }
 #endif
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-extern void flush_dcache_page(struct page *page);
+void flush_dcache_page(struct page *page);
 #define flush_dcache_mmap_lock(mapping)do { } while (0)
 #define flush_dcache_mmap_unlock(mapping)  do { } while (0)
 
 void flush_icache_range(unsigned long start, unsigned long stop);
-extern void flush_icache_user_range(struct vm_area_struct *vma,
+void flush_icache_user_range(struct vm_area_struct *vma,
struct page *page, unsigned long addr,
int len);
-extern void flush_dcache_icache_page(struct page *page);
+void flush_dcache_icache_page(struct page *page);
 void __flush_dcache_icache(void *page);
 
 /**
-- 
2.21.0



[PATCH v2 4/6] powerpc: Chunk calls to flush_dcache_range in arch_*_memory

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

When presented with large amounts of memory being hotplugged
(in my test case, ~890GB), the call to flush_dcache_range takes
a while (~50 seconds), triggering RCU stalls.

This patch breaks up the call into 1GB chunks, calling
cond_resched() in between to allow the scheduler to run.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/mm/mem.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index cd540123874d..854aaea2c6ae 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -104,11 +104,14 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
return -ENODEV;
 }
 
+#define FLUSH_CHUNK_SIZE SZ_1G
+
 int __ref arch_add_memory(int nid, u64 start, u64 size,
struct mhp_restrictions *restrictions)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+   u64 i;
int rc;
 
resize_hpt_for_hotplug(memblock_phys_mem_size());
@@ -120,7 +123,12 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
start, start + size, rc);
return -EFAULT;
}
-   flush_dcache_range(start, start + size);
+
+   for (i = 0; i < size; i += FLUSH_CHUNK_SIZE) {
+   flush_dcache_range(start + i,
+  min(start + size, start + i + FLUSH_CHUNK_SIZE));
+   cond_resched();
+   }
 
return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
@@ -131,13 +139,19 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
+   u64 i;
int ret;
 
__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
 
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
-   flush_dcache_range(start, start + size);
+   for (i = 0; i < size; i += FLUSH_CHUNK_SIZE) {
+   flush_dcache_range(start + i,
+  min(start + size, start + i + FLUSH_CHUNK_SIZE));
+   cond_resched();
+   }
+
ret = remove_section_mapping(start, start + size);
WARN_ON_ONCE(ret);
 
-- 
2.21.0



[PATCH v2 3/6] powerpc: Convert flush_icache_range & friends to C

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

Similar to commit 22e9c88d486a
("powerpc/64: reuse PPC32 static inline flush_dcache_range()")
this patch converts the following ASM symbols to C:
flush_icache_range()
__flush_dcache_icache()
__flush_dcache_icache_phys()

This was done as we discovered a long-standing bug where the length of the
range was truncated due to using a 32 bit shift instead of a 64 bit one.

Converting these functions to C also makes them easier to maintain.

flush_dcache_icache_phys() retains a critical assembler section as we must
ensure there are no memory accesses while the data MMU is disabled
(authored by Christophe Leroy). Since this has no external callers, it has
also been made static, allowing the compiler to inline it within
flush_dcache_icache_page().

Signed-off-by: Alastair D'Silva 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/cache.h  |  26 ++---
 arch/powerpc/include/asm/cacheflush.h |  24 ++--
 arch/powerpc/kernel/misc_32.S | 117 
 arch/powerpc/kernel/misc_64.S | 102 -
 arch/powerpc/mm/mem.c | 152 +-
 5 files changed, 173 insertions(+), 248 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index f852d5cd746c..91c808c6738b 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -98,20 +98,7 @@ static inline u32 l1_icache_bytes(void)
 #endif
 #endif /* ! __ASSEMBLY__ */
 
-#if defined(__ASSEMBLY__)
-/*
- * For a snooping icache, we still need a dummy icbi to purge all the
- * prefetched instructions from the ifetch buffers. We also need a sync
- * before the icbi to order the the actual stores to memory that might
- * have modified instructions with the icbi.
- */
-#define PURGE_PREFETCHED_INS   \
-   sync;   \
-   icbi0,r3;   \
-   sync;   \
-   isync
-
-#else
+#if !defined(__ASSEMBLY__)
 #define __read_mostly __attribute__((__section__(".data..read_mostly")))
 
 #ifdef CONFIG_PPC_BOOK3S_32
@@ -145,6 +132,17 @@ static inline void dcbst(void *addr)
 {
__asm__ __volatile__ ("dcbst %y0" : : "Z"(*(u8 *)addr) : "memory");
 }
+
+static inline void icbi(void *addr)
+{
+   __asm__ __volatile__ ("icbi 0, %0" : : "r"(addr) : "memory");
+}
+
+static inline void iccci(void *addr)
+{
+   __asm__ __volatile__ ("iccci 0, %0" : : "r"(addr) : "memory");
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_CACHE_H */
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index ed57843ef452..4a1c9f0200e1 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -42,24 +42,20 @@ extern void flush_dcache_page(struct page *page);
 #define flush_dcache_mmap_lock(mapping)do { } while (0)
 #define flush_dcache_mmap_unlock(mapping)  do { } while (0)
 
-extern void flush_icache_range(unsigned long, unsigned long);
+void flush_icache_range(unsigned long start, unsigned long stop);
 extern void flush_icache_user_range(struct vm_area_struct *vma,
struct page *page, unsigned long addr,
int len);
-extern void __flush_dcache_icache(void *page_va);
 extern void flush_dcache_icache_page(struct page *page);
-#if defined(CONFIG_PPC32) && !defined(CONFIG_BOOKE)
-extern void __flush_dcache_icache_phys(unsigned long physaddr);
-#else
-static inline void __flush_dcache_icache_phys(unsigned long physaddr)
-{
-   BUG();
-}
-#endif
-
-/*
- * Write any modified data cache blocks out to memory and invalidate them.
- * Does not invalidate the corresponding instruction cache blocks.
+void __flush_dcache_icache(void *page);
+
+/**
+ * flush_dcache_range(): Write any modified data cache blocks out to memory and
+ * invalidate them. Does not invalidate the corresponding instruction cache
+ * blocks.
+ *
+ * @start: the start address
+ * @stop: the stop address (exclusive)
  */
 static inline void flush_dcache_range(unsigned long start, unsigned long stop)
 {
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index fe4bd321730e..12b95e6799d4 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -318,123 +318,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_UNIFIED_ID_CACHE)
 EXPORT_SYMBOL(flush_instruction_cache)
 #endif /* CONFIG_PPC_8xx */
 
-/*
- * Write any modified data cache blocks out to memory
- * and invalidate the corresponding instruction cache blocks.
- * This is a no-op on the 601.
- *
- * flush_icache_range(unsigned long start, unsigned long stop)
- */
-_GLOBAL(flush_icache_range)
-BEGIN_FTR_SECTION
-   PURGE_PREFETCHED_INS
-   blr /* for 601, do nothing */
-END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
-   rlwinm  r3,r3,0,0,31 - L1_CACHE_SHIFT
-   subfr4,r3,r4
-   

[PATCH v2 2/6] powerpc: define helpers to get L1 icache sizes

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch adds helpers to retrieve icache sizes, and renames the existing
helpers to make it clear that they are for dcache.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/cache.h  | 29 +++
 arch/powerpc/include/asm/cacheflush.h | 12 +--
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index b3388d95f451..f852d5cd746c 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -55,25 +55,46 @@ struct ppc64_caches {
 
 extern struct ppc64_caches ppc64_caches;
 
-static inline u32 l1_cache_shift(void)
+static inline u32 l1_dcache_shift(void)
 {
return ppc64_caches.l1d.log_block_size;
 }
 
-static inline u32 l1_cache_bytes(void)
+static inline u32 l1_dcache_bytes(void)
 {
return ppc64_caches.l1d.block_size;
 }
+
+static inline u32 l1_icache_shift(void)
+{
+   return ppc64_caches.l1i.log_block_size;
+}
+
+static inline u32 l1_icache_bytes(void)
+{
+   return ppc64_caches.l1i.block_size;
+}
 #else
-static inline u32 l1_cache_shift(void)
+static inline u32 l1_dcache_shift(void)
 {
return L1_CACHE_SHIFT;
 }
 
-static inline u32 l1_cache_bytes(void)
+static inline u32 l1_dcache_bytes(void)
 {
return L1_CACHE_BYTES;
 }
+
+static inline u32 l1_icache_shift(void)
+{
+   return L1_CACHE_SHIFT;
+}
+
+static inline u32 l1_icache_bytes(void)
+{
+   return L1_CACHE_BYTES;
+}
+
 #endif
 #endif /* ! __ASSEMBLY__ */
 
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index eef388f2659f..ed57843ef452 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -63,8 +63,8 @@ static inline void __flush_dcache_icache_phys(unsigned long physaddr)
  */
 static inline void flush_dcache_range(unsigned long start, unsigned long stop)
 {
-   unsigned long shift = l1_cache_shift();
-   unsigned long bytes = l1_cache_bytes();
+   unsigned long shift = l1_dcache_shift();
+   unsigned long bytes = l1_dcache_bytes();
void *addr = (void *)(start & ~(bytes - 1));
unsigned long size = stop - (unsigned long)addr + (bytes - 1);
unsigned long i;
@@ -89,8 +89,8 @@ static inline void flush_dcache_range(unsigned long start, unsigned long stop)
  */
 static inline void clean_dcache_range(unsigned long start, unsigned long stop)
 {
-   unsigned long shift = l1_cache_shift();
-   unsigned long bytes = l1_cache_bytes();
+   unsigned long shift = l1_dcache_shift();
+   unsigned long bytes = l1_dcache_bytes();
void *addr = (void *)(start & ~(bytes - 1));
unsigned long size = stop - (unsigned long)addr + (bytes - 1);
unsigned long i;
@@ -108,8 +108,8 @@ static inline void clean_dcache_range(unsigned long start, unsigned long stop)
 static inline void invalidate_dcache_range(unsigned long start,
   unsigned long stop)
 {
-   unsigned long shift = l1_cache_shift();
-   unsigned long bytes = l1_cache_bytes();
+   unsigned long shift = l1_dcache_shift();
+   unsigned long bytes = l1_dcache_bytes();
void *addr = (void *)(start & ~(bytes - 1));
unsigned long size = stop - (unsigned long)addr + (bytes - 1);
unsigned long i;
-- 
2.21.0



[PATCH v2 1/6] powerpc: Allow flush_icache_range to work across ranges >4GB

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

When calling flush_icache_range with a size >4GB, we were masking
off the upper 32 bits, so we would incorrectly flush a range smaller
than intended.

This patch replaces the 32 bit shifts with 64 bit ones, so that
the full size is accounted for.

Signed-off-by: Alastair D'Silva 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/misc_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index b55a7b4cb543..9bc0aa9aeb65 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -82,7 +82,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
subfr8,r6,r4/* compute length */
add r8,r8,r5/* ensure we get enough */
lwz r9,DCACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of cache block size */
-   srw.r8,r8,r9/* compute line count */
+   srd.r8,r8,r9/* compute line count */
beqlr   /* nothing to do? */
mtctr   r8
 1: dcbst   0,r6
@@ -98,7 +98,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
subfr8,r6,r4/* compute length */
add r8,r8,r5
lwz r9,ICACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of Icache block size */
-   srw.r8,r8,r9/* compute line count */
+   srd.r8,r8,r9/* compute line count */
beqlr   /* nothing to do? */
mtctr   r8
 2: icbi0,r6
-- 
2.21.0



[PATCH v2 0/6] powerpc: convert cache asm to C

2019-09-02 Thread Alastair D'Silva
From: Alastair D'Silva 

This series addresses a few issues discovered in how we flush caches:
1. Flushes were truncated at 4GB, so larger flushes were incorrect.
2. Flushing the dcache in arch_add_memory was unnecessary

This series also converts much of the cache assembler to C, with the
aim of making it easier to maintain.

Alastair D'Silva (6):
  powerpc: Allow flush_icache_range to work across ranges >4GB
  powerpc: define helpers to get L1 icache sizes
  powerpc: Convert flush_icache_range & friends to C
  powerpc: Chunk calls to flush_dcache_range in arch_*_memory
  powerpc: Remove 'extern' from func prototypes in cache headers
  powerpc: Don't flush caches when adding memory

Changelog:
 V2:
 - Replace C implementation of flush_dcache_icache_phys() with
   inline assembler authored by Christophe Leroy
 - Add memory clobbers for iccci implementation
 - Give __flush_dcache_icache a real implementation, it can't
   just be a wrapper around flush_icache_range()
 - Remove PPC64_CACHES from misc_64.S
 - Replace code duplicating clean_dcache_range() in
   flush_icache_range() with a call to clean_dcache_range()
 - Replace #ifdef CONFIG_44x with IS_ENABLED(...) in
   flush_icache_cange()
 - Use 1GB chunks instead of 16GB in arch_*_memory


 arch/powerpc/include/asm/cache.h  |  63 ++
 arch/powerpc/include/asm/cacheflush.h |  37 +++---
 arch/powerpc/kernel/misc_32.S | 117 ---
 arch/powerpc/kernel/misc_64.S | 102 -
 arch/powerpc/mm/mem.c | 159 +-
 5 files changed, 213 insertions(+), 265 deletions(-)

-- 
2.21.0



Re: [PATCH v7 5/6] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2019-09-02 Thread Christophe Leroy




On 09/02/2019 11:53 PM, Michael Ellerman wrote:

Segher Boessenkool  writes:

On Mon, Sep 02, 2019 at 12:03:12PM +1000, Michael Ellerman wrote:

Michal Suchanek  writes:

On bigendian ppc64 it is common to have 32bit legacy binaries but much
less so on littleendian.


I think the toolchain people will tell you that there is no 32-bit
little endian ABI defined at all, if anything works it's by accident.

 ^
 v2


There of course is a lot of powerpcle-* support.  The ABI used for it
on linux is the SYSV ABI, just like on BE 32-bit.


I was talking about ELFv2, which is 64-bit only. But that was based on
me thinking we had a hard assumption in the kernel that ppc64le kernels
always expect ELFv2 userland. Looking at the code though I was wrong
about that, it looks like we will run little endian ELFv1 binaries,
though I don't think anyone is testing it.


There also is specific powerpcle-linux support in GCC, and in binutils,
too.  Also, config.guess/config.sub supports it.  Half a year ago this
all built fine (no, I don't test it often either).

I don't think glibc supports it though, so I wonder if anyone builds an
actual system with it?  Maybe busybox or the like?


So I think we should not make this selectable, unless someone puts their
hand up to say they want it and are willing to test it and keep it
working.


What about actual 32-bit LE systems?  Does anyone still use those?


Not that I've ever heard of.



We dropped support for 32-bit LE at least with a1f3ae3fe8a1 
("powerpc/32: Use stmw/lmw for registers save/restore in asm").


Discussion about it can be found at 
https://patchwork.ozlabs.org/patch/899465/


Christophe


[PATCH 2/3] powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts

2019-09-02 Thread Michael Neuling
From: Gustavo Romero 

When in userspace and MSR FP=0 the hardware FP state is unrelated to
the current process. This is extended for transactions where if tbegin
is run with FP=0, the hardware checkpoint FP state will also be
unrelated to the current process. Due to this, we need to ensure this
hardware checkpoint is updated with the correct state before we enable
FP for this process.

Unfortunately we get this wrong when returning to a process from a
hardware interrupt. A process that starts a transaction with FP=0 can
take an interrupt. When the kernel returns back to that process, we
change to FP=1 but with hardware checkpoint FP state not updated. If
this transaction is then rolled back, the FP registers now contain the
wrong state.

The process looks like this:
   Userspace:  Kernel

   Start userspace
with MSR FP=0 TM=1
  < -----------------
   ...
   tbegin
   bne
   Hardware interrupt
 ----------------- >


ret_from_except
  restore_math()
/* sees FP=0 */
restore_fp()
  tm_active_with_fp()
/* sees FP=1 (Incorrect) */
  load_fp_state()
FP = 0 -> 1
  < -----------------
   Return to userspace
 with MSR TM=1 FP=1
 with junk in the FP TM checkpoint
   TM rollback
   reads FP junk

When returning from the hardware exception, tm_active_with_fp() is
incorrectly making restore_fp() call load_fp_state() which is setting
FP=1.

The fix is to remove tm_active_with_fp().

tm_active_with_fp() is attempting to handle the case where FP state
has been changed inside a transaction. In this case the checkpointed
and transactional FP state is different and hence we must restore the
FP state (ie. we can't do lazy FP restore inside a transaction that's
used FP). It's safe to remove tm_active_with_fp() as this case is
handled by restore_tm_state(). restore_tm_state() detects if FP has
been used inside a transaction and will set load_fp and call
restore_math() to ensure the FP state (checkpoint and transaction) is
restored.

This is a data integrity problem for the current process as the FP
registers are corrupted. It's also a security problem as the FP
registers from one process may be leaked to another.

Similarly for VMX.

A simple testcase to replicate this will be posted to
tools/testing/selftests/powerpc/tm/tm-poison.c

This fixes CVE-2019-15031.

Fixes: a7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed")
Cc: sta...@vger.kernel.org # 4.15+
Signed-off-by: Gustavo Romero 
Signed-off-by: Michael Neuling 
---
 arch/powerpc/kernel/process.c | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 437b57068c..7a84c9f177 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -101,21 +101,8 @@ static void check_if_tm_restore_required(struct task_struct *tsk)
}
 }
 
-static bool tm_active_with_fp(struct task_struct *tsk)
-{
-   return MSR_TM_ACTIVE(tsk->thread.regs->msr) &&
-   (tsk->thread.ckpt_regs.msr & MSR_FP);
-}
-
-static bool tm_active_with_altivec(struct task_struct *tsk)
-{
-   return MSR_TM_ACTIVE(tsk->thread.regs->msr) &&
-   (tsk->thread.ckpt_regs.msr & MSR_VEC);
-}
 #else
 static inline void check_if_tm_restore_required(struct task_struct *tsk) { }
-static inline bool tm_active_with_fp(struct task_struct *tsk) { return false; }
-static inline bool tm_active_with_altivec(struct task_struct *tsk) { return false; }
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
 bool strict_msr_control;
@@ -252,7 +239,7 @@ EXPORT_SYMBOL(enable_kernel_fp);
 
 static int restore_fp(struct task_struct *tsk)
 {
-   if (tsk->thread.load_fp || tm_active_with_fp(tsk)) {
+   if (tsk->thread.load_fp) {
load_fp_state(&current->thread.fp_state);
current->thread.load_fp++;
return 1;
@@ -334,8 +321,7 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread);
 
 static int restore_altivec(struct task_struct *tsk)
 {
-   if (cpu_has_feature(CPU_FTR_ALTIVEC) &&
-   (tsk->thread.load_vec || tm_active_with_altivec(tsk))) {
+   if (cpu_has_feature(CPU_FTR_ALTIVEC) && (tsk->thread.load_vec)) {
load_vr_state(&tsk->thread.vr_state);
tsk->thread.used_vr = 1;
tsk->thread.load_vec++;
-- 
2.21.0



[PATCH 1/3] powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction

2019-09-02 Thread Michael Neuling
From: Gustavo Romero 

When we take an FP unavailable exception in a transaction we have to
account for the hardware FP TM checkpointed registers being
incorrect. In this case for this process we know the current and
checkpointed FP registers must be the same (since FP wasn't used
inside the transaction) hence in the thread_struct we copy the current
FP registers to the checkpointed ones.

This copy is done in tm_reclaim_thread(). We use thread->ckpt_regs.msr
to determine if FP was on when in userspace. thread->ckpt_regs.msr
represents the state of the MSR when exiting userspace. This is setup
by check_if_tm_restore_required().

Unfortunately there is an optimisation in giveup_all() which returns
early if tsk->thread.regs->msr (via local variable `usermsr`) has
FP=VEC=VSX=SPE=0. This optimisation means that
check_if_tm_restore_required() is not called and hence
thread->ckpt_regs.msr is not updated and will contain an old value.

This can happen if due to load_fp=255 we start a userspace process
with MSR FP=1 and then we are context switched out. In this case
thread->ckpt_regs.msr will contain FP=1. If that same process is then
context switched in and load_fp overflows, MSR will have FP=0. If that
process now enters a transaction and does an FP instruction, the FP
unavailable will not update thread->ckpt_regs.msr (the bug) and MSR
FP=1 will be retained in thread->ckpt_regs.msr.  tm_reclaim_thread()
will then not perform the required memcpy and the checkpointed FP regs
in the thread struct will contain the wrong values.

The code path for this happening is:

   Userspace:  Kernel
   Start userspace
with MSR FP/VEC/VSX/SPE=0 TM=1
  < -----------------
   ...
   tbegin
   bne
   fp instruction
   FP unavailable
 ----------------- >
fp_unavailable_tm()
  tm_reclaim_current()
tm_reclaim_thread()
  giveup_all()
return early since FP/VMX/VSX=0
/* ckpt MSR not updated (Incorrect) */
  tm_reclaim()
/* thread_struct ckpt FP regs contain junk (OK) */
  /* Sees ckpt MSR FP=1 (Incorrect) */
  no memcpy() performed
/* thread_struct ckpt FP regs not fixed (Incorrect) */
  tm_recheckpoint()
 /* Put junk in hardware checkpoint FP regs */
 
  < -----------------
   Return to userspace
 with MSR TM=1 FP=1
 with junk in the FP TM checkpoint
   TM rollback
   reads FP junk

This is a data integrity problem for the current process as the FP
registers are corrupted. It's also a security problem as the FP
registers from one process may be leaked to another.

This patch moves up check_if_tm_restore_required() in giveup_all() to
ensure thread->ckpt_regs.msr is updated correctly.

A simple testcase to replicate this will be posted to
tools/testing/selftests/powerpc/tm/tm-poison.c

Similarly for VMX.

This fixes CVE-2019-15030.

Fixes: f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption")
Cc: sta...@vger.kernel.org # 4.12+
Signed-off-by: Gustavo Romero 
Signed-off-by: Michael Neuling 
---
 arch/powerpc/kernel/process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 8fc4de0d22..437b57068c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -497,13 +497,14 @@ void giveup_all(struct task_struct *tsk)
if (!tsk->thread.regs)
return;
 
+   check_if_tm_restore_required(tsk);
+
usermsr = tsk->thread.regs->msr;
 
if ((usermsr & msr_all_available) == 0)
return;
 
msr_check_and_set(msr_all_available);
-   check_if_tm_restore_required(tsk);
 
WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC)));
 
-- 
2.21.0



[PATCH 3/3] powerpc/tm: Add tm-poison test

2019-09-02 Thread Michael Neuling
From: Gustavo Romero 

Add TM selftest to check if FP or VEC register values from one process
can leak into another process when both run on the same CPU.

This tests for CVE-2019-15030 and CVE-2019-15031.

Signed-off-by: Gustavo Romero 
Signed-off-by: Michael Neuling 
---
 tools/testing/selftests/powerpc/tm/.gitignore |   1 +
 tools/testing/selftests/powerpc/tm/Makefile   |   2 +-
 .../testing/selftests/powerpc/tm/tm-poison.c  | 180 ++
 3 files changed, 182 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-poison.c

diff --git a/tools/testing/selftests/powerpc/tm/.gitignore b/tools/testing/selftests/powerpc/tm/.gitignore
index 951fe855f7..98f2708d86 100644
--- a/tools/testing/selftests/powerpc/tm/.gitignore
+++ b/tools/testing/selftests/powerpc/tm/.gitignore
@@ -17,3 +17,4 @@ tm-vmx-unavail
 tm-unavailable
 tm-trap
 tm-sigreturn
+tm-poison
diff --git a/tools/testing/selftests/powerpc/tm/Makefile b/tools/testing/selftests/powerpc/tm/Makefile
index c0734ed0ef..b15a1a325b 100644
--- a/tools/testing/selftests/powerpc/tm/Makefile
+++ b/tools/testing/selftests/powerpc/tm/Makefile
@@ -5,7 +5,7 @@ SIGNAL_CONTEXT_CHK_TESTS := tm-signal-context-chk-gpr tm-signal-context-chk-fpu
 TEST_GEN_PROGS := tm-resched-dscr tm-syscall tm-signal-msr-resv tm-signal-stack \
	tm-vmxcopy tm-fork tm-tar tm-tmspr tm-vmx-unavail tm-unavailable tm-trap \
$(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn tm-signal-sigreturn-nt \
-   tm-signal-context-force-tm
+   tm-signal-context-force-tm tm-poison
 
 top_srcdir = ../../../../..
 include ../../lib.mk
diff --git a/tools/testing/selftests/powerpc/tm/tm-poison.c b/tools/testing/selftests/powerpc/tm/tm-poison.c
new file mode 100644
index 00..0113da7a8d
--- /dev/null
+++ b/tools/testing/selftests/powerpc/tm/tm-poison.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2019, Gustavo Romero, Michael Neuling, IBM Corp.
+ *
+ * This test will spawn two processes. Both will be attached to the same
+ * CPU (CPU 0). The child will be in a loop writing to FP register f31 and
+ * VMX/VEC/Altivec register vr31 a known value, called poison, calling
+ * sched_yield syscall after to allow the parent to switch on the CPU.
+ * Parent will set f31 and vr31 to 1 and in a loop will check if f31 and
+ * vr31 remain 1 as expected until a given timeout (2m). If the issue is
+ * present child's poison will leak into parent's f31 or vr31 registers,
+ * otherwise, poison will never leak into parent's f31 and vr31 registers.
+ *
+ * This test for CVE-2019-15030 and CVE-2019-15031.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tm.h"
+
+int tm_poison_test(void)
+{
+   int pid;
+   cpu_set_t cpuset;
+   uint64_t poison = 0xdeadbeefc0dec0fe;
+   uint64_t unknown = 0;
+   bool fail_fp = false;
+   bool fail_vr = false;
+
+   SKIP_IF(!have_htm());
+
+   /* Attach both Child and Parent to CPU 0 */
+   CPU_ZERO(&cpuset);
+   CPU_SET(0, &cpuset);
+   sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+   pid = fork();
+   if (!pid) {
+   /**
+* child
+*/
+   while (1) {
+   sched_yield();
+   asm (
+   "mtvsrd 31, %[poison];" // f31 = poison
+   "mtvsrd 63, %[poison];" // vr31 = poison
+
+   : : [poison] "r" (poison) : );
+   }
+   }
+
+   /**
+* parent
+*/
+   asm (
+   /*
+* Set r3, r4, and f31 to known value 1 before entering
+* in transaction. They won't be written after that.
+*/
+   "   li  3, 0x1  ;"
+   "   li  4, 0x1  ;"
+   "   mtvsrd  31, 4   ;"
+
+   /*
+* The Time Base (TB) is a 64-bit counter register that is
+* independent of the CPU clock and which is incremented
+* at a frequency of 512000000 Hz, so every 1.953125ns.
+* So it's necessary 120s/0.000000001953125s = 61440000000
+* increments to get a 2 minutes timeout. Below we set that
+* value in r5 and then use r6 to track initial TB value,
+* updating TB values in r7 at every iteration and comparing it
+* to r6. When r7 (current) - r6 (initial) > 61440000000 we bail
+* out since for sure we spent already 2 minutes in the loop.
+* SPR 268 is the TB register.
+*/
+   "   lis 5, 14   ;"
+   "   ori 5, 5, 19996 ;"
+   "   sldi5, 5, 16;" // r5 = 6144000
+
+   "   mfspr   6, 268  ;" // r6 (TB initial)
+  

RE: [PATCH v3 01/11] PCI: designware-ep: Add multiple PFs support for DWC

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 3 September 2019 0:26
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> gre...@linuxfoundation.org; Z.q. Hou ;
> a...@arndb.de
> Subject: Re: [PATCH v3 01/11] PCI: designware-ep: Add multiple PFs support
> for DWC
> 
> On Mon, Sep 02, 2019 at 11:17:06AM +0800, Xiaowei Bao wrote:
> > Add multiple PFs support for DWC, different PF have different config
> > space we use pf-offset property which get from the DTS to access the
> > different pF
> 
> This needs to be updated as this no longer comes from the DT.

Yes, thanks

Thanks
Xiaowei

> 
> > config space.
> >
> > Signed-off-by: Xiaowei Bao 
> 
> 
> We're assuming:
> 
>  - The offset address (func_offset) between PF's in the memory map can be
>different between different DWC implementations. And also that it's
>possible for DWC implementations to address PFs without using an offset.
> 
>  - The current approach is preferable to adding DWC EP driver callbacks
>for writing to the EP config space (e.g. a variant of dw_pcie_writew_dbi
>that takes a func number).

Even if we use a variant of dw_pcie_writew_dbi, we would still need an offset value
from each platform, because different platforms may implement this differently, so
I am not sure how such a variant of dw_pcie_writew_dbi should be implemented.
  
> 
> I'm keen to hear feedback from Jingoo/Gustavo on this.

OK, expect the feedback.

Thanks 
Xiaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> > ---
> > v2:
> >  - Remove duplicate redundant code.
> >  - Reimplement the PF config space access way.
> > v3:
> >  - Integrate duplicate code for func_select.
> >  - Move PCIE_ATU_FUNC_NUM(pf) (pf << 20) to ((pf) << 20).
> >  - Add the comments for func_conf_select function.
> >
> >  drivers/pci/controller/dwc/pcie-designware-ep.c | 123
> 
> >  drivers/pci/controller/dwc/pcie-designware.c|  59 
> >  drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
> >  3 files changed, 142 insertions(+), 58 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 65f4792..eb851c2 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -19,12 +19,26 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
> > pci_epc_linkup(epc);
> >  }
> >
> > -static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno
> bar,
> > -  int flags)
> > +static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8
> > +func_no) {
> > +   unsigned int func_offset = 0;
> > +
> > +   if (ep->ops->func_conf_select)
> > +   func_offset = ep->ops->func_conf_select(ep, func_no);
> > +
> > +   return func_offset;
> > +}
> > +
> > +static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, u8 func_no,
> > +  enum pci_barno bar, int flags)
> >  {
> > u32 reg;
> > +   unsigned int func_offset = 0;
> > +   struct dw_pcie_ep *ep = >ep;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -   reg = PCI_BASE_ADDRESS_0 + (4 * bar);
> > +   reg = func_offset + PCI_BASE_ADDRESS_0 + (4 * bar);
> > dw_pcie_dbi_ro_wr_en(pci);
> > dw_pcie_writel_dbi2(pci, reg, 0x0);
> > dw_pcie_writel_dbi(pci, reg, 0x0);
> > @@ -37,7 +51,12 @@ static void __dw_pcie_ep_reset_bar(struct dw_pcie
> > *pci, enum pci_barno bar,
> >
> >  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)  {
> > -   __dw_pcie_ep_reset_bar(pci, bar, 0);
> > +   u8 func_no, funcs;
> > +
> > +   funcs = pci->ep.epc->max_functions;
> > +
> > +   for (func_no = 0; func_no < funcs; func_no++)
> > +   __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
> >  }
> >
> >  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> > @@ -45,28 +64,31 @@ static int dw_pcie_ep_write_header(struct pci_epc
> > *epc, u8 func_no,  {
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   unsigned int func_offset = 0;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > dw_pcie_dbi_ro_wr_en(pci);
> > -   dw_pcie_writew_dbi(pci, PCI_VENDOR_ID, hdr->vendorid);
> > -   dw_pcie_writew_dbi(pci, PCI_DEVICE_ID, hdr->deviceid);
> > -   dw_pcie_writeb_dbi(pci, PCI_REVISION_ID, hdr->revid);
> > -   dw_pcie_writeb_dbi(pci, PCI_CLASS_PROG, hdr->progif_code);
> > -   dw_pcie_writew_dbi(pci, PCI_CLASS_DEVICE,
> > +   dw_pcie_writew_dbi(pci, func_offset + PCI_VENDOR_ID, hdr->vendorid);
> > +   

Re: [PATCH] sysfs: add BIN_ATTR_WO() macro

2019-09-02 Thread Michael Ellerman
Greg Kroah-Hartman  writes:
> This variant was missing from sysfs.h, I guess no one noticed it before.
>
> Turns out the powerpc secure variable code can use it, so add it to the
> tree for it, and potentially others to take advantage of, instead of
> open-coding it.
>
> Reported-by: Nayna Jain 
> Signed-off-by: Greg Kroah-Hartman 
> ---
>
> I'll queue this up to my tree for 5.4-rc1, but if you want to take this
> in your tree earlier, feel free to do so.

OK. This series is blocked on the firmware support going in, so at the
moment it might miss v5.4 anyway. So this going via your tree is no
problem.

If it does make it into v5.4 we can do a fixup patch to use the new
macro once everything's in Linus' tree.

cheers

> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index 965236795750..5420817ed317 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -196,6 +196,12 @@ struct bin_attribute {
>   .size   = _size,\
>  }
>  
> +#define __BIN_ATTR_WO(_name, _size) {\
> + .attr   = { .name = __stringify(_name), .mode = 0200 }, \
> + .store  = _name##_store,\
> + .size   = _size,\
> +}
> +
>  #define __BIN_ATTR_RW(_name, _size)  \
>   __BIN_ATTR(_name, 0644, _name##_read, _name##_write, _size)
>  
> @@ -208,6 +214,9 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR(_name, _mode, _read,   \
>  #define BIN_ATTR_RO(_name, _size)\
>  struct bin_attribute bin_attr_##_name = __BIN_ATTR_RO(_name, _size)
>  
> +#define BIN_ATTR_WO(_name, _size)\
> +struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
> +
>  #define BIN_ATTR_RW(_name, _size)\
>  struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
>  
> -- 
> 2.23.0


Re: [PATCH 6/6] powerpc/64s/radix: introduce options to disable use of the tlbie instruction

2019-09-02 Thread Nicholas Piggin
Alistair Popple's on September 3, 2019 10:32 am:
> Nick,
> 
> On Tuesday, 3 September 2019 1:29:31 AM AEST Nicholas Piggin wrote:
>> Introduce two options to control the use of the tlbie instruction. A
>> boot time option which completely disables the kernel using the
>> instruction, this is currently incompatible with HASH MMU, KVM, and
>> coherent accelerators.
> 
> Some accelerators (eg. cxl, ocxl, npu) call mm_context_add_copro() to force 
> global TLB invalidations:
> 
> static inline void mm_context_add_copro(struct mm_struct *mm)
> {
> /*
>  * If any copro is in use, increment the active CPU count
>  * in order to force TLB invalidations to be global as to
>  * propagate to the Nest MMU.
>  */
> if (atomic_inc_return(>context.copros) == 1)
> inc_mm_active_cpus(mm);
> }
> 
> Admittedly I haven't dug into all the details of this patch but it sounds 
> like 
> it might break the above if TLBIE is disabled. Do you think we should add a 
> WARN_ON if mm_context_add_copro() is called with TLBIE disabled? Or perhaps 
> even force TLBIE to be re-enabled if it is called with it disabled?

The patch has two flags, "enabled" and "capable". If capable is false
then it prevents cxl, oxcl, and KVM from loading. I think NPU is gone
from the tree now. Hash MMU won't work either, but for now you can't
mark !capable with hash.

If enabled is false but capable is true, then it avoids tlbie for
flushing the CPU translations, but will also use it to flush nMMU
coprocessors (and KVM for partition scope, but hopefully that can
be made to work with tlbiel as well).

So this should be fine. Could put a BUG in there if !tlbie_capable,
because we can't continue -- idea is that tlbie could be broken or
not implemented or we want to test without ever using it.

Thanks,
Nick



RE: [PATCH v3 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 23:07
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 04/11] PCI: designware-ep: Modify MSI and MSIX CAP
> way of finding
> 
> On Mon, Sep 02, 2019 at 11:17:09AM +0800, Xiaowei Bao wrote:
> > Each PF of EP device should have it's own MSI or MSIX capabitily
> > struct, so create a dw_pcie_ep_func struct and remover the msi_cap
> 
> remover?

Sorry. ^_^


> 
> > and msix_cap to this struce, and manage the PFs with a list.
> 
> struce?
> 
> >
> > Signed-off-by: Xiaowei Bao 
> > ---
> > v1:
> >  - This is a new patch, to fix the issue of MSI and MSIX CAP way of
> >finding.
> > v2:
> >  - No change.
> > v3:
> >  - No change.
> 
> This makes it look like you introduced the patch in v1 and haven't changed it
> since.
> 
> I think it's more common to have a history like this:
> 
> ---
> v3:
>  - Introduced new patch, to fix the issue of MSI and MSIX CAP way of
>finding.

OK, thanks, I was not clear on the rules; thanks a lot for your help.

Thanks 
Xiaowei

> 
> 
> >
> >  drivers/pci/controller/dwc/pcie-designware-ep.c | 135
> +---
> >  drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
> >  2 files changed, 134 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index c3bc7bd..144eb12 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
> > pci_epc_linkup(epc);
> >  }
> >
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +   struct dw_pcie_ep_func *ep_func;
> > +
> > +   list_for_each_entry(ep_func, >func_list, list) {
> > +   if (ep_func->func_no == func_no)
> > +   return ep_func;
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> >  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8
> > func_no)  {
> > unsigned int func_offset = 0;
> > @@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci,
> enum pci_barno bar)
> > __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);  }
> >
> > +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8
> func_no,
> > +   u8 cap_ptr, u8 cap)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   unsigned int func_offset = 0;
> > +   u8 cap_id, next_cap_ptr;
> > +   u16 reg;
> > +
> > +   if (!cap_ptr)
> > +   return 0;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +   reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> > +   cap_id = (reg & 0x00ff);
> > +
> > +   if (cap_id > PCI_CAP_ID_MAX)
> > +   return 0;
> > +
> > +   if (cap_id == cap)
> > +   return cap_ptr;
> > +
> > +   next_cap_ptr = (reg & 0xff00) >> 8;
> > +   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> 
> Which tree have you based this patchset on? v5.3-rc3 and pci/dwc both
> already have this function (without the func_no). See beb4641a787d
> ("PCI: dwc: Add MSI-X callbacks handler").

There is a commit 7a6854f68 in the latest kernel.

Thanks 
Xiaowei

> 
> > +
> > +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8
> > +func_no, u8 cap) {
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   unsigned int func_offset = 0;
> > +   u8 next_cap_ptr;
> > +   u16 reg;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +   reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> > +   next_cap_ptr = (reg & 0x00ff);
> > +
> > +   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> > +
> >  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> >struct pci_epf_header *hdr)
> >  {
> > @@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc
> *epc, u8 func_no)
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > u32 val, reg;
> > unsigned int func_offset = 0;
> > +   struct dw_pcie_ep_func *ep_func;
> >
> > -   if (!ep->msi_cap)
> > +   ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +   if (!ep_func)
> > +   return -EINVAL;
> > +
> > +   if (!ep_func->msi_cap)
> > return -EINVAL;
> >
> > func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -   reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +   reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> 

RE: [PATCH v3 07/11] PCI: layerscape: Modify the way of getting capability with different PEX

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 21:38
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 07/11] PCI: layerscape: Modify the way of getting
> capability with different PEX
> 
> On Mon, Sep 02, 2019 at 11:17:12AM +0800, Xiaowei Bao wrote:
> > The different PCIe controller in one board may be have different
> > capability of MSI or MSIX, so change the way of getting the MSI
> > capability, make it more flexible.
> >
> > Signed-off-by: Xiaowei Bao 
> 
> Please see the comments I just made to Kishon's feedback in the thread for
> this patch in series v2.

I have replied to the comments in series v2 and am awaiting Kishon's feedback.

Thanks
Xiaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> > ---
> > v2:
> >  - Remove the repeated assignment code.
> > v3:
> >  - Use ep_func msi_cap and msix_cap to decide the msi_capable and
> >msix_capable of pci_epc_features struct.
> >
> >  drivers/pci/controller/dwc/pci-layerscape-ep.c | 31
> > +++---
> >  1 file changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > index a9c552e..1e07287 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > @@ -22,6 +22,7 @@
> >
> >  struct ls_pcie_ep {
> > struct dw_pcie  *pci;
> > +   struct pci_epc_features *ls_epc;
> >  };
> >
> >  #define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
> > @@ -40,26 +41,31 @@ static const struct of_device_id
> ls_pcie_ep_of_match[] = {
> > { },
> >  };
> >
> > -static const struct pci_epc_features ls_pcie_epc_features = {
> > -   .linkup_notifier = false,
> > -   .msi_capable = true,
> > -   .msix_capable = false,
> > -   .bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
> > -};
> > -
> >  static const struct pci_epc_features*  ls_pcie_ep_get_features(struct
> > dw_pcie_ep *ep)  {
> > -   return _pcie_epc_features;
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> > +
> > +   return pcie->ls_epc;
> >  }
> >
> >  static void ls_pcie_ep_init(struct dw_pcie_ep *ep)  {
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> > +   struct dw_pcie_ep_func *ep_func;
> > enum pci_barno bar;
> >
> > +   ep_func = dw_pcie_ep_get_func_from_ep(ep, 0);
> > +   if (!ep_func)
> > +   return;
> > +
> > for (bar = BAR_0; bar <= BAR_5; bar++)
> > dw_pcie_ep_reset_bar(pci, bar);
> > +
> > +   pcie->ls_epc->msi_capable = ep_func->msi_cap ? true : false;
> > +   pcie->ls_epc->msix_capable = ep_func->msix_cap ? true : false;
> >  }
> >
> >  static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no, @@
> > -119,6 +125,7 @@ static int __init ls_pcie_ep_probe(struct platform_device
> *pdev)
> > struct device *dev = >dev;
> > struct dw_pcie *pci;
> > struct ls_pcie_ep *pcie;
> > +   struct pci_epc_features *ls_epc;
> > struct resource *dbi_base;
> > int ret;
> >
> > @@ -130,6 +137,10 @@ static int __init ls_pcie_ep_probe(struct
> platform_device *pdev)
> > if (!pci)
> > return -ENOMEM;
> >
> > +   ls_epc = devm_kzalloc(dev, sizeof(*ls_epc), GFP_KERNEL);
> > +   if (!ls_epc)
> > +   return -ENOMEM;
> > +
> > dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM,
> "regs");
> > pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
> > if (IS_ERR(pci->dbi_base))
> > @@ -140,6 +151,10 @@ static int __init ls_pcie_ep_probe(struct
> platform_device *pdev)
> > pci->ops = _pcie_ep_ops;
> > pcie->pci = pci;
> >
> > +   ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
> > +
> > +   pcie->ls_epc = ls_epc;
> > +
> > platform_set_drvdata(pdev, pcie);
> >
> > ret = ls_add_pcie_ep(pcie, pdev);
> > --
> > 2.9.5
> >


RE: [PATCH v2 06/10] PCI: layerscape: Modify the way of getting capability with different PEX

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 21:37
> To: Xiaowei Bao 
> Cc: Kishon Vijay Abraham I ; bhelg...@google.com;
> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li
> ; lorenzo.pieral...@arm.co
> ; a...@arndb.de; gre...@linuxfoundation.org;
> M.h. Lian ; Mingkai Hu ;
> Roy Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of getting
> capability with different PEX
> 
> On Fri, Aug 23, 2019 at 04:13:30AM +, Xiaowei Bao wrote:
> >
> >
> > > -Original Message-
> > > From: Kishon Vijay Abraham I 
> > > Sent: 2019年8月23日 11:40
> > > To: Xiaowei Bao ; bhelg...@google.com;
> > > robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> > > robh+Li
> > > ; lorenzo.pieral...@arm.co
> > > ; a...@arndb.de;
> > > gre...@linuxfoundation.org; M.h. Lian ;
> > > Mingkai Hu ; Roy Zang ;
> > > jingooh...@gmail.com; gustavo.pimen...@synopsys.com;
> > > linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > > linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> > > linuxppc-dev@lists.ozlabs.org; andrew.mur...@arm.com
> > > Subject: Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of
> > > getting capability with different PEX
> > >
> > > Hi,
> > >
> > > (Fixed Lorenzo's email address. All the patches in the series have
> > > wrong email
> > > id)
> > >
> > > On 23/08/19 8:09 AM, Xiaowei Bao wrote:
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Kishon Vijay Abraham I 
> > > >> Sent: 2019年8月22日 19:44
> > > >> To: Xiaowei Bao ; bhelg...@google.com;
> > > >> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org;
> > > >> robh+Leo
> > > Li
> > > >> ; lorenzo.pieral...@arm.co; a...@arndb.de;
> > > >> gre...@linuxfoundation.org; M.h. Lian ;
> > > >> Mingkai Hu ; Roy Zang ;
> > > >> jingooh...@gmail.com; gustavo.pimen...@synopsys.com;
> > > >> linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > > >> linux-ker...@vger.kernel.org;
> > > >> linux-arm-ker...@lists.infradead.org;
> > > >> linuxppc-dev@lists.ozlabs.org; andrew.mur...@arm.com
> > > >> Subject: Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of
> > > >> getting capability with different PEX
> > > >>
> > > >> Hi,
> > > >>
> > > >> On 22/08/19 4:52 PM, Xiaowei Bao wrote:
> > > >>> The different PCIe controller in one board may be have different
> > > >>> capability of MSI or MSIX, so change the way of getting the MSI
> > > >>> capability, make it more flexible.
> > > >>
> > > >> please use different pci_epc_features table for different boards.
> > > > Thanks, I think that it will be more flexible to dynamically get
> > > > MSI or MSIX capability, Thus, we will not need to define the
> > > > pci_epc_feature for
> > > different boards.
> > >
> > > Is the restriction because you cannot have different compatible for
> > > different boards?
> > Sorry, I am not very clear what your mean, I think even if I use the
> > same compatible with different boards, each boards will enter the
> > probe function, in there I will get the MSI or MSIX PCIe capability of
> > the current controller in this board. Why do I need to define the
> pci_epc_feature for different boards?
> 
> At present you determine how to set the [msi,msix]_capable flags of
> pci_epc_features based on reading the function capabilities at probe time.
> Instead of doing this, is it possible that you can determine the flags based 
> on
> the compatible type alone? For example, is the MSI/MSIX capability the same
> for all fsl,ls2088a-pcie-ep devices?
> 
> If it isn't *necessary* to probe for this information at probe time, then you
> could instead create a static pci_epc_features structure and assign it to
> something in your drvdata. This may provide some benefits.
> 
> The dw_pcie_ep_get_features function would then look like:
> 
> static const struct pci_epc_features*
> ls_pcie_ep_get_features(struct dw_pcie_ep *ep) {
>   struct dw_pcie *pci = to_dw_pcie_from_pp(ep);
>   struct ls_pcie_ep *pcie = dev_get_drvdata(pci->dev);
>   return pcie->epc_features;
> }
> 
> This also means you can revert "[v3,03/11] PCI: designware-ep: Move the".
> 
> Is this what you had in mind Kishon?

Yes, I considered this scheme, but there is an issue with my board: it has three
PCIe controllers, but only two of them support MSI, so I can't say that the board
as a whole supports the MSI feature. That is why I only set msi_capable by reading
the MSI capability structure of the current PCIe controller. I am quite torn on
this issue; do you have better advice? Thanks a lot.

Thanks 
Xiaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> > >
> > > Thanks
> > > Kishon
> > >
> > > >>
> > > >> Thanks
> > > >> Kishon
> > > >>>
> > > >>> Signed-off-by: Xiaowei Bao 
> > > 

RE: [PATCH v3 10/11] arm64: dts: layerscape: Add PCIe EP node for ls1088a

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 21:06
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 10/11] arm64: dts: layerscape: Add PCIe EP node for
> ls1088a
> 
> On Mon, Sep 02, 2019 at 11:17:15AM +0800, Xiaowei Bao wrote:
> > Add PCIe EP node for ls1088a to support EP mode.
> >
> > Signed-off-by: Xiaowei Bao 
> > ---
> > v2:
> >  - Remove the pf-offset proparty.
> > v3:
> >  - No change.
> >
> >  arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi | 31
> ++
> >  1 file changed, 31 insertions(+)
> >
> > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > index c676d07..da246ab 100644
> > --- a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > @@ -483,6 +483,17 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@340 {
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> 
> Here you specify a fallback "fsl,ls-pcie-ep" that is removed by this series.
> 
> Besides that, this looks OK.

As explained, the "fsl,ls-pcie-ep" compatible is needed because U-Boot fixes up
the status property based on it. I think keeping this compatible is helpful; if
we delete it, I have to modify the bootloader code.

Thanks 
XIaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> > +   reg = <0x00 0x0340 0x0 0x0010
> > +  0x20 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <24>;
> > +   num-ob-windows = <128>;
> > +   max-functions = /bits/ 8 <2>;
> > +   status = "disabled";
> > +   };
> > +
> > pcie@350 {
> > compatible = "fsl,ls1088a-pcie";
> > reg = <0x00 0x0350 0x0 0x0010   /* controller
> registers */
> > @@ -508,6 +519,16 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@350 {
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> > +   reg = <0x00 0x0350 0x0 0x0010
> > +  0x28 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <6>;
> > +   num-ob-windows = <8>;
> > +   status = "disabled";
> > +   };
> > +
> > pcie@360 {
> > compatible = "fsl,ls1088a-pcie";
> > reg = <0x00 0x0360 0x0 0x0010   /* controller
> registers */
> > @@ -533,6 +554,16 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@360 {
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> > +   reg = <0x00 0x0360 0x0 0x0010
> > +  0x30 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <6>;
> > +   num-ob-windows = <8>;
> > +   status = "disabled";
> > +   };
> > +
> > smmu: iommu@500 {
> > compatible = "arm,mmu-500";
> > reg = <0 0x500 0 0x80>;
> > --
> > 2.9.5
> >


RE: [PATCH v3 11/11] misc: pci_endpoint_test: Add LS1088a in pci_device_id table

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 20:55
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 11/11] misc: pci_endpoint_test: Add LS1088a in
> pci_device_id table
> 
> On Mon, Sep 02, 2019 at 11:17:16AM +0800, Xiaowei Bao wrote:
> > Add LS1088a in pci_device_id table so that pci-epf-test can be used
> > for testing PCIe EP in LS1088a.
> >
> > Signed-off-by: Xiaowei Bao 
> > ---
> > v2:
> >  - No change.
> > v3:
> >  - No change.
> >
> >  drivers/misc/pci_endpoint_test.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/misc/pci_endpoint_test.c
> > b/drivers/misc/pci_endpoint_test.c
> > index 6e208a0..d531951 100644
> > --- a/drivers/misc/pci_endpoint_test.c
> > +++ b/drivers/misc/pci_endpoint_test.c
> > @@ -793,6 +793,7 @@ static const struct pci_device_id
> pci_endpoint_test_tbl[] = {
> > { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA74x) },
> > { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA72x) },
> > { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x81c0) },
> > +   { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x80c0) },
> 
> The Freescale PCI devices are the only devices in this table that don't have a
> define for their device ID. I think a define should be created for both of the
> device IDs above.

OK, but I will only define them in this file; I am not sure they can be defined
in the include/linux/pci_ids.h file.

Thanks 
Xiaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> > { PCI_DEVICE_DATA(SYNOPSYS, EDDA, NULL) },
> > { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_AM654),
> >   .driver_data = (kernel_ulong_t)_data
> > --
> > 2.9.5
> >


RE: [PATCH v3 09/11] PCI: layerscape: Add EP mode support for ls1088a and ls2088a

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 20:46
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 09/11] PCI: layerscape: Add EP mode support for
> ls1088a and ls2088a
> 
> On Mon, Sep 02, 2019 at 11:17:14AM +0800, Xiaowei Bao wrote:
> > Add PCIe EP mode support for ls1088a and ls2088a, there are some
> > difference between LS1 and LS2 platform, so refactor the code of the
> > EP driver.
> >
> > Signed-off-by: Xiaowei Bao 
> > ---
> > v2:
> >  - This is a new patch for supporting the ls1088a and ls2088a platform.
> > v3:
> >  - Adjust the some struct assignment order in probe function.
> >
> >  drivers/pci/controller/dwc/pci-layerscape-ep.c | 72
> > +++---
> >  1 file changed, 53 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > index 5f0cb99..723bbe5 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > @@ -20,27 +20,29 @@
> >
> >  #define PCIE_DBI2_OFFSET   0x1000  /* DBI2 base address*/
> >
> > -struct ls_pcie_ep {
> > -   struct dw_pcie  *pci;
> > -   struct pci_epc_features *ls_epc;
> > +#define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
> > +
> > +struct ls_pcie_ep_drvdata {
> > +   u32 func_offset;
> > +   const struct dw_pcie_ep_ops *ops;
> > +   const struct dw_pcie_ops*dw_pcie_ops;
> >  };
> >
> > -#define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
> > +struct ls_pcie_ep {
> > +   struct dw_pcie  *pci;
> > +   struct pci_epc_features *ls_epc;
> > +   const struct ls_pcie_ep_drvdata *drvdata; };
> >
> >  static int ls_pcie_establish_link(struct dw_pcie *pci)  {
> > return 0;
> >  }
> >
> > -static const struct dw_pcie_ops ls_pcie_ep_ops = {
> > +static const struct dw_pcie_ops dw_ls_pcie_ep_ops = {
> > .start_link = ls_pcie_establish_link,  };
> >
> > -static const struct of_device_id ls_pcie_ep_of_match[] = {
> > -   { .compatible = "fsl,ls-pcie-ep",},
> > -   { },
> > -};
> > -
> >  static const struct pci_epc_features*  ls_pcie_ep_get_features(struct
> > dw_pcie_ep *ep)  { @@ -87,10 +89,39 @@ static int
> > ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
> > }
> >  }
> >
> > -static const struct dw_pcie_ep_ops pcie_ep_ops = {
> > +static unsigned int ls_pcie_ep_func_conf_select(struct dw_pcie_ep *ep,
> > +   u8 func_no)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> > +
> > +   WARN_ON(func_no && !pcie->drvdata->func_offset);
> > +   return pcie->drvdata->func_offset * func_no; }
> > +
> > +static const struct dw_pcie_ep_ops ls_pcie_ep_ops = {
> > .ep_init = ls_pcie_ep_init,
> > .raise_irq = ls_pcie_ep_raise_irq,
> > .get_features = ls_pcie_ep_get_features,
> > +   .func_conf_select = ls_pcie_ep_func_conf_select, };
> > +
> > +static const struct ls_pcie_ep_drvdata ls1_ep_drvdata = {
> > +   .ops = _pcie_ep_ops,
> > +   .dw_pcie_ops = _ls_pcie_ep_ops,
> > +};
> > +
> > +static const struct ls_pcie_ep_drvdata ls2_ep_drvdata = {
> > +   .func_offset = 0x20000,
> > +   .ops = _pcie_ep_ops,
> > +   .dw_pcie_ops = _ls_pcie_ep_ops,
> > +};
> > +
> > +static const struct of_device_id ls_pcie_ep_of_match[] = {
> > +   { .compatible = "fsl,ls1046a-pcie-ep", .data = _ep_drvdata },
> > +   { .compatible = "fsl,ls1088a-pcie-ep", .data = _ep_drvdata },
> > +   { .compatible = "fsl,ls2088a-pcie-ep", .data = _ep_drvdata },
> > +   { },
> 
> This removes support for "fsl,ls-pcie-ep" - was that intentional? If you do 
> plan
> to drop it please make sure you explain why in the commit message. See also
> my comments in your dt-binding patch.

In fact, U-Boot fixes up the status property to 'status = enabled' in the
PCI node of the DTS based on the "fsl,ls-pcie-ep" compatible, so
"fsl,ls-pcie-ep" is in use. I used this compatible before because the
driver only supported the LS1046a, but this time I added LS1088a and
LS2088a support, and these two boards differ in some respects from the
LS1046a, so I changed the compatible. I am not sure whether I need to add
"fsl,ls-pcie-ep" there; could you give some advice? Thanks a lot.

Thanks 
Xiaowei
 
> 
> Thanks,
> 
> Andrew Murray
> 
> >  };
> >
> >  static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
> > @@ -103,7 +134,7 @@ static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
> > int ret;
> >
> > 

RE: [PATCH v3 05/11] dt-bindings: pci: layerscape-pci: add compatible strings for ls1088a and ls2088a

2019-09-02 Thread Xiaowei Bao


> -Original Message-
> From: Andrew Murray 
> Sent: 2019年9月2日 20:32
> To: Xiaowei Bao 
> Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> Li ; kis...@ti.com; lorenzo.pieral...@arm.com; M.h.
> Lian ; Mingkai Hu ; Roy
> Zang ; jingooh...@gmail.com;
> gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> a...@arndb.de; gre...@linuxfoundation.org; Z.q. Hou
> 
> Subject: Re: [PATCH v3 05/11] dt-bindings: pci: layerscape-pci: add compatible
> strings for ls1088a and ls2088a
> 
> On Mon, Sep 02, 2019 at 11:17:10AM +0800, Xiaowei Bao wrote:
> > Add compatible strings for ls1088a and ls2088a.
> >
> > Signed-off-by: Xiaowei Bao 
> > ---
> > v2:
> >  - No change.
> > v3:
> >  - Use one valid combination of compatible strings.
> >
> >  Documentation/devicetree/bindings/pci/layerscape-pci.txt | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> > index e20ceaa..762ae41 100644
> > --- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> > +++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> > @@ -22,7 +22,9 @@ Required properties:
> >  "fsl,ls1043a-pcie"
> >  "fsl,ls1012a-pcie"
> >EP mode:
> > -   "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"
> > +   "fsl,ls1046a-pcie-ep" "fsl,ls-pcie-ep"
> > +   "fsl,ls1088a-pcie-ep" "fsl,ls-pcie-ep"
> > +   "fsl,ls2088a-pcie-ep" "fsl,ls-pcie-ep"
> 
> This isn't consistent with "[PATCH v3 09/11] PCI: layerscape: Add EP mode..."
> as that patch drops the fallback "fsl,ls-pcie-ep". Either the fallback must be
> preserved in the driver, or you need to drop it here.
> 
> What if there are existing users that depend on the fallback?
> 
> (I'm also not sure if that comma should have been dropped).

Hi Andrew,

Thanks for your comments, I will drop the comma.

Thanks 
Xiaowei

> 
> Thanks,
> 
> Andrew Murray
> 
> >  - reg: base addresses and lengths of the PCIe controller register blocks.
> >  - interrupts: A list of interrupt outputs of the controller. Must contain an
> >    entry for each entry in the interrupt-names property.
> > --
> > 2.9.5
> >


Re: [PATCH 6/6] powerpc/64s/radix: introduce options to disable use of the tlbie instruction

2019-09-02 Thread Alistair Popple
Nick,

On Tuesday, 3 September 2019 1:29:31 AM AEST Nicholas Piggin wrote:
> Introduce two options to control the use of the tlbie instruction. The
> first is a boot-time option which completely disables the kernel's use
> of the instruction; this is currently incompatible with the hash MMU,
> KVM, and coherent accelerators.

Some accelerators (eg. cxl, ocxl, npu) call mm_context_add_copro() to force 
global TLB invalidations:

static inline void mm_context_add_copro(struct mm_struct *mm)
{
/*
 * If any copro is in use, increment the active CPU count
 * in order to force TLB invalidations to be global as to
 * propagate to the Nest MMU.
 */
if (atomic_inc_return(&mm->context.copros) == 1)
inc_mm_active_cpus(mm);
}

Admittedly I haven't dug into all the details of this patch but it sounds like 
it might break the above if TLBIE is disabled. Do you think we should add a 
WARN_ON if mm_context_add_copro() is called with TLBIE disabled? Or perhaps 
even force TLBIE to be re-enabled if it is called with it disabled?

- Alistair

> The second is a debugfs option which can be switched at runtime and
> avoids using tlbie for invalidating CPU TLBs for normal process and
> kernel address mappings. Coherent accelerators are still managed with
> tlbie, as will KVM partition-scope translations.
> 
> Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
> basic implementation which does not attempt to make any optimisation
> beyond the tlbie implementation.
> 
> This is useful for performance testing among other things. For example
> in certain situations on large systems, using IPIs may be faster than
> tlbie as they can be directed rather than broadcast. Later we may also
> take advantage of the IPIs to do more interesting things such as trim
> the mm cpumask more aggressively.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  .../admin-guide/kernel-parameters.txt |   4 +
>  arch/powerpc/include/asm/book3s/64/tlbflush.h |   9 +
>  arch/powerpc/kvm/book3s_hv.c  |   6 +
>  arch/powerpc/mm/book3s64/pgtable.c|  47 +
>  arch/powerpc/mm/book3s64/radix_tlb.c  | 190 --
>  drivers/misc/cxl/main.c   |   4 +
>  drivers/misc/ocxl/main.c  |   4 +
>  7 files changed, 246 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index d3cbb3ae62b6..65ae16549aa3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -860,6 +860,10 @@
>   disable_radix   [PPC]
>   Disable RADIX MMU mode on POWER9
>  
> + disable_tlbie   [PPC]
> + Disable TLBIE instruction. Currently does not work
> + with KVM, with HASH MMU, or with coherent accelerators.
> +
>   disable_cpu_apicid= [X86,APIC,SMP]
>   Format: 
>   The number of initial APIC ID for the
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index ebf572ea621e..7aa8195b6cff 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -162,4 +162,13 @@ static inline void flush_tlb_pgtable(struct mmu_gather *tlb, unsigned long addre
>  
>   radix__flush_tlb_pwc(tlb, address);
>  }
> +
> +extern bool tlbie_capable;
> +extern bool tlbie_enabled;
> +
> +static inline bool cputlb_use_tlbie(void)
> +{
> + return tlbie_enabled;
> +}
> +
>  #endif /*  _ASM_POWERPC_BOOK3S_64_TLBFLUSH_H */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index cde3f5a4b3e4..3cdaa2a09a19 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5462,6 +5462,12 @@ static int kvmppc_radix_possible(void)
>  static int kvmppc_book3s_init_hv(void)
>  {
>   int r;
> +
> + if (!tlbie_capable) {
> + pr_err("KVM-HV: Host does not support TLBIE\n");
> + return -ENODEV;
> + }
> +
>   /*
>* FIXME!! Do we need to check on all cpus ?
>*/
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 351eb78eed55..75483b40fcb1 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -469,3 +470,49 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>  
>   return true;
>  }
> +
> +/*
> + * Does the CPU support tlbie?
> + */
> +bool tlbie_capable __read_mostly = true;
> +EXPORT_SYMBOL(tlbie_capable);
> +
> +/*
> + * Should tlbie be used for management of CPU TLBs, for kernel and process
> + * address spaces? tlbie may still be used for nMMU accelerators, and for KVM
> + * guest address spaces.
> + */
> +bool 

Re: [PATCH v7 5/6] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2019-09-02 Thread Michael Ellerman
Michal Suchánek  writes:
> On Mon, 02 Sep 2019 12:03:12 +1000
> Michael Ellerman  wrote:
>
>> Michal Suchanek  writes:
>> > On bigendian ppc64 it is common to have 32bit legacy binaries but much
>> > less so on littleendian.  
>> 
>> I think the toolchain people will tell you that there is no 32-bit
>> little endian ABI defined at all, if anything works it's by accident.
>
> I have seen a piece of software that works around code issues on 64bit
> by always compiling 32bit code. So it does work in some way.

What software is that?

> Also it has been pointed out that you can still switch to BE even with
> the 'fast-switch' removed.

Yes we have a proper syscall for endian switching, sys_switch_endian(),
which is definitely supported.

But that *only* switches the endian-ness of the process, it does nothing
to the syscall layer. So any process that switches to the other endian
must endian flip syscall arguments (that aren't in registers), or flip
back to the native endian before calling syscalls.

>> So I think we should not make this selectable, unless someone puts their
>> hand up to say they want it and are willing to test it and keep it
>> working.
>
> I don't really care either way.

Sure. We'll see if anyone else speaks up.

cheers


Re: [PATCH v7 5/6] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2019-09-02 Thread Michael Ellerman
Segher Boessenkool  writes:
> On Mon, Sep 02, 2019 at 12:03:12PM +1000, Michael Ellerman wrote:
>> Michal Suchanek  writes:
>> > On bigendian ppc64 it is common to have 32bit legacy binaries but much
>> > less so on littleendian.
>> 
>> I think the toolchain people will tell you that there is no 32-bit
>> little endian ABI defined at all, if anything works it's by accident.

> There of course is a lot of powerpcle-* support.  The ABI used for it
> on linux is the SYSV ABI, just like on BE 32-bit.

I was talking about ELFv2, which is 64-bit only. But that was based on
me thinking we had a hard assumption in the kernel that ppc64le kernels
always expect ELFv2 userland. Looking at the code though I was wrong
about that, it looks like we will run little endian ELFv1 binaries,
though I don't think anyone is testing it.

> There also is specific powerpcle-linux support in GCC, and in binutils,
> too.  Also, config.guess/config.sub supports it.  Half a year ago this
> all built fine (no, I don't test it often either).
>
> I don't think glibc supports it though, so I wonder if anyone builds an
> actual system with it?  Maybe busybox or the like?
>
>> So I think we should not make this selectable, unless someone puts their
>> hand up to say they want it and are willing to test it and keep it
>> working.
>
> What about actual 32-bit LE systems?  Does anyone still use those?

Not that I've ever heard of.

cheers


Re: [PATCH v7 3/6] powerpc/perf: consolidate read_user_stack_32

2019-09-02 Thread Michael Ellerman
Michal Suchánek  writes:
> On Mon, 02 Sep 2019 14:01:17 +1000
> Michael Ellerman  wrote:
>> Michael Ellerman  writes:
>> > Michal Suchanek  writes:  
>> ...
>> >> @@ -295,6 +279,12 @@ static inline int current_is_64bit(void)
>> >>  }
>> >>  
>> >>  #else  /* CONFIG_PPC64 */
>> >> +static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
>> >> +{
>> >> + return 0;
>> >> +}
>> >> +#endif /* CONFIG_PPC64 */  
>> >
>> > Ending the PPC64 else case here, and then restarting it below with an
>> > ifndef means we end up with two parts of the file that define 32-bit
>> > code, with a common chunk in the middle, which I dislike.
>> >
>> > I'd rather you add the empty read_user_stack_slow() in the existing
>> > #else section and then move read_user_stack_32() below the whole ifdef
>> > PPC64/else/endif section.
>> >
>> > Is there some reason that doesn't work?  
>> 
>> Gah, I missed that you split the whole file later in the series. Any
>> reason you did it in two steps rather than moving patch 6 earlier in the
>> series?
>
> To make this patch readable.

But it's not very readable :)

You also retained the comment about the 32-bit behaviour which is now a
bit confusing, because the function is used on both 32 & 64-bit.

I think moving it as I suggested in my first reply makes for a better
diff, something like eg:

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..82c0f81b89a5 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -165,22 +165,6 @@ static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
return read_user_stack_slow(ptr, ret, 8);
 }
 
-static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
-{
-   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
-   ((unsigned long)ptr & 3))
-   return -EFAULT;
-
-   pagefault_disable();
-   if (!__get_user_inatomic(*ret, ptr)) {
-   pagefault_enable();
-   return 0;
-   }
-   pagefault_enable();
-
-   return read_user_stack_slow(ptr, ret, 4);
-}
-
 static inline int valid_user_sp(unsigned long sp, int is_64)
 {
	if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x100000000UL) - 32)
@@ -295,27 +279,6 @@ static inline int current_is_64bit(void)
 }
 
 #else  /* CONFIG_PPC64 */
-/*
- * On 32-bit we just access the address and let hash_page create a
- * HPTE if necessary, so there is no need to fall back to reading
- * the page tables.  Since this is called at interrupt level,
- * do_page_fault() won't treat a DSI as a page fault.
- */
-static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
-{
-   int rc;
-
-   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
-   ((unsigned long)ptr & 3))
-   return -EFAULT;
-
-   pagefault_disable();
-   rc = __get_user_inatomic(*ret, ptr);
-   pagefault_enable();
-
-   return rc;
-}
-
 static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx *entry,
					  struct pt_regs *regs)
 {
@@ -333,6 +296,11 @@ static inline int valid_user_sp(unsigned long sp, int is_64)
return 1;
 }
 
+static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
+{
+   return 0;
+}
+
 #define __SIGNAL_FRAMESIZE32   __SIGNAL_FRAMESIZE
 #define sigcontext32   sigcontext
 #define mcontext32 mcontext
@@ -341,6 +309,33 @@ static inline int valid_user_sp(unsigned long sp, int is_64)
 
 #endif /* CONFIG_PPC64 */
 
+static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
+{
+   int rc;
+
+   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
+   ((unsigned long)ptr & 3))
+   return -EFAULT;
+
+   pagefault_disable();
+   rc = __get_user_inatomic(*ret, ptr);
+   pagefault_enable();
+
+   /*
+* On 32-bit we just access the address and let hash_page() create a
+* HPTE if necessary, so there is no need to fall back to reading the
+* page tables. Since this is called at interrupt level, do_page_fault()
+* won't treat a DSI as a page fault.
+*
+* On 64-bit if the access faults we fall back to
+* read_user_stack_slow(), see the comment there for more details.
+*/
+   if (IS_ENABLED(CONFIG_PPC64) && rc)
+   return read_user_stack_slow(ptr, ret, 4);
+
+   return rc;
+}
+
 /*
  * Layout for non-RT signal frames
  */


cheers


[PATCH v2] powerpc/64: system call implement the bulk of the logic in C

2019-09-02 Thread Nicholas Piggin
System call entry and particularly exit code is beyond the limit of what
is reasonable to implement in asm.

This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.

Null syscall test is about 5% faster after this patch, because the exit
work is handled under local_irq_disable, and the hard mask and pending
interrupt replay is handled after that, which avoids games with MSR.

Signed-off-by: Nicholas Piggin 
---
Since v1:
- Fix big endian build (mpe)
- Fix order of exit tracing to after the result registers have been set.
- Move ->softe store before MSR[EE] is set, fix the now obsolete comment.
- More #ifdef tidyups and writing the accounting helpers nicer (Christophe)
- Minor things like move the TM debug store into C

 arch/powerpc/include/asm/asm-prototypes.h |  11 -
 .../powerpc/include/asm/book3s/64/kup-radix.h |  10 +-
 arch/powerpc/include/asm/cputime.h|  24 ++
 arch/powerpc/include/asm/ptrace.h |   3 +
 arch/powerpc/include/asm/signal.h |   3 +
 arch/powerpc/include/asm/switch_to.h  |   5 +
 arch/powerpc/include/asm/time.h   |   3 +
 arch/powerpc/kernel/Makefile  |   3 +-
 arch/powerpc/kernel/entry_64.S| 337 +++---
 arch/powerpc/kernel/signal.h  |   2 -
 arch/powerpc/kernel/syscall_64.c  | 183 ++
 11 files changed, 277 insertions(+), 307 deletions(-)
 create mode 100644 arch/powerpc/kernel/syscall_64.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index ec1c97a8e8cb..f00ef8924a99 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -92,14 +92,6 @@ long sys_switch_endian(void);
 notrace unsigned int __check_irq_replay(void);
 void notrace restore_interrupts(void);
 
-/* ptrace */
-long do_syscall_trace_enter(struct pt_regs *regs);
-void do_syscall_trace_leave(struct pt_regs *regs);
-
-/* process */
-void restore_math(struct pt_regs *regs);
-void restore_tm_state(struct pt_regs *regs);
-
 /* prom_init (OpenFirmware) */
 unsigned long __init prom_init(unsigned long r3, unsigned long r4,
   unsigned long pp,
@@ -110,9 +102,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned long r4,
 void __init early_setup(unsigned long dt_ptr);
 void early_setup_secondary(void);
 
-/* time */
-void accumulate_stolen_time(void);
-
 /* misc runtime */
 extern u64 __bswapdi2(u64);
 extern s64 __lshrdi3(s64, int);
diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index f254de956d6a..83f6bda0542b 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -3,6 +3,7 @@
 #define _ASM_POWERPC_BOOK3S_64_KUP_RADIX_H
 
 #include 
+#include 
 
 #define AMR_KUAP_BLOCK_READUL(0x4000)
 #define AMR_KUAP_BLOCK_WRITE   UL(0x8000)
@@ -56,7 +57,14 @@
 
 #ifdef CONFIG_PPC_KUAP
 
-#include 
+#include 
+#include 
+
+static inline void kuap_check_amr(void)
+{
+	if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG) && mmu_has_feature(MMU_FTR_RADIX_KUAP))
+   WARN_ON_ONCE(mfspr(SPRN_AMR) != AMR_KUAP_BLOCKED);
+}
 
 /*
  * We support individually allowing read or write, but we don't support nesting
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 2431b4ada2fa..c43614cffaac 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -60,6 +60,30 @@ static inline void arch_vtime_task_switch(struct task_struct *prev)
 }
 #endif
 
+static inline void account_cpu_user_entry(void)
+{
+   unsigned long tb = mftb();
+   struct cpu_accounting_data *acct = get_accounting(current);
+
+   acct->utime += (tb - acct->starttime_user);
+   acct->starttime = tb;
+}
+static inline void account_cpu_user_exit(void)
+{
+   unsigned long tb = mftb();
+   struct cpu_accounting_data *acct = get_accounting(current);
+
+   acct->stime += (tb - acct->starttime);
+   acct->starttime_user = tb;
+}
+
 #endif /* __KERNEL__ */
+#else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+static inline void account_cpu_user_entry(void)
+{
+}
+static inline void account_cpu_user_exit(void)
+{
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index feee1b21bbd5..af363086403a 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -138,6 +138,9 @@ extern unsigned long profile_pc(struct pt_regs *regs);
 #define profile_pc(regs) instruction_pointer(regs)
 #endif
 
+long do_syscall_trace_enter(struct pt_regs *regs);
+void do_syscall_trace_leave(struct pt_regs *regs);
+
 #define kernel_stack_pointer(regs) ((regs)->gpr[1])
 static inline int 

Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Peter Zijlstra
On Mon, Sep 02, 2019 at 08:22:52PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:

> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index f0dd8e38fee3..2caf204966a0 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
> > dev->kobj.parent = kobj;
> >  
> > /* use parent numa_node */
> > -   if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> > -   set_dev_node(dev, dev_to_node(parent));
> > +   if (dev_to_node(dev) == NUMA_NO_NODE) {
> > +   if (parent)
> > +   set_dev_node(dev, dev_to_node(parent));
> > +#ifdef CONFIG_NUMA
> > +   else {
> > +		pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
> > +   set_dev_node(dev, 0);
> > +   }
> > +#endif
> 
> BTW., is firmware required to always provide a NUMA node on NUMA systems?
> 
> I.e. do we really want this warning on non-NUMA systems that don't assign 
> NUMA nodes?

Good point; we might have to exclude nr_node_ids==1 systems from
warning.

> Also, even on NUMA systems, is firmware required to provide a NUMA node - 
> i.e. is it in principle invalid to offer no NUMA binding?

I think so; a device needs to be _somewhere_, right? Typically though;
devices are on a PCI bus, and the PCI bridge itself will have a NUMA
binding and then the above parent rule will make everything just work.

But I don't see how you can be outside of the NUMA topology.


Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
> > On 2019/9/2 15:25, Peter Zijlstra wrote:
> > > On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> > >> On 2019/9/1 0:12, Peter Zijlstra wrote:
> > > 
> > >>> 1) because even it is not set, the device really does belong to a node.
> > >>> It is impossible a device will have magic uniform access to memory when
> > >>> CPUs cannot.
> > >>
> > >> So it means dev_to_node() will return either NUMA_NO_NODE or a
> > >> valid node id?
> > > 
> > > NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> > > said, not a valid device location on a NUMA system.
> > > 
> > > Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> > > node association. It just means we don't know and might have to guess.
> > 
> > How do we guess the device's location when ACPI/BIOS does not set it?
> 
> See device_add(), it looks to the device's parent and on NO_NODE, puts
> it there.
> 
> Lacking any hints, just stick it to node0 and print a FW_BUG or
> something.
> 
> > It seems dev_to_node() does not do anything about that and leave the
> > job to the caller or whatever function that get called with its return
> > value, such as cpumask_of_node().
> 
> Well, dev_to_node() doesn't do anything; nor should it. It are the
> callers of set_dev_node() that should be taking care.
> 
> Also note how device_add() sets the device node to the parent device's
> node on NUMA_NO_NODE. Arguably we should change it to complain when it
> finds NUMA_NO_NODE and !parent.
> 
> ---
>  drivers/base/core.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index f0dd8e38fee3..2caf204966a0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
>   dev->kobj.parent = kobj;
>  
>   /* use parent numa_node */
> - if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> - set_dev_node(dev, dev_to_node(parent));
> + if (dev_to_node(dev) == NUMA_NO_NODE) {
> + if (parent)
> + set_dev_node(dev, dev_to_node(parent));
> +#ifdef CONFIG_NUMA
> + else {
> +		pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
> + set_dev_node(dev, 0);
> + }
> +#endif

BTW., is firmware required to always provide a NUMA node on NUMA systems?

I.e. do we really want this warning on non-NUMA systems that don't assign 
NUMA nodes?

Also, even on NUMA systems, is firmware required to provide a NUMA node - 
i.e. is it in principle invalid to offer no NUMA binding?

Thanks,

Ingo


Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
> cpumask_of_node() already does this (although it wants the below fix).
> 
> ---
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e6dad600614c..5f49c10201c7 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,7 +861,7 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> - if (node >= nr_node_ids) {
> + if ((unsigned)node >= nr_node_ids) {
>   printk(KERN_WARNING
>   "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>   node, nr_node_ids);

Nitpicking: please also fix the kernel message to say ">=".

With that:

Acked-by: Ingo Molnar 

Note that:

- arch/arm64/mm/numa.c copied the same sign bug (or unrobustness if we 
  don't want to call it a bug).

- Kudos to the mm/memcontrol.c and kernel/bpf/syscall.c teams for not 
  copying that bug. Booo for none of them fixing the buggy pattern 
  elsewhere in the kernel ;-)

Thanks,

Ingo


Re: [PATCH v3 01/11] PCI: designware-ep: Add multiple PFs support for DWC

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:06AM +0800, Xiaowei Bao wrote:
> Add multiple PFs support for DWC, different PF have different config space
> we use pf-offset property which get from the DTS to access the different pF

This needs to be updated as this no longer comes from the DT.

> config space.
> 
> Signed-off-by: Xiaowei Bao 


We're assuming:

 - The offset address (func_offset) between PF's in the memory map can be
   different between different DWC implementations. And also that it's
   possible for DWC implementations to address PFs without using an offset.

 - The current approach is preferable to adding DWC EP driver callbacks
   for writing to the EP config space (e.g. a variant of dw_pcie_writew_dbi
   that takes a func number).

I'm keen to hear feedback from Jingoo/Gustavo on this.

Thanks,

Andrew Murray

> ---
> v2:
>  - Remove duplicate redundant code.
>  - Reimplement the PF config space access way.
> v3:
>  - Integrate duplicate code for func_select.
>  - Move PCIE_ATU_FUNC_NUM(pf) (pf << 20) to ((pf) << 20).
>  - Add the comments for func_conf_select function.
> 
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 123 
> 
>  drivers/pci/controller/dwc/pcie-designware.c|  59 
>  drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
>  3 files changed, 142 insertions(+), 58 deletions(-)
> 
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 65f4792..eb851c2 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -19,12 +19,26 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>   pci_epc_linkup(epc);
>  }
>  
> -static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar,
> -int flags)
> +static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
> +{
> + unsigned int func_offset = 0;
> +
> + if (ep->ops->func_conf_select)
> + func_offset = ep->ops->func_conf_select(ep, func_no);
> +
> + return func_offset;
> +}
> +
> +static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, u8 func_no,
> +enum pci_barno bar, int flags)
>  {
>   u32 reg;
> + unsigned int func_offset = 0;
> > +	struct dw_pcie_ep *ep = &pci->ep;
> +
> + func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> - reg = PCI_BASE_ADDRESS_0 + (4 * bar);
> + reg = func_offset + PCI_BASE_ADDRESS_0 + (4 * bar);
>   dw_pcie_dbi_ro_wr_en(pci);
>   dw_pcie_writel_dbi2(pci, reg, 0x0);
>   dw_pcie_writel_dbi(pci, reg, 0x0);
> @@ -37,7 +51,12 @@ static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, 
> enum pci_barno bar,
>  
>  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>  {
> - __dw_pcie_ep_reset_bar(pci, bar, 0);
> + u8 func_no, funcs;
> +
> + funcs = pci->ep.epc->max_functions;
> +
> + for (func_no = 0; func_no < funcs; func_no++)
> + __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
>  }
>  
>  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> @@ -45,28 +64,31 @@ static int dw_pcie_ep_write_header(struct pci_epc *epc, 
> u8 func_no,
>  {
>   struct dw_pcie_ep *ep = epc_get_drvdata(epc);
>   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + unsigned int func_offset = 0;
> +
> + func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
>   dw_pcie_dbi_ro_wr_en(pci);
> - dw_pcie_writew_dbi(pci, PCI_VENDOR_ID, hdr->vendorid);
> - dw_pcie_writew_dbi(pci, PCI_DEVICE_ID, hdr->deviceid);
> - dw_pcie_writeb_dbi(pci, PCI_REVISION_ID, hdr->revid);
> - dw_pcie_writeb_dbi(pci, PCI_CLASS_PROG, hdr->progif_code);
> - dw_pcie_writew_dbi(pci, PCI_CLASS_DEVICE,
> + dw_pcie_writew_dbi(pci, func_offset + PCI_VENDOR_ID, hdr->vendorid);
> + dw_pcie_writew_dbi(pci, func_offset + PCI_DEVICE_ID, hdr->deviceid);
> + dw_pcie_writeb_dbi(pci, func_offset + PCI_REVISION_ID, hdr->revid);
> + dw_pcie_writeb_dbi(pci, func_offset + PCI_CLASS_PROG, hdr->progif_code);
> + dw_pcie_writew_dbi(pci, func_offset + PCI_CLASS_DEVICE,
>  hdr->subclass_code | hdr->baseclass_code << 8);
> - dw_pcie_writeb_dbi(pci, PCI_CACHE_LINE_SIZE,
> + dw_pcie_writeb_dbi(pci, func_offset + PCI_CACHE_LINE_SIZE,
>  hdr->cache_line_size);
> - dw_pcie_writew_dbi(pci, PCI_SUBSYSTEM_VENDOR_ID,
> + dw_pcie_writew_dbi(pci, func_offset + PCI_SUBSYSTEM_VENDOR_ID,
>  hdr->subsys_vendor_id);
> - dw_pcie_writew_dbi(pci, PCI_SUBSYSTEM_ID, hdr->subsys_id);
> - dw_pcie_writeb_dbi(pci, PCI_INTERRUPT_PIN,
> + dw_pcie_writew_dbi(pci, func_offset + PCI_SUBSYSTEM_ID, hdr->subsys_id);
> + dw_pcie_writeb_dbi(pci, func_offset + PCI_INTERRUPT_PIN,
>  hdr->interrupt_pin);
>   dw_pcie_dbi_ro_wr_dis(pci);
>  
>   return 0;
>  }
>  

[PATCH 6/6] powerpc/64s/radix: introduce options to disable use of the tlbie instruction

2019-09-02 Thread Nicholas Piggin
Introduce two options to control the use of the tlbie instruction. The
first is a boot-time option which completely disables the kernel's use
of the instruction; this is currently incompatible with the hash MMU,
KVM, and coherent accelerators.

The second is a debugfs option which can be switched at runtime and
avoids using tlbie for invalidating CPU TLBs for normal process and
kernel address mappings. Coherent accelerators are still managed with
tlbie, as will KVM partition-scope translations.

Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
basic implementation which does not attempt to make any optimisation
beyond the tlbie implementation.

This is useful for performance testing among other things. For example
in certain situations on large systems, using IPIs may be faster than
tlbie as they can be directed rather than broadcast. Later we may also
take advantage of the IPIs to do more interesting things such as trim
the mm cpumask more aggressively.

Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |   4 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   9 +
 arch/powerpc/kvm/book3s_hv.c  |   6 +
 arch/powerpc/mm/book3s64/pgtable.c|  47 +
 arch/powerpc/mm/book3s64/radix_tlb.c  | 190 --
 drivers/misc/cxl/main.c   |   4 +
 drivers/misc/ocxl/main.c  |   4 +
 7 files changed, 246 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d3cbb3ae62b6..65ae16549aa3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -860,6 +860,10 @@
disable_radix   [PPC]
Disable RADIX MMU mode on POWER9
 
+   disable_tlbie   [PPC]
+   Disable TLBIE instruction. Currently does not work
+   with KVM, with HASH MMU, or with coherent accelerators.
+
disable_cpu_apicid= [X86,APIC,SMP]
Format: 
The number of initial APIC ID for the
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index ebf572ea621e..7aa8195b6cff 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -162,4 +162,13 @@ static inline void flush_tlb_pgtable(struct mmu_gather 
*tlb, unsigned long addre
 
radix__flush_tlb_pwc(tlb, address);
 }
+
+extern bool tlbie_capable;
+extern bool tlbie_enabled;
+
+static inline bool cputlb_use_tlbie(void)
+{
+   return tlbie_enabled;
+}
+
 #endif /*  _ASM_POWERPC_BOOK3S_64_TLBFLUSH_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cde3f5a4b3e4..3cdaa2a09a19 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5462,6 +5462,12 @@ static int kvmppc_radix_possible(void)
 static int kvmppc_book3s_init_hv(void)
 {
int r;
+
+   if (!tlbie_capable) {
+   pr_err("KVM-HV: Host does not support TLBIE\n");
+   return -ENODEV;
+   }
+
/*
 * FIXME!! Do we need to check on all cpus ?
 */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 351eb78eed55..75483b40fcb1 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -469,3 +470,49 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
 
return true;
 }
+
+/*
+ * Does the CPU support tlbie?
+ */
+bool tlbie_capable __read_mostly = true;
+EXPORT_SYMBOL(tlbie_capable);
+
+/*
+ * Should tlbie be used for management of CPU TLBs, for kernel and process
+ * address spaces? tlbie may still be used for nMMU accelerators, and for KVM
+ * guest address spaces.
+ */
+bool tlbie_enabled __read_mostly = true;
+
+static int __init setup_disable_tlbie(char *str)
+{
+   if (!radix_enabled()) {
+   pr_err("disable_tlbie: Unable to disable TLBIE with Hash 
MMU.\n");
+   return 1;
+   }
+
+   tlbie_capable = false;
+   tlbie_enabled = false;
+
+return 1;
+}
+__setup("disable_tlbie", setup_disable_tlbie);
+
+static int __init pgtable_debugfs_setup(void)
+{
+   if (!tlbie_capable)
+   return 0;
+
+   /*
+* There is no locking vs tlb flushing when changing this value.
+* The tlb flushers will see one value or another, and use either
+* tlbie or tlbiel with IPIs. In both cases the TLBs will be
+* invalidated as expected.
+*/
+   debugfs_create_bool("tlbie_enabled", 0600,
+   powerpc_debugfs_root,
+   &tlbie_enabled);
+
+   return 0;
+}
+arch_initcall(pgtable_debugfs_setup);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 

[PATCH 0/6] Making tlbie optional for radix

2019-09-02 Thread Nicholas Piggin
This is a rebase of the series against the powerpc next branch
with ultravisor changes. Main improvements are implementing and
splitting out the precursor patches better.

KVM still requires tlbie to run radix guests. A naive implementation
of tlbiel + IPI for LPID flushes was crashing, so it requires more
investigation.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/64s: remove register_process_table callback
  powerpc/64s/radix: tidy up TLB flushing code
  powerpc/64s: make mmu_partition_table_set_entry TLB flush optional
  powerpc/64s/pseries: radix flush translations before MMU is enabled at
boot
  powerpc/64s: remove unnecessary translation cache flushes at boot
  powerpc/64s/radix: introduce options to disable use of the tlbie
instruction

 .../admin-guide/kernel-parameters.txt |   4 +
 arch/powerpc/include/asm/book3s/64/mmu.h  |   4 -
 .../include/asm/book3s/64/tlbflush-radix.h|  12 +-
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   9 +
 arch/powerpc/include/asm/mmu.h|   2 +-
 arch/powerpc/kvm/book3s_hv.c  |   6 +
 arch/powerpc/kvm/book3s_hv_nested.c   |   4 +-
 arch/powerpc/mm/book3s64/hash_utils.c |   8 +-
 arch/powerpc/mm/book3s64/pgtable.c|  72 -
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  45 +--
 arch/powerpc/mm/book3s64/radix_tlb.c  | 303 --
 arch/powerpc/platforms/pseries/lpar.c |  12 +-
 drivers/misc/cxl/main.c   |   4 +
 drivers/misc/ocxl/main.c  |   4 +
 14 files changed, 308 insertions(+), 181 deletions(-)

-- 
2.22.0



[PATCH 5/6] powerpc/64s: remove unnecessary translation cache flushes at boot

2019-09-02 Thread Nicholas Piggin
The various translation structure invalidations performed in early boot
when the MMU is off are not required, because everything is invalidated
immediately before a CPU first enables its MMU (see early_init_mmu
and early_init_mmu_secondary).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/hash_utils.c| 2 +-
 arch/powerpc/mm/book3s64/pgtable.c   | 5 +
 arch/powerpc/mm/book3s64/radix_pgtable.c | 8 +---
 arch/powerpc/platforms/pseries/lpar.c| 5 -
 4 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b73d08b54d12..7684a596158b 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -825,7 +825,7 @@ static void __init hash_init_partition_table(phys_addr_t 
hash_table,
 * For now, UPRT is 0 and we have no segment table.
 */
htab_size =  __ilog2(htab_size) - 18;
-   mmu_partition_table_set_entry(0, hash_table | htab_size, 0, true);
+   mmu_partition_table_set_entry(0, hash_table | htab_size, 0, false);
pr_info("Partition table %p\n", partition_tb);
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 6fab9c0bbbaf..351eb78eed55 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -252,6 +252,11 @@ void mmu_partition_table_set_entry(unsigned int lpid, 
unsigned long dw0,
pr_info("PATE registered by ultravisor: dw0 = 0x%lx, dw1 = 
0x%lx\n",
dw0, dw1);
} else if (flush) {
+   /*
+* Boot does not need to flush, because MMU is off and each
+* CPU does a tlbiel_all() before switching them on, which
+* flushes everything.
+*/
flush_partition(lpid, (old & PATB_HR));
}
 }
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index e1e711c4704a..0d1107fb34c1 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -396,13 +396,7 @@ static void __init radix_init_partition_table(void)
rts_field = radix__get_tree_size();
dw0 = rts_field | __pa(init_mm.pgd) | RADIX_PGD_INDEX_SIZE | PATB_HR;
dw1 = __pa(process_tb) | (PRTB_SIZE_SHIFT - 12) | PATB_GR;
-   mmu_partition_table_set_entry(0, dw0, dw1, true);
-
-   asm volatile("ptesync" : : : "memory");
-   asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (0));
-   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
-   trace_tlbie(0, 0, TLBIEL_INVAL_SET_LPID, 0, 2, 1, 1);
+   mmu_partition_table_set_entry(0, dw0, dw1, false);
 
pr_info("Initializing Radix MMU\n");
pr_info("Partition table %p\n", partition_tb);
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index b3205a6c950c..36b846f6e74e 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1549,11 +1549,6 @@ void radix_init_pseries(void)
 
pseries_lpar_register_process_table(__pa(process_tb),
0, PRTB_SIZE_SHIFT - 12);
-   asm volatile("ptesync" : : : "memory");
-   asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (0));
-   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
-   trace_tlbie(0, 0, TLBIEL_INVAL_SET_LPID, 0, 2, 1, 1);
 }
 
 #ifdef CONFIG_PPC_SMLPAR
-- 
2.22.0



[PATCH 4/6] powerpc/64s/pseries: radix flush translations before MMU is enabled at boot

2019-09-02 Thread Nicholas Piggin
Radix guests are responsible for managing their own translation caches,
so make them match bare metal radix and hash, and make each CPU flush
all its translations right before enabling its MMU.

Radix guests may not flush partition scope translations, so in
tlbiel_all, make these flushes conditional on CPU_FTR_HVMODE. Process
scope translations are the only type visible to the guest.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c |  6 ++
 arch/powerpc/mm/book3s64/radix_tlb.c | 12 
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 078a7eeec1f5..e1e711c4704a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -616,8 +616,7 @@ void __init radix__early_init_mmu(void)
 
/* Switch to the guard PID before turning on MMU */
	radix__switch_mmu_context(NULL, &init_mm);
-   if (cpu_has_feature(CPU_FTR_HVMODE))
-   tlbiel_all();
+   tlbiel_all();
 }
 
 void radix__early_init_mmu_secondary(void)
@@ -637,8 +636,7 @@ void radix__early_init_mmu_secondary(void)
}
 
	radix__switch_mmu_context(NULL, &init_mm);
-   if (cpu_has_feature(CPU_FTR_HVMODE))
-   tlbiel_all();
+   tlbiel_all();
 }
 
 void radix__mmu_cleanup_all(void)
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 082f90d068ee..f9cf8ae59831 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -51,11 +51,15 @@ static void tlbiel_all_isa300(unsigned int num_sets, 
unsigned int is)
 * and partition table entries. Then flush the remaining sets of the
 * TLB.
 */
-   tlbiel_radix_set_isa300(0, is, 0, RIC_FLUSH_ALL, 0);
-   for (set = 1; set < num_sets; set++)
-   tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 0);
 
-   /* Do the same for process scoped entries. */
+   if (early_cpu_has_feature(CPU_FTR_HVMODE)) {
+   /* MSR[HV] should flush partition scope translations first. */
+   tlbiel_radix_set_isa300(0, is, 0, RIC_FLUSH_ALL, 0);
+   for (set = 1; set < num_sets; set++)
+   tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 0);
+   }
+
+   /* Flush process scoped entries. */
tlbiel_radix_set_isa300(0, is, 0, RIC_FLUSH_ALL, 1);
for (set = 1; set < num_sets; set++)
tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 1);
-- 
2.22.0



[PATCH 3/6] powerpc/64s: make mmu_partition_table_set_entry TLB flush optional

2019-09-02 Thread Nicholas Piggin
No functional change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mmu.h   | 2 +-
 arch/powerpc/kvm/book3s_hv_nested.c  | 2 +-
 arch/powerpc/mm/book3s64/hash_utils.c| 2 +-
 arch/powerpc/mm/book3s64/pgtable.c   | 4 ++--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index ba94ce8c22d7..0699cfeeb8c9 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -257,7 +257,7 @@ extern void radix__mmu_cleanup_all(void);
 /* Functions for creating and updating partition table on POWER9 */
 extern void mmu_partition_table_init(void);
 extern void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
- unsigned long dw1);
+ unsigned long dw1, bool flush);
 #endif /* CONFIG_PPC64 */
 
 struct mm_struct;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index b3316da2f13e..fff90f2c3de2 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -411,7 +411,7 @@ static void kvmhv_flush_lpid(unsigned int lpid)
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
 {
if (!kvmhv_on_pseries()) {
-   mmu_partition_table_set_entry(lpid, dw0, dw1);
+   mmu_partition_table_set_entry(lpid, dw0, dw1, true);
return;
}
 
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 7aed27ea5361..b73d08b54d12 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -825,7 +825,7 @@ static void __init hash_init_partition_table(phys_addr_t 
hash_table,
 * For now, UPRT is 0 and we have no segment table.
 */
htab_size =  __ilog2(htab_size) - 18;
-   mmu_partition_table_set_entry(0, hash_table | htab_size, 0);
+   mmu_partition_table_set_entry(0, hash_table | htab_size, 0, true);
pr_info("Partition table %p\n", partition_tb);
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index c2b87c5ba50b..6fab9c0bbbaf 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -224,7 +224,7 @@ static void flush_partition(unsigned int lpid, bool radix)
 }
 
 void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
- unsigned long dw1)
+ unsigned long dw1, bool flush)
 {
unsigned long old = be64_to_cpu(partition_tb[lpid].patb0);
 
@@ -251,7 +251,7 @@ void mmu_partition_table_set_entry(unsigned int lpid, 
unsigned long dw0,
uv_register_pate(lpid, dw0, dw1);
pr_info("PATE registered by ultravisor: dw0 = 0x%lx, dw1 = 
0x%lx\n",
dw0, dw1);
-   } else {
+   } else if (flush) {
flush_partition(lpid, (old & PATB_HR));
}
 }
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 83fa7864e8f4..078a7eeec1f5 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -396,7 +396,7 @@ static void __init radix_init_partition_table(void)
rts_field = radix__get_tree_size();
dw0 = rts_field | __pa(init_mm.pgd) | RADIX_PGD_INDEX_SIZE | PATB_HR;
dw1 = __pa(process_tb) | (PRTB_SIZE_SHIFT - 12) | PATB_GR;
-   mmu_partition_table_set_entry(0, dw0, dw1);
+   mmu_partition_table_set_entry(0, dw0, dw1, true);
 
asm volatile("ptesync" : : : "memory");
asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
-- 
2.22.0



[PATCH 1/6] powerpc/64s: remove register_process_table callback

2019-09-02 Thread Nicholas Piggin
This callback is only required because the partition table init comes
before process table allocation on powernv (aka bare metal aka native).

Change the order to allocate the process table first, and remove the
callback.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  4 ---
 arch/powerpc/mm/book3s64/hash_utils.c|  6 
 arch/powerpc/mm/book3s64/pgtable.c   |  3 --
 arch/powerpc/mm/book3s64/radix_pgtable.c | 45 +++-
 arch/powerpc/platforms/pseries/lpar.c| 17 +++--
 5 files changed, 27 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..bb3deb76c951 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -206,7 +206,6 @@ extern int mmu_io_psize;
 void mmu_early_init_devtree(void);
 void hash__early_init_devtree(void);
 void radix__early_init_devtree(void);
-extern void radix_init_native(void);
 extern void hash__early_init_mmu(void);
 extern void radix__early_init_mmu(void);
 static inline void early_init_mmu(void)
@@ -238,9 +237,6 @@ static inline void setup_initial_memory_limit(phys_addr_t 
first_memblock_base,
   first_memblock_size);
 }
 
-extern int (*register_process_table)(unsigned long base, unsigned long 
page_size,
-unsigned long tbl_size);
-
 #ifdef CONFIG_PPC_PSERIES
 extern void radix_init_pseries(void);
 #else
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index fe99bba39b69..7aed27ea5361 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -859,12 +859,6 @@ static void __init htab_initialize(void)
/* Using a hypervisor which owns the htab */
htab_address = NULL;
_SDR1 = 0; 
-   /*
-* On POWER9, we need to do a H_REGISTER_PROC_TBL hcall
-* to inform the hypervisor that we wish to use the HPT.
-*/
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
-   register_process_table(0, 0, 0);
 #ifdef CONFIG_FA_DUMP
/*
 * If firmware assisted dump is active firmware preserves
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 206b43ae4000..97f3be778c79 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -23,9 +23,6 @@ EXPORT_SYMBOL(__pmd_frag_nr);
 unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
-int (*register_process_table)(unsigned long base, unsigned long page_size,
- unsigned long tbl_size);
-
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * This is called when relaxing access to a hugepage. It's also called in the 
page
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 71b649473045..83fa7864e8f4 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -34,19 +34,6 @@
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
 
-static int native_register_process_table(unsigned long base, unsigned long 
pg_sz,
-unsigned long table_size)
-{
-   unsigned long patb0, patb1;
-
-   patb0 = be64_to_cpu(partition_tb[0].patb0);
-   patb1 = base | table_size | PATB_GR;
-
-   mmu_partition_table_set_entry(0, patb0, patb1);
-
-   return 0;
-}
-
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
unsigned long region_start, unsigned long region_end)
 {
@@ -381,18 +368,8 @@ static void __init radix_init_pgtable(void)
 */
rts_field = radix__get_tree_size();
process_tb->prtb0 = cpu_to_be64(rts_field | __pa(init_mm.pgd) | 
RADIX_PGD_INDEX_SIZE);
-   /*
-* Fill in the partition table. We are suppose to use effective address
-* of process table here. But our linear mapping also enable us to use
-* physical address here.
-*/
-   register_process_table(__pa(process_tb), 0, PRTB_SIZE_SHIFT - 12);
+
pr_info("Process table %p and radix root for kernel: %p\n", process_tb, 
init_mm.pgd);
-   asm volatile("ptesync" : : : "memory");
-   asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (0));
-   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
-   trace_tlbie(0, 0, TLBIEL_INVAL_SET_LPID, 0, 2, 1, 1);
 
/*
 * The init_mm context is given the first available (non-zero) PID,
@@ -413,22 +390,24 @@ static void __init radix_init_pgtable(void)
 
 static void __init radix_init_partition_table(void)
 {
-   unsigned long rts_field, dw0;
+   unsigned long rts_field, dw0, dw1;
 

[PATCH 2/6] powerpc/64s/radix: tidy up TLB flushing code

2019-09-02 Thread Nicholas Piggin
There should be no functional changes.

- Use calls to existing radix_tlb.c functions in flush_partition.

- Rename radix__flush_tlb_lpid to radix__flush_all_lpid and similar,
  because they flush everything, matching flush_all_mm rather than
  flush_tlb_mm for the lpid.

- Remove some unused radix_tlb.c flush primitives.

Signed-off-by: Nicholas Piggin 
---
 .../include/asm/book3s/64/tlbflush-radix.h|  12 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |   2 +-
 arch/powerpc/mm/book3s64/pgtable.c|  13 +-
 arch/powerpc/mm/book3s64/radix_tlb.c  | 117 --
 4 files changed, 34 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 05147cecb8df..4ce795d30377 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -17,8 +17,8 @@ extern void radix__flush_tlb_lpid_page(unsigned int lpid,
unsigned long addr,
unsigned long page_size);
 extern void radix__flush_pwc_lpid(unsigned int lpid);
-extern void radix__flush_tlb_lpid(unsigned int lpid);
-extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
+extern void radix__flush_all_lpid(unsigned int lpid);
+extern void radix__flush_all_lpid_guest(unsigned int lpid);
 #else
 static inline void radix__tlbiel_all(unsigned int action) { WARN_ON(1); };
 static inline void radix__flush_tlb_lpid_page(unsigned int lpid,
@@ -31,11 +31,7 @@ static inline void radix__flush_pwc_lpid(unsigned int lpid)
 {
WARN_ON(1);
 }
-static inline void radix__flush_tlb_lpid(unsigned int lpid)
-{
-   WARN_ON(1);
-}
-static inline void radix__local_flush_tlb_lpid_guest(unsigned int lpid)
+static inline void radix__flush_all_lpid(unsigned int lpid)
 {
WARN_ON(1);
 }
@@ -73,6 +69,4 @@ extern void radix__flush_tlb_pwc(struct mmu_gather *tlb, 
unsigned long addr);
 extern void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long 
addr);
 extern void radix__flush_tlb_all(void);
 
-extern void radix__local_flush_tlb_lpid(unsigned int lpid);
-
 #endif
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..b3316da2f13e 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -398,7 +398,7 @@ static void kvmhv_flush_lpid(unsigned int lpid)
long rc;
 
if (!kvmhv_on_pseries()) {
-   radix__flush_tlb_lpid(lpid);
+   radix__flush_all_lpid(lpid);
return;
}
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 97f3be778c79..c2b87c5ba50b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -210,20 +210,17 @@ void __init mmu_partition_table_init(void)
 
 static void flush_partition(unsigned int lpid, bool radix)
 {
-   asm volatile("ptesync" : : : "memory");
if (radix) {
-   asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
-   asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
-   trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
+   radix__flush_all_lpid(lpid);
+   radix__flush_all_lpid_guest(lpid);
} else {
+   asm volatile("ptesync" : : : "memory");
asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :
 "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
+   /* do we need fixup here ?*/
+   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 0);
}
-   /* do we need fixup here ?*/
-   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
 }
 
 void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 71f7fede2fa4..082f90d068ee 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -116,22 +116,6 @@ static __always_inline void __tlbie_pid(unsigned long pid, 
unsigned long ric)
trace_tlbie(0, 0, rb, rs, ric, prs, r);
 }
 
-static __always_inline void __tlbiel_lpid(unsigned long lpid, int set,
-   unsigned long ric)
-{
-   unsigned long rb,rs,prs,r;
-
-   rb = PPC_BIT(52); /* IS = 2 */
-   rb |= set << PPC_BITLSHIFT(51);
-   rs = 0;  /* LPID comes from LPIDR */
-   prs = 0; /* partition scoped */
-   r = 1;   /* radix format */
-
-   asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
-: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : 
"memory");
-   

Re: microblaze HAVE_MEMBLOCK_NODE_MAP dependency (was Re: [PATCH v2 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2019-09-02 Thread Mike Rapoport
On Mon, Sep 02, 2019 at 03:51:25PM +0200, Michal Simek wrote:
> On 31. 07. 19 19:15, Mike Rapoport wrote:
> > On Wed, Jul 31, 2019 at 04:41:14PM +0200, Michal Hocko wrote:
> >> On Wed 31-07-19 17:21:29, Mike Rapoport wrote:
> >>> On Wed, Jul 31, 2019 at 03:00:37PM +0200, Michal Hocko wrote:
> 
>  I am sorry, but I still do not follow. Who is consuming that node id
>  information when NUMA=n. In other words why cannot we simply do
> >>>  
> >>> We can, I think nobody cared to change it.
> >>
> >> It would be great if somebody with the actual HW could try it out.
> >> I can throw a patch but I do not even have a cross compiler in my
> >> toolbox.
> > 
> > Well, it compiles :)
> >  
>  diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
>  index a015a951c8b7..3a47e8db8d1c 100644
>  --- a/arch/microblaze/mm/init.c
>  +++ b/arch/microblaze/mm/init.c
>  @@ -175,14 +175,9 @@ void __init setup_memory(void)
>   
>   start_pfn = memblock_region_memory_base_pfn(reg);
>   end_pfn = memblock_region_memory_end_pfn(reg);
>  -memblock_set_node(start_pfn << PAGE_SHIFT,
>  -  (end_pfn - start_pfn) << PAGE_SHIFT,
>  -  &memblock.memory, 0);
>  +memory_present(0, start_pfn << PAGE_SHIFT, end_pfn << 
>  PAGE_SHIFT);
> >>>
> >>> memory_present() expects pfns, the shift is not needed.
> >>
> >> Right.
> 
> Sorry for slow response on this. In general regarding this topic.
> Microblaze is soft core CPU (now there are hardcore versions too but not
> running Linux). I believe there could be Numa system with
> microblaze/microblazes (SMP is not supported in mainline).
> 
> This code was added in 2011 which is pretty hard to remember why it was
> done in this way.
> 
> It compiles but not working on HW. Please take a look at log below.
> 
> Thanks,
> Michal
> 
> 
> [0.00] Linux version 5.3.0-rc6-7-g54b01939182f-dirty
> (monstr@monstr-desktop3) (gcc version 8.2.0 (crosstool-NG 1.20.0)) #101
> Mon Sep 2 15:44:05 CEST 2019
> [0.00] setup_memory: max_mapnr: 0x4
> [0.00] setup_memory: min_low_pfn: 0x8
> [0.00] setup_memory: max_low_pfn: 0xb
> [0.00] setup_memory: max_pfn: 0xc
> [0.00] start pfn 0x8
> [0.00] end pfn 0xc
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x8000-0xafff]
> [0.00]   Normal   empty
> [0.00]   HighMem  [mem 0xb000-0xbfff]
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   1: [mem 0x8000-0xbfff]
> [0.00] Could not find start_pfn for node 0
> [0.00] Initmem setup node 0 [mem
> 0x-0x]

This does not look good :)

I think the problem is that without an explicit call to memblock_set_node()
the ->nid in memblock is MAX_NUMNODES but free_area_init_nodes() presumes
actual node ids are properly set.

> [0.00] earlycon: ns16550a0 at MMIO 0x44a01000 (options '115200n8')
> [0.00] printk: bootconsole [ns16550a0] enabled
> [0.00] setup_cpuinfo: initialising
> [0.00] setup_cpuinfo: Using full CPU PVR support
> [0.00] wt_msr_noirq
> [0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
> [0.00] pcpu-alloc: [0] 0
> [0.00] Built 1 zonelists, mobility grouping off.  Total pages: 0
> [0.00] Kernel command line: earlycon
> [0.00] Dentry cache hash table entries: -2147483648 (order: -13,
> 0 bytes, linear)
> [0.00] Inode-cache hash table entries: -2147483648 (order: -13,
> 0 bytes, linear)
> [0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
> [0.00] Oops: kernel access of bad area, sig: 11
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted
> 5.3.0-rc6-7-g54b01939182f-dirty #101
> [0.00]  Registers dump: mode=805B9EA8
> [0.00]  r1=65A0, r2=C05B7AE6, r3=, r4=
> [0.00]  r5=0008, r6=00080B50, r7=, r8=0004
> [0.00]  r9=, r10=001F, r11=, r12=
> [0.00]  r13=4119DCC0, r14=, r15=C05EFF8C, r16=
> [0.00]  r17=C0604408, r18=FFFC, r19=C05B9F6C, r20=BFFEC168
> [0.00]  r21=BFFEC168, r22=EFFF9AC0, r23=0001, r24=C0606874
> [0.00]  r25=BFE6B74C, r26=8000, r27=, r28=9040
> [0.00]  r29=0100, r30=0380, r31=C05C02F0, rPC=C0604408
> [0.00]  msr=46A0, ear=0004, esr=0D12, fsr=
> [0.00] Oops: kernel access of bad area, sig: 11
> 
> 
> -- 
> Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
> w: www.monstr.eu p: +42-0-721842854
> Maintainer of Linux kernel - Xilinx Microblaze
> Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP 

Re: [PATCH v2 7/9] sparc64: numa: check the node id consistently for sparc64

2019-09-02 Thread David Miller
From: Yunsheng Lin 
Date: Mon, 2 Sep 2019 14:08:31 +0800

> The NUMA node id in sparc64 system is defined by DT semantics?

Sometimes, and in other cases other methods are used to determine
the NUMA node id.


Re: [PATCH v3 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:09AM +0800, Xiaowei Bao wrote:
> Each PF of EP device should have it's own MSI or MSIX capabitily
> struct, so create a dw_pcie_ep_func struct and remover the msi_cap

remover?

> and msix_cap to this struce, and manage the PFs with a list.

struce?

> 
> Signed-off-by: Xiaowei Bao 
> ---
> v1:
>  - This is a new patch, to fix the issue of MSI and MSIX CAP way of
>finding.
> v2:
>  - No change.
> v3:
>  - No change.

This makes it look like you introduced the patch in v1 and haven't
changed it since.

I think it's more common to have a history like this:

---
v3:
 - Introduced new patch, to fix the issue of MSI and MSIX CAP way of
   finding.


> 
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 135 
> +---
>  drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
>  2 files changed, 134 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c 
> b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index c3bc7bd..144eb12 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>   pci_epc_linkup(epc);
>  }
>  
> +struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
> +{
> + struct dw_pcie_ep_func *ep_func;
> +
> + list_for_each_entry(ep_func, &ep->func_list, list) {
> + if (ep_func->func_no == func_no)
> + return ep_func;
> + }
> +
> + return NULL;
> +}
> +
>  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
>  {
>   unsigned int func_offset = 0;
> @@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum 
> pci_barno bar)
>   __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
>  }
>  
> +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8 func_no,
> + u8 cap_ptr, u8 cap)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + unsigned int func_offset = 0;
> + u8 cap_id, next_cap_ptr;
> + u16 reg;
> +
> + if (!cap_ptr)
> + return 0;
> +
> + func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> + reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> + cap_id = (reg & 0x00ff);
> +
> + if (cap_id > PCI_CAP_ID_MAX)
> + return 0;
> +
> + if (cap_id == cap)
> + return cap_ptr;
> +
> + next_cap_ptr = (reg & 0xff00) >> 8;
> + return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}

Which tree have you based this patchset on? v5.3-rc3 and pci/dwc both already
have this function (without the func_no). See beb4641a787d
("PCI: dwc: Add MSI-X callbacks handler").

> +
> +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8 func_no, u8 
> cap)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + unsigned int func_offset = 0;
> + u8 next_cap_ptr;
> + u16 reg;
> +
> + func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> + reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> + next_cap_ptr = (reg & 0x00ff);
> +
> + return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}
> +
>  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
>  struct pci_epf_header *hdr)
>  {
> @@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc *epc, u8 
> func_no)
>   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>   u32 val, reg;
>   unsigned int func_offset = 0;
> + struct dw_pcie_ep_func *ep_func;
>  
> - if (!ep->msi_cap)
> + ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> + if (!ep_func)
> + return -EINVAL;
> +
> + if (!ep_func->msi_cap)
>   return -EINVAL;
>  
>   func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> - reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> + reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>   val = dw_pcie_readw_dbi(pci, reg);
>   if (!(val & PCI_MSI_FLAGS_ENABLE))
>   return -EINVAL;
> @@ -268,13 +327,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc *epc, u8 
> func_no, u8 interrupts)
>   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>   u32 val, reg;
>   unsigned int func_offset = 0;
> + struct dw_pcie_ep_func *ep_func;
> +
> + ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> + if (!ep_func)
> + return -EINVAL;
>  
> - if (!ep->msi_cap)
> + if (!ep_func->msi_cap)
>   return -EINVAL;
>  
>   func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> - reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> + reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>   val = dw_pcie_readw_dbi(pci, reg);
>   val &= ~PCI_MSI_FLAGS_QMASK;
>   val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK;
> @@ -291,13 

Re: [PATCH v6 1/5] kasan: support backing vmalloc space with real shadow memory

2019-09-02 Thread Mark Rutland
On Tue, Sep 03, 2019 at 12:32:49AM +1000, Daniel Axtens wrote:
> Hi Mark,
> 
> >> +static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> >> +  void *unused)
> >> +{
> >> +  unsigned long page;
> >> +
> >> +  page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
> >> +
> >> +  spin_lock(&init_mm.page_table_lock);
> >> +
> >> +  if (likely(!pte_none(*ptep))) {
> >> +  pte_clear(&init_mm, addr, ptep);
> >> +  free_page(page);
> >> +  }
> >> +  spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +  return 0;
> >> +}
> >
> > There needs to be TLB maintenance after unmapping the page, but I don't
> > see that happening below.
> >
> > We need that to ensure that errant accesses don't hit the page we're
> > freeing and that new mappings at the same VA don't cause a TLB conflict
> > or TLB amalgamation issue.
> 
> Darn it, I knew there was something I forgot to do! I thought of that
> over the weekend, didn't write it down, and then forgot it when I went
> to respin the patches. You're totally right.
> 
> >
> >> +/*
> >> + * Release the backing for the vmalloc region [start, end), which
> >> + * lies within the free region [free_region_start, free_region_end).
> >> + *
> >> + * This can be run lazily, long after the region was freed. It runs
> >> + * under vmap_area_lock, so it's not safe to interact with the 
> >> vmalloc/vmap
> >> + * infrastructure.
> >> + */
> >
> > IIUC we aim to only free non-shared shadow by aligning the start
> > upwards, and aligning the end downwards. I think it would be worth
> > mentioning that explicitly in the comment since otherwise it's not
> > obvious how we handle races between alloc/free.
> >
> 
> Oh, I will need to think through that more carefully.
> 
> I think the vmap_area_lock protects us against alloc/free races.

AFAICT, on the alloc side we only hold the vmap_area_lock while
allocating the area in __get_vm_area_node(), but we don't hold the
vmap_area_lock while we populate the page tables for the shadow in
kasan_populate_vmalloc().

So I believe that kasan_populate_vmalloc() can race with
kasan_release_vmalloc().

> I think alignment operates at least somewhat as you've described, and
> while it is important for correctness, I'm not sure I'd say it
> prevented races? I will double check my understanding of
> vmap_area_lock, and I agree the comment needs to be much clearer.

I had assumed that you were trying to only free pages which were
definitely not shared (for which there couldn't possibly be a race to
allocate), by looking at the sibling areas to see if they potentially
overlapped.

Was that not the case?

Thanks,
Mark.


Re: [PATCH v6 1/5] kasan: support backing vmalloc space with real shadow memory

2019-09-02 Thread Daniel Axtens
Hi Mark,

>> +static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>> +void *unused)
>> +{
>> +unsigned long page;
>> +
>> +page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
>> +
>> +spin_lock(&init_mm.page_table_lock);
>> +
>> +if (likely(!pte_none(*ptep))) {
>> +pte_clear(&init_mm, addr, ptep);
>> +free_page(page);
>> +}
>> +spin_unlock(&init_mm.page_table_lock);
>> +
>> +return 0;
>> +}
>
> There needs to be TLB maintenance after unmapping the page, but I don't
> see that happening below.
>
> We need that to ensure that errant accesses don't hit the page we're
> freeing and that new mappings at the same VA don't cause a TLB conflict
> or TLB amalgamation issue.

Darn it, I knew there was something I forgot to do! I thought of that
over the weekend, didn't write it down, and then forgot it when I went
to respin the patches. You're totally right.

>
>> +/*
>> + * Release the backing for the vmalloc region [start, end), which
>> + * lies within the free region [free_region_start, free_region_end).
>> + *
>> + * This can be run lazily, long after the region was freed. It runs
>> + * under vmap_area_lock, so it's not safe to interact with the vmalloc/vmap
>> + * infrastructure.
>> + */
>
> IIUC we aim to only free non-shared shadow by aligning the start
> upwards, and aligning the end downwards. I think it would be worth
> mentioning that explicitly in the comment since otherwise it's not
> obvious how we handle races between alloc/free.
>

Oh, I will need to think through that more carefully.

I think the vmap_area_lock protects us against alloc/free races. I think
alignment operates at least somewhat as you've described, and while it
is important for correctness, I'm not sure I'd say it prevented races? I
will double check my understanding of vmap_area_lock, and I agree the
comment needs to be much clearer.

Once again, thanks for your patience and thoughtful review.

Regards,
Daniel

> Thanks,
> Mark.
>
>> +void kasan_release_vmalloc(unsigned long start, unsigned long end,
>> +   unsigned long free_region_start,
>> +   unsigned long free_region_end)
>> +{
>> +void *shadow_start, *shadow_end;
>> +unsigned long region_start, region_end;
>> +
>> +/* we start with shadow entirely covered by this region */
>> +region_start = ALIGN(start, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
>> +region_end = ALIGN_DOWN(end, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
>> +
>> +/*
>> + * We don't want to extend the region we release to the entire free
>> + * region, as the free region might cover huge chunks of vmalloc space
>> + * where we never allocated anything. We just want to see if we can
>> + * extend the [start, end) range: if start or end fall part way through
>> + * a shadow page, we want to check if we can free that entire page.
>> + */
>> +
>> +free_region_start = ALIGN(free_region_start,
>> +  PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
>> +
>> +if (start != region_start &&
>> +free_region_start < region_start)
>> +region_start -= PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE;
>> +
>> +free_region_end = ALIGN_DOWN(free_region_end,
>> + PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
>> +
>> +if (end != region_end &&
>> +free_region_end > region_end)
>> +region_end += PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE;
>> +
>> +shadow_start = kasan_mem_to_shadow((void *)region_start);
>> +shadow_end = kasan_mem_to_shadow((void *)region_end);
>> +
>> +if (shadow_end > shadow_start)
>> +apply_to_page_range(&init_mm, (unsigned long)shadow_start,
>> +(unsigned long)(shadow_end - shadow_start),
>> +kasan_depopulate_vmalloc_pte, NULL);
>> +}


Re: microblaze HAVE_MEMBLOCK_NODE_MAP dependency (was Re: [PATCH v2 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2019-09-02 Thread Michal Simek
On 31. 07. 19 19:15, Mike Rapoport wrote:
> On Wed, Jul 31, 2019 at 04:41:14PM +0200, Michal Hocko wrote:
>> On Wed 31-07-19 17:21:29, Mike Rapoport wrote:
>>> On Wed, Jul 31, 2019 at 03:00:37PM +0200, Michal Hocko wrote:

 I am sorry, but I still do not follow. Who is consuming that node id
 information when NUMA=n. In other words why cannot we simply do
>>>  
>>> We can, I think nobody cared to change it.
>>
>> It would be great if somebody with the actual HW could try it out.
>> I can throw a patch but I do not even have a cross compiler in my
>> toolbox.
> 
> Well, it compiles :)
>  
 diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
 index a015a951c8b7..3a47e8db8d1c 100644
 --- a/arch/microblaze/mm/init.c
 +++ b/arch/microblaze/mm/init.c
 @@ -175,14 +175,9 @@ void __init setup_memory(void)
  
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
 -  memblock_set_node(start_pfn << PAGE_SHIFT,
 -(end_pfn - start_pfn) << PAGE_SHIFT,
 -  &memblock.memory, 0);
 +  memory_present(0, start_pfn << PAGE_SHIFT, end_pfn << 
 PAGE_SHIFT);
>>>
>>> memory_present() expects pfns, the shift is not needed.
>>
>> Right.

Sorry for the slow response on this. In general, regarding this topic:
Microblaze is a soft-core CPU (there are now hard-core versions too, but
they do not run Linux). I believe there could be a NUMA system with one
or more Microblazes (SMP is not supported in mainline).

This code was added in 2011, so it is pretty hard to remember why it was
done this way.

It compiles but is not working on HW. Please take a look at the log below.

Thanks,
Michal


[0.00] Linux version 5.3.0-rc6-7-g54b01939182f-dirty
(monstr@monstr-desktop3) (gcc version 8.2.0 (crosstool-NG 1.20.0)) #101
Mon Sep 2 15:44:05 CEST 2019
[0.00] setup_memory: max_mapnr: 0x4
[0.00] setup_memory: min_low_pfn: 0x8
[0.00] setup_memory: max_low_pfn: 0xb
[0.00] setup_memory: max_pfn: 0xc
[0.00] start pfn 0x8
[0.00] end pfn 0xc
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x8000-0xafff]
[0.00]   Normal   empty
[0.00]   HighMem  [mem 0xb000-0xbfff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   1: [mem 0x8000-0xbfff]
[0.00] Could not find start_pfn for node 0
[0.00] Initmem setup node 0 [mem
0x-0x]
[0.00] earlycon: ns16550a0 at MMIO 0x44a01000 (options '115200n8')
[0.00] printk: bootconsole [ns16550a0] enabled
[0.00] setup_cpuinfo: initialising
[0.00] setup_cpuinfo: Using full CPU PVR support
[0.00] wt_msr_noirq
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists, mobility grouping off.  Total pages: 0
[0.00] Kernel command line: earlycon
[0.00] Dentry cache hash table entries: -2147483648 (order: -13,
0 bytes, linear)
[0.00] Inode-cache hash table entries: -2147483648 (order: -13,
0 bytes, linear)
[0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
[0.00] Oops: kernel access of bad area, sig: 11
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted
5.3.0-rc6-7-g54b01939182f-dirty #101
[0.00]  Registers dump: mode=805B9EA8
[0.00]  r1=65A0, r2=C05B7AE6, r3=, r4=
[0.00]  r5=0008, r6=00080B50, r7=, r8=0004
[0.00]  r9=, r10=001F, r11=, r12=
[0.00]  r13=4119DCC0, r14=, r15=C05EFF8C, r16=
[0.00]  r17=C0604408, r18=FFFC, r19=C05B9F6C, r20=BFFEC168
[0.00]  r21=BFFEC168, r22=EFFF9AC0, r23=0001, r24=C0606874
[0.00]  r25=BFE6B74C, r26=8000, r27=, r28=9040
[0.00]  r29=0100, r30=0380, r31=C05C02F0, rPC=C0604408
[0.00]  msr=46A0, ear=0004, esr=0D12, fsr=
[0.00] Oops: kernel access of bad area, sig: 11


-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Xilinx Microblaze
Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP ARM64 SoCs
U-Boot custodian - Xilinx Microblaze/Zynq/ZynqMP/Versal SoCs






Re: [PATCH v3 07/11] PCI: layerscape: Modify the way of getting capability with different PEX

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:12AM +0800, Xiaowei Bao wrote:
> The different PCIe controller in one board may be have different
> capability of MSI or MSIX, so change the way of getting the MSI
> capability, make it more flexible.
> 
> Signed-off-by: Xiaowei Bao 

Please see the comments I just made to Kishon's feedback in the thread for
this patch in series v2.

Thanks,

Andrew Murray

> ---
> v2:
>  - Remove the repeated assignment code.
> v3:
>  - Use ep_func msi_cap and msix_cap to decide the msi_capable and
>msix_capable of pci_epc_features struct.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 31 
> +++---
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
> b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> index a9c552e..1e07287 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> @@ -22,6 +22,7 @@
>  
>  struct ls_pcie_ep {
>   struct dw_pcie  *pci;
> + struct pci_epc_features *ls_epc;
>  };
>  
>  #define to_ls_pcie_ep(x) dev_get_drvdata((x)->dev)
> @@ -40,26 +41,31 @@ static const struct of_device_id ls_pcie_ep_of_match[] = {
>   { },
>  };
>  
> -static const struct pci_epc_features ls_pcie_epc_features = {
> - .linkup_notifier = false,
> - .msi_capable = true,
> - .msix_capable = false,
> - .bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
> -};
> -
>  static const struct pci_epc_features*
>  ls_pcie_ep_get_features(struct dw_pcie_ep *ep)
>  {
> - return &ls_pcie_epc_features;
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> +
> + return pcie->ls_epc;
>  }
>  
>  static void ls_pcie_ep_init(struct dw_pcie_ep *ep)
>  {
>   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> + struct dw_pcie_ep_func *ep_func;
>   enum pci_barno bar;
>  
> + ep_func = dw_pcie_ep_get_func_from_ep(ep, 0);
> + if (!ep_func)
> + return;
> +
>   for (bar = BAR_0; bar <= BAR_5; bar++)
>   dw_pcie_ep_reset_bar(pci, bar);
> +
> + pcie->ls_epc->msi_capable = ep_func->msi_cap ? true : false;
> + pcie->ls_epc->msix_capable = ep_func->msix_cap ? true : false;
>  }
>  
>  static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
> @@ -119,6 +125,7 @@ static int __init ls_pcie_ep_probe(struct platform_device 
> *pdev)
> >   struct device *dev = &pdev->dev;
>   struct dw_pcie *pci;
>   struct ls_pcie_ep *pcie;
> + struct pci_epc_features *ls_epc;
>   struct resource *dbi_base;
>   int ret;
>  
> @@ -130,6 +137,10 @@ static int __init ls_pcie_ep_probe(struct 
> platform_device *pdev)
>   if (!pci)
>   return -ENOMEM;
>  
> + ls_epc = devm_kzalloc(dev, sizeof(*ls_epc), GFP_KERNEL);
> + if (!ls_epc)
> + return -ENOMEM;
> +
>   dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
>   pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
>   if (IS_ERR(pci->dbi_base))
> @@ -140,6 +151,10 @@ static int __init ls_pcie_ep_probe(struct 
> platform_device *pdev)
>   pci->ops = _pcie_ep_ops;
>   pcie->pci = pci;
>  
> + ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
> +
> + pcie->ls_epc = ls_epc;
> +
>   platform_set_drvdata(pdev, pcie);
>  
>   ret = ls_add_pcie_ep(pcie, pdev);
> -- 
> 2.9.5
> 


Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of getting capability with different PEX

2019-09-02 Thread Andrew Murray
On Fri, Aug 23, 2019 at 04:13:30AM +, Xiaowei Bao wrote:
> 
> 
> > -Original Message-
> > From: Kishon Vijay Abraham I 
> > Sent: 2019年8月23日 11:40
> > To: Xiaowei Bao ; bhelg...@google.com;
> > robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li
> > ; lorenzo.pieral...@arm.co
> > ; a...@arndb.de; gre...@linuxfoundation.org;
> > M.h. Lian ; Mingkai Hu ;
> > Roy Zang ; jingooh...@gmail.com;
> > gustavo.pimen...@synopsys.com; linux-...@vger.kernel.org;
> > devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org;
> > andrew.mur...@arm.com
> > Subject: Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of getting
> > capability with different PEX
> > 
> > Hi,
> > 
> > (Fixed Lorenzo's email address. All the patches in the series have wrong 
> > email
> > id)
> > 
> > On 23/08/19 8:09 AM, Xiaowei Bao wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Kishon Vijay Abraham I 
> > >> Sent: 2019年8月22日 19:44
> > >> To: Xiaowei Bao ; bhelg...@google.com;
> > >> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo
> > Li
> > >> ; lorenzo.pieral...@arm.co; a...@arndb.de;
> > >> gre...@linuxfoundation.org; M.h. Lian ;
> > >> Mingkai Hu ; Roy Zang ;
> > >> jingooh...@gmail.com; gustavo.pimen...@synopsys.com;
> > >> linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > >> linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> > >> linuxppc-dev@lists.ozlabs.org; andrew.mur...@arm.com
> > >> Subject: Re: [PATCH v2 06/10] PCI: layerscape: Modify the way of
> > >> getting capability with different PEX
> > >>
> > >> Hi,
> > >>
> > >> On 22/08/19 4:52 PM, Xiaowei Bao wrote:
> > >>> The different PCIe controller in one board may be have different
> > >>> capability of MSI or MSIX, so change the way of getting the MSI
> > >>> capability, make it more flexible.
> > >>
> > >> please use different pci_epc_features table for different boards.
> > > Thanks, I think that it will be more flexible to dynamically get MSI
> > > or MSIX capability, Thus, we will not need to define the pci_epc_feature 
> > > for
> > different boards.
> > 
> > Is the restriction because you cannot have different compatible for 
> > different
> > boards?
> Sorry, I am not very clear what your mean, I think even if I use the same 
> compatible
> with different boards, each boards will enter the probe function, in there I 
> will get
> the MSI or MSIX PCIe capability of the current controller in this board. Why 
> do I need
> to define the pci_epc_feature for different boards? 

At present you determine how to set the [msi,msix]_capable flags of
pci_epc_features based on reading the function capabilities at probe
time. Instead of doing this, is it possible that you can determine the flags
based on the compatible type alone? For example, is the MSI/MSIX capability
the same for all fsl,ls2088a-pcie-ep devices?

If it isn't *necessary* to probe for this information at probe time, then
you could instead create a static pci_epc_features structure and assign
it to something in your drvdata. This may provide some benefits.

The dw_pcie_ep_get_features function would then look like:

static const struct pci_epc_features*
ls_pcie_ep_get_features(struct dw_pcie_ep *ep)
{
struct dw_pcie *pci = to_dw_pcie_from_pp(ep);
struct ls_pcie_ep *pcie = dev_get_drvdata(pci->dev);
return pcie->epc_features;
}

This also means you can revert "[v3,03/11] PCI: designware-ep: Move the".

Is this what you had in mind Kishon?

Thanks,

Andrew Murray

> > 
> > Thanks
> > Kishon
> > 
> > >>
> > >> Thanks
> > >> Kishon
> > >>>
> > >>> Signed-off-by: Xiaowei Bao 
> > >>> ---
> > >>> v2:
> > >>>  - Remove the repeated assignment code.
> > >>>
> > >>>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 26
> > >>> +++---
> > >>>  1 file changed, 19 insertions(+), 7 deletions(-)
> > >>>
> > >>> diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > >>> b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > >>> index 4e92a95..8461f62 100644
> > >>> --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > >>> +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> > >>> @@ -22,6 +22,7 @@
> > >>>
> > >>>  struct ls_pcie_ep {
> > >>> struct dw_pcie  *pci;
> > >>> +   struct pci_epc_features *ls_epc;
> > >>>  };
> > >>>
> > >>>  #define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
> > >>> @@ -40,25 +41,26 @@ static const struct of_device_id
> > >> ls_pcie_ep_of_match[] = {
> > >>> { },
> > >>>  };
> > >>>
> > >>> -static const struct pci_epc_features ls_pcie_epc_features = {
> > >>> -   .linkup_notifier = false,
> > >>> -   .msi_capable = true,
> > >>> -   .msix_capable = false,
> > >>> -};
> > >>> -
> > >>>  static const struct pci_epc_features*
> > >>> ls_pcie_ep_get_features(struct dw_pcie_ep *ep)  {
> > >>> -   return &ls_pcie_epc_features;
> > >>> +   

Re: [PATCH v6 1/5] kasan: support backing vmalloc space with real shadow memory

2019-09-02 Thread Mark Rutland
On Mon, Sep 02, 2019 at 09:20:24PM +1000, Daniel Axtens wrote:
> Hook into vmalloc and vmap, and dynamically allocate real shadow
> memory to back the mappings.
> 
> Most mappings in vmalloc space are small, requiring less than a full
> page of shadow space. Allocating a full shadow page per mapping would
> therefore be wasteful. Furthermore, to ensure that different mappings
> use different shadow pages, mappings would have to be aligned to
> KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
> 
> Instead, share backing space across multiple mappings. Allocate a
> backing page when a mapping in vmalloc space uses a particular page of
> the shadow region. This page can be shared by other vmalloc mappings
> later on.
> 
> We hook in to the vmap infrastructure to lazily clean up unused shadow
> memory.
> 
> To avoid the difficulties around swapping mappings around, this code
> expects that the part of the shadow region that covers the vmalloc
> space will not be covered by the early shadow page, but will be left
> unmapped. This will require changes in arch-specific code.
> 
> This allows KASAN with VMAP_STACK, and may be helpful for architectures
> that do not have a separate module space (e.g. powerpc64, which I am
> currently working on). It also allows relaxing the module alignment
> back to PAGE_SIZE.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009
> Acked-by: Vasily Gorbik 
> Signed-off-by: Daniel Axtens 
> [Mark: rework shadow allocation]
> Signed-off-by: Mark Rutland 
> 
> --
> 
> v2: let kasan_unpoison_shadow deal with ranges that do not use a
> full shadow byte.
> 
> v3: relax module alignment
> rename to kasan_populate_vmalloc which is a much better name
> deal with concurrency correctly
> 
> v4: Mark's rework
> Poision pages on vfree
> Handle allocation failures
> 
> v5: Per Christophe Leroy, split out test and dynamically free pages.
> 
> v6: Guard freeing page properly. Drop WARN_ON_ONCE(pte_none(*ptep)),
>  on reflection it's unnecessary debugging cruft with too high a
>  false positive rate.
> ---

[...]

> +static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> + void *unused)
> +{
> + unsigned long page;
> +
> + page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
> +
> + spin_lock(&init_mm.page_table_lock);
> +
> + if (likely(!pte_none(*ptep))) {
> + pte_clear(&init_mm, addr, ptep);
> + free_page(page);
> + }
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}

There needs to be TLB maintenance after unmapping the page, but I don't
see that happening below.

We need that to ensure that errant accesses don't hit the page we're
freeing and that new mappings at the same VA don't cause a TLB conflict
or TLB amalgamation issue.

> +/*
> + * Release the backing for the vmalloc region [start, end), which
> + * lies within the free region [free_region_start, free_region_end).
> + *
> + * This can be run lazily, long after the region was freed. It runs
> + * under vmap_area_lock, so it's not safe to interact with the vmalloc/vmap
> + * infrastructure.
> + */

IIUC we aim to only free non-shared shadow by aligning the start
upwards, and aligning the end downwards. I think it would be worth
mentioning that explicitly in the comment since otherwise it's not
obvious how we handle races between alloc/free.

Thanks,
Mark.

> +void kasan_release_vmalloc(unsigned long start, unsigned long end,
> +unsigned long free_region_start,
> +unsigned long free_region_end)
> +{
> + void *shadow_start, *shadow_end;
> + unsigned long region_start, region_end;
> +
> + /* we start with shadow entirely covered by this region */
> + region_start = ALIGN(start, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
> + region_end = ALIGN_DOWN(end, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
> +
> + /*
> +  * We don't want to extend the region we release to the entire free
> +  * region, as the free region might cover huge chunks of vmalloc space
> +  * where we never allocated anything. We just want to see if we can
> +  * extend the [start, end) range: if start or end fall part way through
> +  * a shadow page, we want to check if we can free that entire page.
> +  */
> +
> + free_region_start = ALIGN(free_region_start,
> +   PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
> +
> + if (start != region_start &&
> + free_region_start < region_start)
> + region_start -= PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE;
> +
> + free_region_end = ALIGN_DOWN(free_region_end,
> +  PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE);
> +
> + if (end != region_end &&
> + free_region_end > region_end)
> + region_end += PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE;
> +
> + shadow_start = kasan_mem_to_shadow((void *)region_start);
> + shadow_end = 

Re: [PATCH v3 10/11] arm64: dts: layerscape: Add PCIe EP node for ls1088a

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:15AM +0800, Xiaowei Bao wrote:
> Add PCIe EP node for ls1088a to support EP mode.
> 
> Signed-off-by: Xiaowei Bao 
> ---
> v2:
>  - Remove the pf-offset proparty.
> v3:
>  - No change.
>  
>  arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi | 31 
> ++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi 
> b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> index c676d07..da246ab 100644
> --- a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> +++ b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> @@ -483,6 +483,17 @@
>   status = "disabled";
>   };
>  
> + pcie_ep@3400000 {
> + compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";

Here you specify a fallback "fsl,ls-pcie-ep" that is removed by this series.

Besides that, this looks OK.

Thanks,

Andrew Murray

> + reg = <0x00 0x03400000 0x0 0x00100000
> +0x20 0x00000000 0x8 0x00000000>;
> + reg-names = "regs", "addr_space";
> + num-ib-windows = <24>;
> + num-ob-windows = <128>;
> + max-functions = /bits/ 8 <2>;
> + status = "disabled";
> + };
> +
>   pcie@3500000 {
>   compatible = "fsl,ls1088a-pcie";
>   reg = <0x00 0x03500000 0x0 0x00100000   /* controller 
> registers */
> @@ -508,6 +519,16 @@
>   status = "disabled";
>   };
>  
> + pcie_ep@3500000 {
> + compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> + reg = <0x00 0x03500000 0x0 0x00100000
> +0x28 0x00000000 0x8 0x00000000>;
> + reg-names = "regs", "addr_space";
> + num-ib-windows = <6>;
> + num-ob-windows = <8>;
> + status = "disabled";
> + };
> +
>   pcie@3600000 {
>   compatible = "fsl,ls1088a-pcie";
>   reg = <0x00 0x03600000 0x0 0x00100000   /* controller 
> registers */
> @@ -533,6 +554,16 @@
>   status = "disabled";
>   };
>  
> + pcie_ep@3600000 {
> + compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> + reg = <0x00 0x03600000 0x0 0x00100000
> +0x30 0x00000000 0x8 0x00000000>;
> + reg-names = "regs", "addr_space";
> + num-ib-windows = <6>;
> + num-ob-windows = <8>;
> + status = "disabled";
> + };
> +
>   smmu: iommu@5000000 {
>   compatible = "arm,mmu-500";
>   reg = <0 0x5000000 0 0x800000>;
> -- 
> 2.9.5
> 


Re: [PATCH v7 5/6] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2019-09-02 Thread Segher Boessenkool
On Mon, Sep 02, 2019 at 12:03:12PM +1000, Michael Ellerman wrote:
> Michal Suchanek  writes:
> > On bigendian ppc64 it is common to have 32bit legacy binaries but much
> > less so on littleendian.
> 
> I think the toolchain people will tell you that there is no 32-bit
> little endian ABI defined at all, if anything works it's by accident.

There of course is a lot of powerpcle-* support.  The ABI used for it
on linux is the SYSV ABI, just like on BE 32-bit.

There also is specific powerpcle-linux support in GCC, and in binutils,
too.  Also, config.guess/config.sub supports it.  Half a year ago this
all built fine (no, I don't test it often either).

I don't think glibc supports it though, so I wonder if anyone builds an
actual system with it?  Maybe busybox or the like?

> So I think we should not make this selectable, unless someone puts their
> hand up to say they want it and are willing to test it and keep it
> working.

What about actual 32-bit LE systems?  Does anyone still use those?


Segher


Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Peter Zijlstra
On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
> On 2019/9/2 15:25, Peter Zijlstra wrote:
> > On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> >> On 2019/9/1 0:12, Peter Zijlstra wrote:
> > 
> >>> 1) because even it is not set, the device really does belong to a node.
> >>> It is impossible a device will have magic uniform access to memory when
> >>> CPUs cannot.
> >>
> >> So it means dev_to_node() will return either NUMA_NO_NODE or a
> >> valid node id?
> > 
> > NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> > said, not a valid device location on a NUMA system.
> > 
> > Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> > node association. It just means we don't know and might have to guess.
> 
> How do we guess the device's location when ACPI/BIOS does not set it?

See device_add(), it looks to the device's parent and on NO_NODE, puts
it there.

Lacking any hints, just stick it to node0 and print a FW_BUG or
something.

> It seems dev_to_node() does not do anything about that and leave the
> job to the caller or whatever function that get called with its return
> value, such as cpumask_of_node().

Well, dev_to_node() doesn't do anything; nor should it. It is the
callers of set_dev_node() that should be taking care.

Also note how device_add() sets the device node to the parent device's
node on NUMA_NO_NODE. Arguably we should change it to complain when it
finds NUMA_NO_NODE and !parent.

---
 drivers/base/core.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index f0dd8e38fee3..2caf204966a0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
dev->kobj.parent = kobj;
 
/* use parent numa_node */
-   if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
-   set_dev_node(dev, dev_to_node(parent));
+   if (dev_to_node(dev) == NUMA_NO_NODE) {
+   if (parent)
+   set_dev_node(dev, dev_to_node(parent));
+#ifdef CONFIG_NUMA
+   else {
+   pr_err("device: '%s': has no assigned NUMA node\n", 
dev_name(dev));
+   set_dev_node(dev, 0);
+   }
+#endif
+   }
 
/* first, register with generic layer. */
/* we require the name to be set before, and pass NULL */


Re: [PATCH v3 11/11] misc: pci_endpoint_test: Add LS1088a in pci_device_id table

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:16AM +0800, Xiaowei Bao wrote:
> Add LS1088a in pci_device_id table so that pci-epf-test can be used
> for testing PCIe EP in LS1088a.
> 
> Signed-off-by: Xiaowei Bao 
> ---
> v2:
>  - No change.
> v3:
>  - No change.
>  
>  drivers/misc/pci_endpoint_test.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/misc/pci_endpoint_test.c 
> b/drivers/misc/pci_endpoint_test.c
> index 6e208a0..d531951 100644
> --- a/drivers/misc/pci_endpoint_test.c
> +++ b/drivers/misc/pci_endpoint_test.c
> @@ -793,6 +793,7 @@ static const struct pci_device_id pci_endpoint_test_tbl[] 
> = {
>   { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA74x) },
>   { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA72x) },
>   { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x81c0) },
> + { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x80c0) },

The Freescale PCI devices are the only devices in this table that don't
have a define for their device ID. I think a define should be created
for both of the device IDs above.

Thanks,

Andrew Murray

>   { PCI_DEVICE_DATA(SYNOPSYS, EDDA, NULL) },
>   { PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_AM654),
> .driver_data = (kernel_ulong_t)&am654_data
> -- 
> 2.9.5
> 


Re: [PATCH v3 09/11] PCI: layerscape: Add EP mode support for ls1088a and ls2088a

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:14AM +0800, Xiaowei Bao wrote:
> Add PCIe EP mode support for ls1088a and ls2088a, there are some
> difference between LS1 and LS2 platform, so refactor the code of
> the EP driver.
> 
> Signed-off-by: Xiaowei Bao 
> ---
> v2: 
>  - This is a new patch for supporting the ls1088a and ls2088a platform.
> v3:
>  - Adjust the some struct assignment order in probe function.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 72 
> +++---
>  1 file changed, 53 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
> b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> index 5f0cb99..723bbe5 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> @@ -20,27 +20,29 @@
>  
>  #define PCIE_DBI2_OFFSET 0x1000  /* DBI2 base address*/
>  
> -struct ls_pcie_ep {
> - struct dw_pcie  *pci;
> - struct pci_epc_features *ls_epc;
> +#define to_ls_pcie_ep(x) dev_get_drvdata((x)->dev)
> +
> +struct ls_pcie_ep_drvdata {
> + u32 func_offset;
> + const struct dw_pcie_ep_ops *ops;
> + const struct dw_pcie_ops*dw_pcie_ops;
>  };
>  
> -#define to_ls_pcie_ep(x) dev_get_drvdata((x)->dev)
> +struct ls_pcie_ep {
> + struct dw_pcie  *pci;
> + struct pci_epc_features *ls_epc;
> + const struct ls_pcie_ep_drvdata *drvdata;
> +};
>  
>  static int ls_pcie_establish_link(struct dw_pcie *pci)
>  {
>   return 0;
>  }
>  
> -static const struct dw_pcie_ops ls_pcie_ep_ops = {
> +static const struct dw_pcie_ops dw_ls_pcie_ep_ops = {
>   .start_link = ls_pcie_establish_link,
>  };
>  
> -static const struct of_device_id ls_pcie_ep_of_match[] = {
> - { .compatible = "fsl,ls-pcie-ep",},
> - { },
> -};
> -
>  static const struct pci_epc_features*
>  ls_pcie_ep_get_features(struct dw_pcie_ep *ep)
>  {
> @@ -87,10 +89,39 @@ static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 
> func_no,
>   }
>  }
>  
> -static const struct dw_pcie_ep_ops pcie_ep_ops = {
> +static unsigned int ls_pcie_ep_func_conf_select(struct dw_pcie_ep *ep,
> + u8 func_no)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
> +
> + WARN_ON(func_no && !pcie->drvdata->func_offset);
> + return pcie->drvdata->func_offset * func_no;
> +}
> +
> +static const struct dw_pcie_ep_ops ls_pcie_ep_ops = {
>   .ep_init = ls_pcie_ep_init,
>   .raise_irq = ls_pcie_ep_raise_irq,
>   .get_features = ls_pcie_ep_get_features,
> + .func_conf_select = ls_pcie_ep_func_conf_select,
> +};
> +
> +static const struct ls_pcie_ep_drvdata ls1_ep_drvdata = {
> + .ops = &ls_pcie_ep_ops,
> + .dw_pcie_ops = &dw_ls_pcie_ep_ops,
> +};
> +
> +static const struct ls_pcie_ep_drvdata ls2_ep_drvdata = {
> + .func_offset = 0x2,
> + .ops = &ls_pcie_ep_ops,
> + .dw_pcie_ops = &dw_ls_pcie_ep_ops,
> +};
> +
> +static const struct of_device_id ls_pcie_ep_of_match[] = {
> + { .compatible = "fsl,ls1046a-pcie-ep", .data = &ls1_ep_drvdata },
> + { .compatible = "fsl,ls1088a-pcie-ep", .data = &ls2_ep_drvdata },
> + { .compatible = "fsl,ls2088a-pcie-ep", .data = &ls2_ep_drvdata },
> + { },

This removes support for "fsl,ls-pcie-ep" - was that intentional? If you do
plan to drop it please make sure you explain why in the commit message. See
also my comments in your dt-binding patch.

Thanks,

Andrew Murray

>  };
>  
>  static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
> @@ -103,7 +134,7 @@ static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
>   int ret;
>  
>   ep = &pcie->ep;
> - ep->ops = _ep_ops;
> + ep->ops = pcie->drvdata->ops;
>  
>   res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "addr_space");
>   if (!res)
> @@ -142,20 +173,23 @@ static int __init ls_pcie_ep_probe(struct 
> platform_device *pdev)
>   if (!ls_epc)
>   return -ENOMEM;
>  
> - dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
> - pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
> - if (IS_ERR(pci->dbi_base))
> - return PTR_ERR(pci->dbi_base);
> + pcie->drvdata = of_device_get_match_data(dev);
>  
> - pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET;
>   pci->dev = dev;
> - pci->ops = &ls_pcie_ep_ops;
> - pcie->pci = pci;
> + pci->ops = pcie->drvdata->dw_pcie_ops;
>  
>   ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
>  
> + pcie->pci = pci;
>   pcie->ls_epc = ls_epc;
>  
> + dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
> + pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
> + if (IS_ERR(pci->dbi_base))
> + return PTR_ERR(pci->dbi_base);
> +
> + pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET;
> +
> 

Re: [PATCH v3 05/11] dt-bindings: pci: layerscape-pci: add compatible strings for ls1088a and ls2088a

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:10AM +0800, Xiaowei Bao wrote:
> Add compatible strings for ls1088a and ls2088a.
> 
> Signed-off-by: Xiaowei Bao 
> ---
> v2:
>  - No change.
> v3:
>  - Use one valid combination of compatible strings.
> 
>  Documentation/devicetree/bindings/pci/layerscape-pci.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt 
> b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> index e20ceaa..762ae41 100644
> --- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> +++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> @@ -22,7 +22,9 @@ Required properties:
>  "fsl,ls1043a-pcie"
>  "fsl,ls1012a-pcie"
>EP mode:
> - "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"
> + "fsl,ls1046a-pcie-ep" "fsl,ls-pcie-ep"
> + "fsl,ls1088a-pcie-ep" "fsl,ls-pcie-ep"
> + "fsl,ls2088a-pcie-ep" "fsl,ls-pcie-ep"

This isn't consistent with "[PATCH v3 09/11] PCI: layerscape: Add EP mode..."
as that patch drops the fallback "fsl,ls-pcie-ep". Either the fallback must
be preserved in the driver, or you need to drop it here.

What if there are existing users that depend on the fallback?

(I'm also not sure if that comma should have been dropped).

Thanks,

Andrew Murray

>  - reg: base addresses and lengths of the PCIe controller register blocks.
>  - interrupts: A list of interrupt outputs of the controller. Must contain an
>entry for each entry in the interrupt-names property.
> -- 
> 2.9.5
> 


Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Yunsheng Lin
On 2019/9/2 15:25, Peter Zijlstra wrote:
> On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
>> On 2019/9/1 0:12, Peter Zijlstra wrote:
> 
>>> 1) because even it is not set, the device really does belong to a node.
>>> It is impossible a device will have magic uniform access to memory when
>>> CPUs cannot.
>>
>> So it means dev_to_node() will return either NUMA_NO_NODE or a
>> valid node id?
> 
> NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> said, not a valid device location on a NUMA system.
> 
> Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> node association. It just means we don't know and might have to guess.

How do we guess the device's location when ACPI/BIOS does not set it?

It seems dev_to_node() does not do anything about that and leaves the
job to the caller, or to whatever function gets called with its return
value, such as cpumask_of_node().

> 
>>> 2) is already true today, cpumask_of_node() requires a valid node_id.
>>
>> Ok, most of the user does check node_id before calling
>> cpumask_of_node(), but does a little different type of checking:
>>
>> 1) some does " < 0" check;
>> 2) some does "== NUMA_NO_NODE" check;
>> 3) some does ">= MAX_NUMNODES" check;
>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
> 
> The one true way is:
> 
>   '(unsigned)node_id >= nr_node_ids'

I missed the magic of the "unsigned" in your previous reply.

> 
>>> 3) is just wrong and increases overhead for everyone.
>>
>> Ok, cpumask_of_node() is also used in some critical path such
>> as scheduling, which may not need those checking, the overhead
>> is unnecessary.
>>
>> But for non-critical path such as setup or configuration path,
>> it better to have consistent checking, and also simplify the
>> user code that calls cpumask_of_node().
>>
>> Do you think it is worth the trouble to add a new function
>> such as cpumask_of_node_check(maybe some other name) to do
>> consistent checking?
>>
>> Or caller just simply check if dev_to_node()'s return value is
>> NUMA_NO_NODE before calling cpumask_of_node()?
> 
> It is not a matter of convenience. The function is called
> cpumask_of_node(), when node < 0 || node >= nr_node_ids, it is not a
> valid node, therefore the function shouldn't return anything except an
> error.
What do you mean by error? I can think of three types of error:
1) return NULL; this way cpumask_of_node() also leaves the
   job to the function that calls it.
2) return cpu_none_mask; I am not sure what this means, maybe it means
   there is no cpu on the same node as the device?
3) give a warning, stack dump, or even a BUG_ON?

I would prefer the second one, and implement the third one when
CONFIG_DEBUG_PER_CPU_MAPS is selected.

Any suggestion?

> 
> Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
> cpumask_of_node() already does this (although it wants the below fix).

Thanks for the note and example.



Re: Oops (request_key_auth_describe) while running cve-2016-7042 from LTP

2019-09-02 Thread Hillf Danton
>
> Hi Hillf,
>
> Would you like to me to put you down as the author of this patch?  If so, I'll
> need a Signed-off-by from you.
>
Signed-off-by: Hillf Danton 



Re: [PATCH v3 08/11] PCI: layerscape: Modify the MSIX to the doorbell mode

2019-09-02 Thread Andrew Murray
On Mon, Sep 02, 2019 at 11:17:13AM +0800, Xiaowei Bao wrote:
> dw_pcie_ep_raise_msix_irq was never called in the existing driver
> before, because the ls1046a platform doesn't support the MSIX feature
> and msix_capable was always set to false.
> Now that the ls1088a platform is added with MSIX support, the existing
> dw_pcie_ep_raise_msix_irq doesn't work, so use the doorbell method to
> support the MSIX feature.
> 
> Signed-off-by: Xiaowei Bao 

Reviewed-by: Andrew Murray 

> ---
> v2: 
>  - No change
> v3:
>  - Modify the commit message to make it clearer.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
> b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> index 1e07287..5f0cb99 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
> @@ -79,7 +79,8 @@ static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 
> func_no,
>   case PCI_EPC_IRQ_MSI:
>   return dw_pcie_ep_raise_msi_irq(ep, func_no, interrupt_num);
>   case PCI_EPC_IRQ_MSIX:
> - return dw_pcie_ep_raise_msix_irq(ep, func_no, interrupt_num);
> + return dw_pcie_ep_raise_msix_irq_doorbell(ep, func_no,
> +   interrupt_num);
>   default:
>   dev_err(pci->dev, "UNKNOWN IRQ type\n");
>   return -EINVAL;
> -- 
> 2.9.5
> 


Re: [PATCH v3 3/4] x86/efi: move common keyring handler functions to new file

2019-09-02 Thread Michael Ellerman
Nayna Jain  writes:

> The handlers that add keys to the .platform keyring and blacklisted
> hashes to the .blacklist keyring are common to both the uefi and powerpc
> mechanisms of loading the keys/hashes from the firmware.
>
> This patch moves the common code from load_uefi.c to keyring_handler.c
>
> Signed-off-by: Nayna Jain 
> ---
>  security/integrity/Makefile   |  3 +-
>  .../platform_certs/keyring_handler.c  | 80 +++
>  .../platform_certs/keyring_handler.h  | 32 
>  security/integrity/platform_certs/load_uefi.c | 67 +---
>  4 files changed, 115 insertions(+), 67 deletions(-)
>  create mode 100644 security/integrity/platform_certs/keyring_handler.c
>  create mode 100644 security/integrity/platform_certs/keyring_handler.h

This has no acks from security folks, though I'm not really clear on who
maintains those files.

Do I take it because it's mostly just code movement people are OK with
it going in via the powerpc tree?

cheers

> diff --git a/security/integrity/Makefile b/security/integrity/Makefile
> index 19faace69644..525bf1d6e0db 100644
> --- a/security/integrity/Makefile
> +++ b/security/integrity/Makefile
> @@ -11,7 +11,8 @@ integrity-$(CONFIG_INTEGRITY_SIGNATURE) += digsig.o
>  integrity-$(CONFIG_INTEGRITY_ASYMMETRIC_KEYS) += digsig_asymmetric.o
>  integrity-$(CONFIG_INTEGRITY_PLATFORM_KEYRING) += 
> platform_certs/platform_keyring.o
>  integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
> - platform_certs/load_uefi.o
> +   platform_certs/load_uefi.o \
> +   platform_certs/keyring_handler.o
>  integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
>  $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
>  
> diff --git a/security/integrity/platform_certs/keyring_handler.c 
> b/security/integrity/platform_certs/keyring_handler.c
> new file mode 100644
> index ..c5ba695c10e3
> --- /dev/null
> +++ b/security/integrity/platform_certs/keyring_handler.c
> @@ -0,0 +1,80 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "../integrity.h"
> +
> +static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID;
> +static efi_guid_t efi_cert_x509_sha256_guid __initdata =
> + EFI_CERT_X509_SHA256_GUID;
> +static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID;
> +
> +/*
> + * Blacklist a hash.
> + */
> +static __init void uefi_blacklist_hash(const char *source, const void *data,
> +size_t len, const char *type,
> +size_t type_len)
> +{
> + char *hash, *p;
> +
> + hash = kmalloc(type_len + len * 2 + 1, GFP_KERNEL);
> + if (!hash)
> + return;
> + p = memcpy(hash, type, type_len);
> + p += type_len;
> + bin2hex(p, data, len);
> + p += len * 2;
> + *p = 0;
> +
> + mark_hash_blacklisted(hash);
> + kfree(hash);
> +}
> +
> +/*
> + * Blacklist an X509 TBS hash.
> + */
> +static __init void uefi_blacklist_x509_tbs(const char *source,
> +const void *data, size_t len)
> +{
> + uefi_blacklist_hash(source, data, len, "tbs:", 4);
> +}
> +
> +/*
> + * Blacklist the hash of an executable.
> + */
> +static __init void uefi_blacklist_binary(const char *source,
> +  const void *data, size_t len)
> +{
> + uefi_blacklist_hash(source, data, len, "bin:", 4);
> +}
> +
> +/*
> + * Return the appropriate handler for particular signature list types found 
> in
> + * the UEFI db and MokListRT tables.
> + */
> +__init efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type)
> +{
> + if (efi_guidcmp(*sig_type, efi_cert_x509_guid) == 0)
> + return add_to_platform_keyring;
> + return 0;
> +}
> +
> +/*
> + * Return the appropriate handler for particular signature list types found 
> in
> + * the UEFI dbx and MokListXRT tables.
> + */
> +__init efi_element_handler_t get_handler_for_dbx(const efi_guid_t *sig_type)
> +{
> + if (efi_guidcmp(*sig_type, efi_cert_x509_sha256_guid) == 0)
> + return uefi_blacklist_x509_tbs;
> + if (efi_guidcmp(*sig_type, efi_cert_sha256_guid) == 0)
> + return uefi_blacklist_binary;
> + return 0;
> +}
> diff --git a/security/integrity/platform_certs/keyring_handler.h 
> b/security/integrity/platform_certs/keyring_handler.h
> new file mode 100644
> index ..2462bfa08fe3
> --- /dev/null
> +++ b/security/integrity/platform_certs/keyring_handler.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef PLATFORM_CERTS_INTERNAL_H
> +#define PLATFORM_CERTS_INTERNAL_H
> +
> +#include 
> +
> +void blacklist_hash(const char *source, const void *data,
> + size_t len, 

Re: linux-next: build failure after merge of the powerpc tree

2019-09-02 Thread Christoph Hellwig
On Mon, Sep 02, 2019 at 09:40:11PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the powerpc tree, today's linux-next build (powerpc
> ppc44x_defconfig) failed like this:

Yes, this conflict is expected and we discussed it before.  I'll make
sure Linus is in the loop when sending the pull request.


Re: [PATCH v5 2/2] powerpc: Add support to initialize ima policy rules

2019-09-02 Thread Michael Ellerman
Hi Nayna,

Some more comments below.

Nayna Jain  writes:
> POWER secure boot relies on the kernel IMA security subsystem to
> perform the OS kernel image signature verification.

Again this is just a design choice we've made, it's not specified
anywhere or anything like that. And it only applies to bare metal secure
boot, at least so far. AIUI.

> Since each secure
> boot mode has different IMA policy requirements, dynamic definition of
> the policy rules based on the runtime secure boot mode of the system is
> required. On systems that support secure boot, but have it disabled,
> only measurement policy rules of the kernel image and modules are
> defined.

It's probably worth mentioning that we intend to use this in our
Linux-based boot loader, which uses kexec, and that's one of the reasons
why we're particularly interested in defining the rules for kexec?

> This patch defines the arch-specific implementation to retrieve the
> secure boot mode of the system and accordingly configures the IMA policy
> rules.
>
> This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
> config is enabled.
>
> Signed-off-by: Nayna Jain 
> ---
>  arch/powerpc/Kconfig   |  2 ++
>  arch/powerpc/kernel/Makefile   |  2 +-
>  arch/powerpc/kernel/ima_arch.c | 50 ++
>  include/linux/ima.h|  3 +-
>  4 files changed, 55 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/kernel/ima_arch.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index c902a39124dc..42109682b727 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -917,6 +917,8 @@ config PPC_SECURE_BOOT
>   bool
>   default n
>   depends on PPC64
> + depends on IMA
> + depends on IMA_ARCH_POLICY
>   help
> Linux on POWER with firmware secure boot enabled needs to define
> security policies to extend secure boot to the OS. This config
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index d310ebb4e526..520b1c814197 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -157,7 +157,7 @@ endif
>  obj-$(CONFIG_EPAPR_PARAVIRT) += epapr_paravirt.o epapr_hcalls.o
>  obj-$(CONFIG_KVM_GUEST)  += kvm.o kvm_emul.o
>  
> -obj-$(CONFIG_PPC_SECURE_BOOT)+= secboot.o
> +obj-$(CONFIG_PPC_SECURE_BOOT)+= secboot.o ima_arch.o
>  
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n
> diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
> new file mode 100644
> index ..ac90fac83338
> --- /dev/null
> +++ b/arch/powerpc/kernel/ima_arch.c
> @@ -0,0 +1,50 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain 
> + *
> + * ima_arch.c
> + *  - initialize ima policies for PowerPC Secure Boot
> + */
> +
> +#include 
> +#include 
> +
> +bool arch_ima_get_secureboot(void)
> +{
> + return get_powerpc_secureboot();
> +}
> +
> +/*
> + * File signature verification is not needed, include only measurements
> + */
> +static const char *const default_arch_rules[] = {
> + "measure func=KEXEC_KERNEL_CHECK",
> + "measure func=MODULE_CHECK",
> + NULL
> +};

The rules above seem fairly self explanatory.

> +
> +/* Both file signature verification and measurements are needed */
> +static const char *const sb_arch_rules[] = {
> + "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
> + "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
> +#if IS_ENABLED(CONFIG_MODULE_SIG)
> + "measure func=MODULE_CHECK",
> +#else
> + "measure func=MODULE_CHECK template=ima-modsig",
> + "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
> +#endif

But these ones are not so obvious, at least to me who knows very little
about IMA.

Can you add a one line comment to each of the ones in here saying what
it does and why we want it?

> + NULL
> +};
> +
> +/*
> + * On PowerPC, file measurements are to be added to the IMA measurement list
> + * irrespective of the secure boot state of the system.

Why? Just because we think it's useful? Would be good to provide some
further justification.

> + * Signature verification
> + * is conditionally enabled based on the secure boot state.
> + */
> +const char *const *arch_get_ima_policy(void)
> +{
> + if (IS_ENABLED(CONFIG_IMA_ARCH_POLICY) && arch_ima_get_secureboot())
> + return sb_arch_rules;
> + return default_arch_rules;
> +}
> diff --git a/include/linux/ima.h b/include/linux/ima.h
> index a20ad398d260..10af09b5b478 100644
> --- a/include/linux/ima.h
> +++ b/include/linux/ima.h
> @@ -29,7 +29,8 @@ extern void ima_kexec_cmdline(const void *buf, int size);
>  extern void ima_add_kexec_buffer(struct kimage *image);
>  #endif
>  
> -#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390)
> +#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) 

Re: [PATCH v5 1/2] powerpc: detect the secure boot mode of the system

2019-09-02 Thread Michael Ellerman
Hi Nayna,

Sorry I've taken so long to get to this series, there's just too many
patches that need reviewing :/

Nayna Jain  writes:
> Secure boot on POWER defines different IMA policies based on the secure
> boot state of the system.

The terminology throughout is a bit vague, we have POWER, PowerPC, Linux
on POWER etc.

What this patch is talking about is a particular implemention of secure
boot on some OpenPOWER machines running bare metal - am I right?

So saying "Secure boot on POWER defines different IMA policies" is a bit
broad I think. Really we've just decided that a way to implement secure
boot is to use IMA policies.

> This patch defines a function to detect the secure boot state of the
> system.
>
> The PPC_SECURE_BOOT config represents the base enablement of secureboot
> on POWER.
>
> Signed-off-by: Nayna Jain 
> ---
>  arch/powerpc/Kconfig   | 11 +
>  arch/powerpc/include/asm/secboot.h | 27 
>  arch/powerpc/kernel/Makefile   |  2 +
>  arch/powerpc/kernel/secboot.c  | 71 ++
>  4 files changed, 111 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/secboot.h
>  create mode 100644 arch/powerpc/kernel/secboot.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 77f6ebf97113..c902a39124dc 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -912,6 +912,17 @@ config PPC_MEM_KEYS
>  
> If unsure, say y.
>  
> +config PPC_SECURE_BOOT
> + prompt "Enable PowerPC Secure Boot"

How about "Enable secure boot support"

> + bool
> + default n

The default is 'n', so you don't need that default line.

> + depends on PPC64

Should it just depend on POWERNV for now? AFAIK there's nothing in here
that's necessarily going to be shared with the guest secure boot code is
there?

> + help
> +   Linux on POWER with firmware secure boot enabled needs to define
> +   security policies to extend secure boot to the OS. This config
> +   allows user to enable OS Secure Boot on PowerPC systems that
> +   have firmware secure boot support.

Again POWER vs PowerPC.

I think something like:

"Enable support for secure boot on some systems that have firmware
support for it. If in doubt say N."


> diff --git a/arch/powerpc/include/asm/secboot.h 
> b/arch/powerpc/include/asm/secboot.h

secure_boot.h would be fine.

> new file mode 100644
> index ..e726261bb00b
> --- /dev/null
> +++ b/arch/powerpc/include/asm/secboot.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PowerPC secure boot definitions
> + *
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain 

I prefer to not have email addresses in copyright headers, as they just
bit rot. Your email is in the git log.

> + *
> + */
> +#ifndef POWERPC_SECBOOT_H
> +#define POWERPC_SECBOOT_H

We usually do _ASM_POWERPC_SECBOOT_H (or _ASM_POWERPC_SECURE_BOOT_H).

> +#ifdef CONFIG_PPC_SECURE_BOOT
> +extern struct device_node *is_powerpc_secvar_supported(void);
> +extern bool get_powerpc_secureboot(void);

You don't need 'extern' for functions in headers.

> +#else
> +static inline struct device_node *is_powerpc_secvar_supported(void)
> +{
> + return NULL;
> +}
> +
> +static inline bool get_powerpc_secureboot(void)
> +{
> + return false;
> +}
> +
> +#endif
> +#endif
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index ea0c69236789..d310ebb4e526 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -157,6 +157,8 @@ endif
>  obj-$(CONFIG_EPAPR_PARAVIRT) += epapr_paravirt.o epapr_hcalls.o
>  obj-$(CONFIG_KVM_GUEST)  += kvm.o kvm_emul.o
>  
> +obj-$(CONFIG_PPC_SECURE_BOOT)+= secboot.o
> +
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n
>  KCOV_INSTRUMENT_prom_init.o := n
> diff --git a/arch/powerpc/kernel/secboot.c b/arch/powerpc/kernel/secboot.c
> new file mode 100644
> index ..5ea0d52d64ef
> --- /dev/null
> +++ b/arch/powerpc/kernel/secboot.c
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain 
> + *
> + * secboot.c
> + *  - util function to get powerpc secboot state

That's not really necessary.

> + */
> +#include 
> +#include 
> +#include 
> +
> +struct device_node *is_powerpc_secvar_supported(void)

This is a pretty weird signature. The "is_" implies it will return a
bool, but then it actually returns a device node *.

> +{
> + struct device_node *np;
> + int status;
> +
> + np = of_find_node_by_name(NULL, "ibm,secureboot");
> + if (!np) {
> + pr_info("secureboot node is not found\n");
> + return NULL;
> + }

There's no good reason to search by name. You should just search by compatible.

eg. of_find_compatible_node()

> + status = of_device_is_compatible(np, "ibm,secureboot-v3");
> + if (!status) {
> +

linux-next: build failure after merge of the powerpc tree

2019-09-02 Thread Stephen Rothwell
Hi all,

After merging the powerpc tree, today's linux-next build (powerpc
ppc44x_defconfig) failed like this:

arch/powerpc/mm/dma-noncoherent.c: In function 'atomic_pool_init':
arch/powerpc/mm/dma-noncoherent.c:128:9: error: implicit declaration of 
function 'dma_atomic_pool_init'; did you mean 'atomic_pool_init'? 
[-Werror=implicit-function-declaration]
  128 |  return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
  | ^~~~
  | atomic_pool_init

Caused by commit

  f2902a2fb40c ("powerpc: use the generic dma coherent remap allocator")

interacting with commit

  8e3a68fb55e0 ("dma-mapping: make dma_atomic_pool_init self-contained")

from the dma-mapping tree.

I have applied the following patch for today (I did the microblaze
update as well):

From: Stephen Rothwell 
Date: Mon, 2 Sep 2019 21:23:11 +1000
Subject: [PATCH] merge fixes for "dma-mapping: make dma_atomic_pool_init 
self-contained"

Signed-off-by: Stephen Rothwell 
---
 arch/microblaze/mm/consistent.c   | 6 --
 arch/powerpc/mm/dma-noncoherent.c | 6 --
 2 files changed, 12 deletions(-)

diff --git a/arch/microblaze/mm/consistent.c b/arch/microblaze/mm/consistent.c
index 0e0f733eb846..8c5f0c332d8b 100644
--- a/arch/microblaze/mm/consistent.c
+++ b/arch/microblaze/mm/consistent.c
@@ -56,10 +56,4 @@ void *cached_kernel_address(void *ptr)
 
return (void *)(addr & ~UNCACHED_SHADOW_MASK);
 }
-#else /* CONFIG_MMU */
-static int __init atomic_pool_init(void)
-{
-   return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
-}
-postcore_initcall(atomic_pool_init);
 #endif /* CONFIG_MMU */
diff --git a/arch/powerpc/mm/dma-noncoherent.c 
b/arch/powerpc/mm/dma-noncoherent.c
index 4272ca5e8159..2a82984356f8 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -122,9 +122,3 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
 
flush_dcache_range(kaddr, kaddr + size);
 }
-
-static int __init atomic_pool_init(void)
-{
-   return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
-}
-postcore_initcall(atomic_pool_init);
-- 
2.23.0.rc1

-- 
Cheers,
Stephen Rothwell


pgpHs5a_aT6ed.pgp
Description: OpenPGP digital signature


[PATCH v6 5/5] kasan debug: track pages allocated for vmalloc shadow

2019-09-02 Thread Daniel Axtens
Provide the current number of vmalloc shadow pages in
/sys/kernel/debug/kasan_vmalloc/shadow_pages.

Signed-off-by: Daniel Axtens 

---

Merging this is probably overkill, but I leave it to the discretion
of the broader community.

On v4 (no dynamic freeing), I saw the following approximate figures
on my test VM:

 - fresh boot: 720
 - after test_vmalloc: ~14000

With v5 (lazy dynamic freeing):

 - boot: ~490-500
 - running modprobe test_vmalloc pushes the figures up to sometimes
as high as ~14000, but they drop down to ~560 after the test ends.
I'm not sure where the extra sixty pages are from, but running the
test repeatedly doesn't cause the number to keep growing, so I don't
think we're leaking.
 - with vmap_stack, spawning tasks pushes the figure up to ~4200, then
some clearing kicks in and drops it down to previous levels again.
---
 mm/kasan/common.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 0b5141108cdc..fae3cf4ab23a 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <linux/debugfs.h>
 
 #include "kasan.h"
 #include "../slab.h"
@@ -748,6 +749,8 @@ core_initcall(kasan_memhotplug_init);
 #endif
 
 #ifdef CONFIG_KASAN_VMALLOC
+static u64 vmalloc_shadow_pages;
+
 static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
  void *unused)
 {
@@ -774,6 +777,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned 
long addr,
if (likely(pte_none(*ptep))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
+   vmalloc_shadow_pages++;
}
spin_unlock(&init_mm.page_table_lock);
if (page)
@@ -827,6 +831,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, 
unsigned long addr,
if (likely(!pte_none(*ptep))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
+   vmalloc_shadow_pages--;
}
spin_unlock(&init_mm.page_table_lock);
 
@@ -882,4 +887,25 @@ void kasan_release_vmalloc(unsigned long start, unsigned 
long end,
(unsigned long)(shadow_end - shadow_start),
kasan_depopulate_vmalloc_pte, NULL);
 }
+
+static __init int kasan_init_vmalloc_debugfs(void)
+{
+   struct dentry *root, *count;
+
+   root = debugfs_create_dir("kasan_vmalloc", NULL);
+   if (IS_ERR(root)) {
+   if (PTR_ERR(root) == -ENODEV)
+   return 0;
+   return PTR_ERR(root);
+   }
+
+   count = debugfs_create_u64("shadow_pages", 0444, root,
+  &vmalloc_shadow_pages);
+
+   if (IS_ERR(count))
+   return PTR_ERR(root);
+
+   return 0;
+}
+late_initcall(kasan_init_vmalloc_debugfs);
 #endif
-- 
2.20.1



[PATCH v6 4/5] x86/kasan: support KASAN_VMALLOC

2019-09-02 Thread Daniel Axtens
In the case where KASAN directly allocates memory to back vmalloc
space, don't map the early shadow page over it.

We prepopulate pgds/p4ds for the range that would otherwise be empty.
This is required to get it synced to hardware on boot, allowing the
lower levels of the page tables to be filled dynamically.

Acked-by: Dmitry Vyukov 
Signed-off-by: Daniel Axtens 

---
v5: fix some checkpatch CHECK warnings. There are some that remain
around lines ending with '(': I have not changed these because
it's consistent with the rest of the file and it's not easy to
see how to fix it without creating an overlong line or lots of
temporary variables.

v2: move from faulting in shadow pgds to prepopulating
---
 arch/x86/Kconfig|  1 +
 arch/x86/mm/kasan_init_64.c | 60 +
 2 files changed, 61 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2502f7f60c9c..300b4766ccfa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -134,6 +134,7 @@ config X86
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN  if X86_64
+   select HAVE_ARCH_KASAN_VMALLOC  if X86_64
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS  if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if MMU && COMPAT
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 296da58f3013..8f00f462709e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -245,6 +245,51 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
} while (pgd++, addr = next, addr != end);
 }
 
+static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
+  unsigned long addr,
+  unsigned long end,
+  int nid)
+{
+   p4d_t *p4d;
+   unsigned long next;
+   void *p;
+
+   p4d = p4d_offset(pgd, addr);
+   do {
+   next = p4d_addr_end(addr, end);
+
+   if (p4d_none(*p4d)) {
+   p = early_alloc(PAGE_SIZE, nid, true);
p4d_populate(&init_mm, p4d, p);
+   }
+   } while (p4d++, addr = next, addr != end);
+}
+
+static void __init kasan_shallow_populate_pgds(void *start, void *end)
+{
+   unsigned long addr, next;
+   pgd_t *pgd;
+   void *p;
+   int nid = early_pfn_to_nid((unsigned long)start);
+
+   addr = (unsigned long)start;
+   pgd = pgd_offset_k(addr);
+   do {
+   next = pgd_addr_end(addr, (unsigned long)end);
+
+   if (pgd_none(*pgd)) {
+   p = early_alloc(PAGE_SIZE, nid, true);
+   pgd_populate(&init_mm, pgd, p);
+   }
+
+   /*
+* we need to populate p4ds to be synced when running in
+* four level mode - see sync_global_pgds_l4()
+*/
+   kasan_shallow_populate_p4ds(pgd, addr, next, nid);
+   } while (pgd++, addr = next, addr != (unsigned long)end);
+}
+
 #ifdef CONFIG_KASAN_INLINE
 static int kasan_die_handler(struct notifier_block *self,
 unsigned long val,
@@ -352,9 +397,24 @@ void __init kasan_init(void)
shadow_cpu_entry_end = (void *)round_up(
(unsigned long)shadow_cpu_entry_end, PAGE_SIZE);
 
+   /*
+* If we're in full vmalloc mode, don't back vmalloc space with early
+* shadow pages. Instead, prepopulate pgds/p4ds so they are synced to
+* the global table and we can populate the lower levels on demand.
+*/
+#ifdef CONFIG_KASAN_VMALLOC
+   kasan_shallow_populate_pgds(
+   kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
+   kasan_mem_to_shadow((void *)VMALLOC_END));
+
+   kasan_populate_early_shadow(
+   kasan_mem_to_shadow((void *)VMALLOC_END + 1),
+   shadow_cpu_entry_begin);
+#else
kasan_populate_early_shadow(
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
shadow_cpu_entry_begin);
+#endif
 
kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
  (unsigned long)shadow_cpu_entry_end, 0);
-- 
2.20.1



[PATCH v6 3/5] fork: support VMAP_STACK with KASAN_VMALLOC

2019-09-02 Thread Daniel Axtens
Supporting VMAP_STACK with KASAN_VMALLOC is straightforward:

 - clear the shadow region of vmapped stacks when swapping them in
 - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN

Reviewed-by: Dmitry Vyukov 
Signed-off-by: Daniel Axtens 
---
 arch/Kconfig  | 9 +
 kernel/fork.c | 4 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6728c5fa057e..e15f1486682a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -843,16 +843,17 @@ config HAVE_ARCH_VMAP_STACK
 config VMAP_STACK
default y
bool "Use a virtually-mapped stack"
-   depends on HAVE_ARCH_VMAP_STACK && !KASAN
+   depends on HAVE_ARCH_VMAP_STACK
+   depends on !KASAN || KASAN_VMALLOC
---help---
  Enable this if you want the use virtually-mapped kernel stacks
  with guard pages.  This causes kernel stack overflows to be
  caught immediately rather than causing difficult-to-diagnose
  corruption.
 
- This is presently incompatible with KASAN because KASAN expects
- the stack to map directly to the KASAN shadow map using a formula
- that is incorrect if the stack is in vmalloc space.
+ To use this with KASAN, the architecture must support backing
+ virtual mappings with real shadow memory, and KASAN_VMALLOC must
+ be enabled.
 
 config ARCH_OPTIONAL_KERNEL_RWX
def_bool n
diff --git a/kernel/fork.c b/kernel/fork.c
index f601168f6b21..52279fd5e72d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -229,6 +230,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!s)
continue;
 
+   /* Clear the KASAN shadow of the stack. */
+   kasan_unpoison_shadow(s->addr, THREAD_SIZE);
+
/* Clear stale pointers from reused stack. */
memset(s->addr, 0, THREAD_SIZE);
 
-- 
2.20.1



[PATCH v6 2/5] kasan: add test for vmalloc

2019-09-02 Thread Daniel Axtens
Test kasan vmalloc support by adding a new test to the module.

Signed-off-by: Daniel Axtens 

--

v5: split out per Christophe Leroy
---
 lib/test_kasan.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index 49cc4d570a40..328d33beae36 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -748,6 +749,30 @@ static noinline void __init kmalloc_double_kzfree(void)
kzfree(ptr);
 }
 
+#ifdef CONFIG_KASAN_VMALLOC
+static noinline void __init vmalloc_oob(void)
+{
+   void *area;
+
+   pr_info("vmalloc out-of-bounds\n");
+
+   /*
+* We have to be careful not to hit the guard page.
+* The MMU will catch that and crash us.
+*/
+   area = vmalloc(3000);
+   if (!area) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   ((volatile char *)area)[3100];
+   vfree(area);
+}
+#else
+static void __init vmalloc_oob(void) {}
+#endif
+
 static int __init kmalloc_tests_init(void)
 {
/*
@@ -793,6 +818,7 @@ static int __init kmalloc_tests_init(void)
kasan_strings();
kasan_bitops();
kmalloc_double_kzfree();
+   vmalloc_oob();
 
kasan_restore_multi_shot(multishot);
 
-- 
2.20.1



[PATCH v6 1/5] kasan: support backing vmalloc space with real shadow memory

2019-09-02 Thread Daniel Axtens
Hook into vmalloc and vmap, and dynamically allocate real shadow
memory to back the mappings.

Most mappings in vmalloc space are small, requiring less than a full
page of shadow space. Allocating a full shadow page per mapping would
therefore be wasteful. Furthermore, to ensure that different mappings
use different shadow pages, mappings would have to be aligned to
KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.

Instead, share backing space across multiple mappings. Allocate a
backing page when a mapping in vmalloc space uses a particular page of
the shadow region. This page can be shared by other vmalloc mappings
later on.

We hook in to the vmap infrastructure to lazily clean up unused shadow
memory.

To avoid the difficulties around swapping mappings around, this code
expects that the part of the shadow region that covers the vmalloc
space will not be covered by the early shadow page, but will be left
unmapped. This will require changes in arch-specific code.

This allows KASAN with VMAP_STACK, and may be helpful for architectures
that do not have a separate module space (e.g. powerpc64, which I am
currently working on). It also allows relaxing the module alignment
back to PAGE_SIZE.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009
Acked-by: Vasily Gorbik 
Signed-off-by: Daniel Axtens 
[Mark: rework shadow allocation]
Signed-off-by: Mark Rutland 

--

v2: let kasan_unpoison_shadow deal with ranges that do not use a
full shadow byte.

v3: relax module alignment
rename to kasan_populate_vmalloc which is a much better name
deal with concurrency correctly

v4: Mark's rework
Poison pages on vfree
Handle allocation failures

v5: Per Christophe Leroy, split out test and dynamically free pages.

v6: Guard freeing page properly. Drop WARN_ON_ONCE(pte_none(*ptep)),
 on reflection it's unnecessary debugging cruft with too high a
 false positive rate.
---
 Documentation/dev-tools/kasan.rst |  63 ++
 include/linux/kasan.h |  31 +++
 include/linux/moduleloader.h  |   2 +-
 include/linux/vmalloc.h   |  12 +++
 lib/Kconfig.kasan |  16 
 mm/kasan/common.c | 139 ++
 mm/kasan/generic_report.c |   3 +
 mm/kasan/kasan.h  |   1 +
 mm/vmalloc.c  |  45 +-
 9 files changed, 310 insertions(+), 2 deletions(-)

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index b72d07d70239..bdb92c3de7a5 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -215,3 +215,66 @@ brk handler is used to print bug reports.
 A potential expansion of this mode is a hardware tag-based mode, which would
 use hardware memory tagging support instead of compiler instrumentation and
 manual shadow memory manipulation.
+
+What memory accesses are sanitised by KASAN?
+
+
+The kernel maps memory in a number of different parts of the address
+space. This poses something of a problem for KASAN, which requires
+that all addresses accessed by instrumented code have a valid shadow
+region.
+
+The range of kernel virtual addresses is large: there is not enough
+real memory to support a real shadow region for every address that
+could be accessed by the kernel.
+
+By default
+~~
+
+By default, architectures only map real memory over the shadow region
+for the linear mapping (and potentially other small areas). For all
+other areas - such as vmalloc and vmemmap space - a single read-only
+page is mapped over the shadow area. This read-only shadow page
+declares all memory accesses as permitted.
+
+This presents a problem for modules: they do not live in the linear
+mapping, but in a dedicated module space. By hooking in to the module
+allocator, KASAN can temporarily map real shadow memory to cover
+them. This allows detection of invalid accesses to module globals, for
+example.
+
+This also creates an incompatibility with ``VMAP_STACK``: if the stack
+lives in vmalloc space, it will be shadowed by the read-only page, and
+the kernel will fault when trying to set up the shadow data for stack
+variables.
+
+CONFIG_KASAN_VMALLOC
+
+
+With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
+cost of greater memory usage. Currently this is only supported on x86.
+
+This works by hooking into vmalloc and vmap, and dynamically
+allocating real shadow memory to back the mappings.
+
+Most mappings in vmalloc space are small, requiring less than a full
+page of shadow space. Allocating a full shadow page per mapping would
+therefore be wasteful. Furthermore, to ensure that different mappings
+use different shadow pages, mappings would have to be aligned to
+``KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE``.
+
+Instead, we share backing space across multiple mappings. We allocate
+a backing page when a mapping in vmalloc space uses a particular page
+of the shadow region. This page can be shared by other vmalloc mappings
+later on.

[PATCH v6 0/5] kasan: support backing vmalloc space with real shadow memory

2019-09-02 Thread Daniel Axtens
Currently, vmalloc space is backed by the early shadow page. This
means that kasan is incompatible with VMAP_STACK.

This series provides a mechanism to back vmalloc space with real,
dynamically allocated memory. I have only wired up x86, because that's
the only currently supported arch I can work with easily, but it's
very easy to wire up other architectures, and it appears that there is
some work-in-progress code to do this on arm64 and s390.

This has been discussed before in the context of VMAP_STACK:
 - https://bugzilla.kernel.org/show_bug.cgi?id=202009
 - https://lkml.org/lkml/2018/7/22/198
 - https://lkml.org/lkml/2019/7/19/822

In terms of implementation details:

Most mappings in vmalloc space are small, requiring less than a full
page of shadow space. Allocating a full shadow page per mapping would
therefore be wasteful. Furthermore, to ensure that different mappings
use different shadow pages, mappings would have to be aligned to
KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.

Instead, share backing space across multiple mappings. Allocate a
backing page when a mapping in vmalloc space uses a particular page of
the shadow region. This page can be shared by other vmalloc mappings
later on.

We hook in to the vmap infrastructure to lazily clean up unused shadow
memory.


v1: https://lore.kernel.org/linux-mm/20190725055503.19507-1-...@axtens.net/
v2: https://lore.kernel.org/linux-mm/20190729142108.23343-1-...@axtens.net/
 Address review comments:
 - Patch 1: use kasan_unpoison_shadow's built-in handling of
ranges that do not align to a full shadow byte
 - Patch 3: prepopulate pgds rather than faulting things in
v3: https://lore.kernel.org/linux-mm/20190731071550.31814-1-...@axtens.net/
 Address comments from Mark Rutland:
 - kasan_populate_vmalloc is a better name
 - handle concurrency correctly
 - various nits and cleanups
 - relax module alignment in KASAN_VMALLOC case
v4: https://lore.kernel.org/linux-mm/20190815001636.12235-1-...@axtens.net/
 Changes to patch 1 only:
 - Integrate Mark's rework, thanks Mark!
 - handle the case where kasan_populate_shadow might fail
 - poison shadow on free, allowing the alloc path to just
 unpoison memory that it uses
v5: https://lore.kernel.org/linux-mm/20190830003821.10737-1-...@axtens.net/
 Address comments from Christophe Leroy:
 - Fix some issues with my descriptions in commit messages and docs
 - Dynamically free unused shadow pages by hooking into the vmap book-keeping
 - Split out the test into a separate patch
 - Optional patch to track the number of pages allocated
 - minor checkpatch cleanups
v6: Properly guard freeing pages in patch 1, drop debugging code.

Daniel Axtens (5):
  kasan: support backing vmalloc space with real shadow memory
  kasan: add test for vmalloc
  fork: support VMAP_STACK with KASAN_VMALLOC
  x86/kasan: support KASAN_VMALLOC
  kasan debug: track pages allocated for vmalloc shadow

 Documentation/dev-tools/kasan.rst |  63 
 arch/Kconfig  |   9 +-
 arch/x86/Kconfig  |   1 +
 arch/x86/mm/kasan_init_64.c   |  60 +++
 include/linux/kasan.h |  31 ++
 include/linux/moduleloader.h  |   2 +-
 include/linux/vmalloc.h   |  12 +++
 kernel/fork.c |   4 +
 lib/Kconfig.kasan |  16 +++
 lib/test_kasan.c  |  26 +
 mm/kasan/common.c | 165 ++
 mm/kasan/generic_report.c |   3 +
 mm/kasan/kasan.h  |   1 +
 mm/vmalloc.c  |  45 +++-
 14 files changed, 432 insertions(+), 6 deletions(-)

-- 
2.20.1



Re: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-09-02 Thread Segher Boessenkool
On Mon, Sep 02, 2019 at 11:48:59AM +1000, Michael Ellerman wrote:
> "Alastair D'Silva"  writes:
> > On Wed, 2019-08-21 at 22:27 +0200, Christophe Leroy wrote:
> >> Can we be 100% sure that GCC won't add any code accessing some
> >> global data or stack while the Data MMU is OFF ?
> >
> > +mpe
> >
> > I'm not sure how we would go about making such a guarantee, but I've
> > tied every variable used to a register and addr is passed in a
> > register, so there is no stack usage, and every call in there only
> > operates on its operands.
> 
> That's not safe, I can believe it happens to work but the compiler
> people will laugh at us if it ever breaks.

Yes.  Sorry.

> Let's leave it in asm.

+1

The asm is simpler, more readable, more maintainable, and perhaps more
performant even.  Plus the being-laughed-at issue.


Segher


Re: linux-next: manual merge of the powerpc tree with the arm64 tree

2019-09-02 Thread Michael Ellerman
Catalin Marinas  writes:
> On Mon, Sep 02, 2019 at 11:44:43AM +1000, Michael Ellerman wrote:
>> Stephen Rothwell  writes:
>> > Hi all,
>> >
>> > Today's linux-next merge of the powerpc tree got a conflict in:
>> >
>> >   arch/Kconfig
>> >
>> > between commit:
>> >
>> >   5cf896fb6be3 ("arm64: Add support for relocating the kernel with RELR 
>> > relocations")
>> >
>> > from the arm64 tree and commit:
>> >
>> >   0c9c1d563975 ("x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to 
>> > arch/Kconfig")
>> >
>> > from the powerpc tree.
>> >
>> > I fixed it up (see below) and can carry the fix as necessary. This
>> > is now fixed as far as linux-next is concerned, but any non trivial
>> > conflicts should be mentioned to your upstream maintainer when your tree
>> > is submitted for merging.  You may also want to consider cooperating
>> > with the maintainer of the conflicting tree to minimise any particularly
>> > complex conflicts.
>> 
>> Thanks.
>> 
>> That conflict seems entirely trivial, but Catalin/Will if it bothers you
>> I have the conflicting commit in a topic branch based on rc2 which you
>> could merge to resolve it:
>
> It's a trivial conflict, easy to resolve. I don't think it's worth
> trying to avoid it (Linus normally doesn't mind such conflicts).

Yep, I agree.

cheers


Re: [PATCH v2 4/4] powerpc/64: system call implement the bulk of the logic in C

2019-09-02 Thread Michael Ellerman
Michal Suchánek  writes:
> On Sat, 31 Aug 2019 02:48:26 +0800
> kbuild test robot  wrote:
>
>> Hi Nicholas,
>> 
>> I love your patch! Yet something to improve:
>> 
>> [auto build test ERROR on linus/master]
>> [cannot apply to v5.3-rc6 next-20190830]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>> 
>> url:
>> https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64-syscalls-in-C/20190828-064221
>> config: powerpc64-defconfig (attached as .config)
>> compiler: powerpc64-linux-gcc (GCC) 7.4.0
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> GCC_VERSION=7.4.0 make.cross ARCH=powerpc64 
>> 
>> If you fix the issue, kindly add following tag
>> Reported-by: kbuild test robot 
>> 
>> All errors (new ones prefixed by >>):
>> 
>>powerpc64-linux-ld: warning: orphan section `.gnu.hash' from `linker 
>> stubs' being placed in section `.gnu.hash'.
>>arch/powerpc/kernel/syscall_64.o: In function `.system_call_exception':
>> >> syscall_64.c:(.text+0x180): undefined reference to `.tabort_syscall'  
>
> Interestingly it builds and boots for me. Is this something about
> dotted vs dotless symbols depending on some config options?

It's the big endian build that fails, which is ELFv1, and the linker is
saying it can't find a function called `.tabort_syscall` - which is
understandable because there isn't one. There's a text address called
`tabort_syscall` but it has no function descriptor so you can't call it
normally from C.

> I see there is _GLOBAL(ret_from_fork) just below so maybe we need
> _GLOBAL(tabort_syscall) instead of .globl tabort_syscall as well.

Yes, on ELFv1 the _GLOBAL macros creates a function descriptor for you.

This fixes it for me:

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 82bcb9a68172..8f2735da205d 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -170,8 +170,7 @@ _ASM_NOKPROBE_SYMBOL(system_call_common);
b   .Lsyscall_restore_regs_cont
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   .globl tabort_syscall
-tabort_syscall:
+_GLOBAL(tabort_syscall)
/* Firstly we need to enable TM in the kernel */
mfmsr   r10
li  r9, 1


cheers


Re: linux-next: manual merge of the powerpc tree with the arm64 tree

2019-09-02 Thread Will Deacon
On Mon, Sep 02, 2019 at 10:08:46AM +0100, Catalin Marinas wrote:
> On Mon, Sep 02, 2019 at 11:44:43AM +1000, Michael Ellerman wrote:
> > Stephen Rothwell  writes:
> > > Hi all,
> > >
> > > Today's linux-next merge of the powerpc tree got a conflict in:
> > >
> > >   arch/Kconfig
> > >
> > > between commit:
> > >
> > >   5cf896fb6be3 ("arm64: Add support for relocating the kernel with RELR 
> > > relocations")
> > >
> > > from the arm64 tree and commit:
> > >
> > >   0c9c1d563975 ("x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to 
> > > arch/Kconfig")
> > >
> > > from the powerpc tree.
> > >
> > > I fixed it up (see below) and can carry the fix as necessary. This
> > > is now fixed as far as linux-next is concerned, but any non trivial
> > > conflicts should be mentioned to your upstream maintainer when your tree
> > > is submitted for merging.  You may also want to consider cooperating
> > > with the maintainer of the conflicting tree to minimise any particularly
> > > complex conflicts.
> > 
> > Thanks.
> > 
> > That conflict seems entirely trivial, but Catalin/Will if it bothers you
> > I have the conflicting commit in a topic branch based on rc2 which you
> > could merge to resolve it:
> 
> It's a trivial conflict, easy to resolve. I don't think it's worth
> trying to avoid it (Linus normally doesn't mind such conflicts).

Agreed, we can live with this one :)

Will


Re: [PATCH v7 5/6] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2019-09-02 Thread Michal Suchánek
On Mon, 02 Sep 2019 12:03:12 +1000
Michael Ellerman  wrote:

> Michal Suchanek  writes:
> > On bigendian ppc64 it is common to have 32bit legacy binaries but much
> > less so on littleendian.  
> 
> I think the toolchain people will tell you that there is no 32-bit
> little endian ABI defined at all, if anything works it's by accident.

I have seen a piece of software that works around code issues on 64bit
by always compiling 32bit code. So it does work in some way. Also it
has been pointed out that you can still switch to BE even with the
'fast-switch' removed.

> 
> So I think we should not make this selectable, unless someone puts their
> hand up to say they want it and are willing to test it and keep it
> working.

I don't really care either way.

Thanks

Michal

> 
> cheers
> 
> > v3: make configurable
> > ---
> >  arch/powerpc/Kconfig | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 5bab0bb6b833..b0339e892329 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -264,8 +264,9 @@ config PANIC_TIMEOUT
> > default 180
> >  
> >  config COMPAT
> > -   bool
> > -   default y if PPC64
> > +   bool "Enable support for 32bit binaries"
> > +   depends on PPC64
> > +   default y if !CPU_LITTLE_ENDIAN
> > select COMPAT_BINFMT_ELF
> > select ARCH_WANT_OLD_COMPAT_IPC
> > select COMPAT_OLD_SIGACTION
> > -- 
> > 2.22.0  



Re: linux-next: manual merge of the powerpc tree with the arm64 tree

2019-09-02 Thread Catalin Marinas
On Mon, Sep 02, 2019 at 11:44:43AM +1000, Michael Ellerman wrote:
> Stephen Rothwell  writes:
> > Hi all,
> >
> > Today's linux-next merge of the powerpc tree got a conflict in:
> >
> >   arch/Kconfig
> >
> > between commit:
> >
> >   5cf896fb6be3 ("arm64: Add support for relocating the kernel with RELR 
> > relocations")
> >
> > from the arm64 tree and commit:
> >
> >   0c9c1d563975 ("x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to 
> > arch/Kconfig")
> >
> > from the powerpc tree.
> >
> > I fixed it up (see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging.  You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.
> 
> Thanks.
> 
> That conflict seems entirely trivial, but Catalin/Will if it bothers you
> I have the conflicting commit in a topic branch based on rc2 which you
> could merge to resolve it:

It's a trivial conflict, easy to resolve. I don't think it's worth
trying to avoid it (Linus normally doesn't mind such conflicts).

-- 
Catalin


[PATCH] Revert "powerpc: Add barrier_nospec to raw_copy_in_user()"

2019-09-02 Thread Michal Suchanek


This reverts commit 6fbcdd59094ade30db63f32316e9502425d7b256.

Not needed. Data handled by raw_copy_in_user must be loaded through
copy_from_user to be used in the kernel which already has the barrier.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/include/asm/uaccess.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 8b03eb44e876..76f34346b642 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -312,7 +312,6 @@ raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
 {
unsigned long ret;
 
-   barrier_nospec();
allow_user_access(to, from, n);
ret = __copy_tofrom_user(to, from, n);
prevent_user_access(to, from, n);
-- 
2.22.0



Re: [PATCH v7 3/6] powerpc/perf: consolidate read_user_stack_32

2019-09-02 Thread Michal Suchánek
On Mon, 02 Sep 2019 14:01:17 +1000
Michael Ellerman  wrote:

> Michael Ellerman  writes:
> > Michal Suchanek  writes:  
> ...
> >> @@ -295,6 +279,12 @@ static inline int current_is_64bit(void)
> >>  }
> >>  
> >>  #else  /* CONFIG_PPC64 */
> >> +static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
> >> +{
> >> +  return 0;
> >> +}
> >> +#endif /* CONFIG_PPC64 */  
> >
> > Ending the PPC64 else case here, and then restarting it below with an
> > ifndef means we end up with two parts of the file that define 32-bit
> > code, with a common chunk in the middle, which I dislike.
> >
> > I'd rather you add the empty read_user_stack_slow() in the existing
> > #else section and then move read_user_stack_32() below the whole ifdef
> > PPC64/else/endif section.
> >
> > Is there some reason that doesn't work?  
> 
> Gah, I missed that you split the whole file later in the series. Any
> reason you did it in two steps rather than moving patch 6 earlier in the
> series?

To make this patch readable.

Thanks

Michal


Re: [PATCH v7 1/7] kvmppc: Driver to manage pages of secure guest

2019-09-02 Thread Christoph Hellwig
On Fri, Aug 30, 2019 at 09:12:59AM +0530, Bharata B Rao wrote:
> On Thu, Aug 29, 2019 at 10:38:10AM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 22, 2019 at 03:56:14PM +0530, Bharata B Rao wrote:
> > > +/*
> > > + * Bits 60:56 in the rmap entry will be used to identify the
> > > + * different uses/functions of rmap.
> > > + */
> > > +#define KVMPPC_RMAP_DEVM_PFN (0x2ULL << 56)
> > 
> > How did you come up with this specific value?
> 
> Different usage types of RMAP array are being defined.
> https://patchwork.ozlabs.org/patch/1149791/
> 
> The above value is reserved for device pfn usage.

Shouldn't all these definitions go in together in a patch?  Also is
bit 56+ a set of values, so is there 1 << 56 and 3 << 56 as well?  Seems
like even that other patch doesn't fully define these "pfn" values.

> > No need for !! when returning a bool.  Also the helper seems a little
> > pointless, just opencoding it would make the code more readable in my
> > opinion.
> 
> I expect similar routines for other usages of RMAP to come up.

Please drop them all.  Having to wade through a header to check for
a specific bit that also is set manually elsewhere in related code
just obfuscates it for the reader.

> > > + *mig->dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> > > + return 0;
> > > +}
> > 
> > I think you can just merge this trivial helper into the only caller.
> 
> Yes I can, but felt it is nicely abstracted out to a function right now.

Not really.  It just fits the old calling conventions before I removed
the indirection.

> > Here we actually have two callers, but they have a fair amount of
> > duplicate code in them.  I think you want to move that common
> > code (including setting up the migrate_vma structure) into this
> > function and maybe also give it a more descriptive name.
> 
> Sure, I will give this a try. The name is already very descriptive, will
> come up with an appropriate name.

I don't think alloc_and_copy is very helpful.  It matches some of the
implementation, but not the intent.  Why not kvmppc_svm_page_in/out
similar to the hypervisor calls calling them?  Yes, for one case it
also gets called from the pagefault handler, but it still performs
these basic page in/out actions.

> BTW this file and the function prefixes in this file started out with
> kvmppc_hmm, switched to kvmppc_devm when HMM routines weren't used anymore.
> Now with the use of only non-dev versions, planning to switch to
> kvmppc_uvmem_

That prefix sounds fine to me as well.


Re: [PATCH v2 2/9] x86: numa: check the node id consistently for x86

2019-09-02 Thread Peter Zijlstra
On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> On 2019/9/1 0:12, Peter Zijlstra wrote:

> > 1) because even it is not set, the device really does belong to a node.
> > It is impossible a device will have magic uniform access to memory when
> > CPUs cannot.
> 
> So it means dev_to_node() will return either NUMA_NO_NODE or a
> valid node id?

NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
said, not a valid device location on a NUMA system.

Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
node association. It just means we don't know and might have to guess.

> > 2) is already true today, cpumask_of_node() requires a valid node_id.
> 
> Ok, most of the user does check node_id before calling
> cpumask_of_node(), but does a little different type of checking:
> 
> 1) some does " < 0" check;
> 2) some does "== NUMA_NO_NODE" check;
> 3) some does ">= MAX_NUMNODES" check;
> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.

The one true way is:

'(unsigned)node_id >= nr_node_ids'

> > 3) is just wrong and increases overhead for everyone.
> 
> Ok, cpumask_of_node() is also used in some critical path such
> as scheduling, which may not need those checking, the overhead
> is unnecessary.
> 
> But for non-critical path such as setup or configuration path,
> it better to have consistent checking, and also simplify the
> user code that calls cpumask_of_node().
> 
> Do you think it is worth the trouble to add a new function
> such as cpumask_of_node_check(maybe some other name) to do
> consistent checking?
> 
> Or caller just simply check if dev_to_node()'s return value is
> NUMA_NO_NODE before calling cpumask_of_node()?

It is not a matter of convenience. The function is called
cpumask_of_node(), when node < 0 || node >= nr_node_ids, it is not a
valid node, therefore the function shouldn't return anything except an
error.

Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
cpumask_of_node() already does this (although it wants the below fix).

---
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad600614c..5f49c10201c7 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,7 +861,7 @@ void numa_remove_cpu(int cpu)
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-   if (node >= nr_node_ids) {
+   if ((unsigned)node >= nr_node_ids) {
printk(KERN_WARNING
"cpumask_of_node(%d): node > nr_node_ids(%u)\n",
node, nr_node_ids);


Re: [PATCH] powerpc: Perform a bounds check in arch_add_memory

2019-09-02 Thread David Hildenbrand
On 02.09.19 01:54, Alastair D'Silva wrote:
> On Tue, 2019-08-27 at 09:13 +0200, David Hildenbrand wrote:
>> On 27.08.19 08:39, Alastair D'Silva wrote:
>>> On Tue, 2019-08-27 at 08:28 +0200, Michal Hocko wrote:
 On Tue 27-08-19 15:20:46, Alastair D'Silva wrote:
> From: Alastair D'Silva 
>
> It is possible for firmware to allocate memory ranges outside
> the range of physical memory that we support
> (MAX_PHYSMEM_BITS).

 Doesn't that count as a FW bug? Do you have any evidence of that
 in
 the
 field? Just wondering...

>>>
>>> Not outside our lab, but OpenCAPI attached LPC memory is assigned
>>> addresses based on the slot/NPU it is connected to. These addresses
>>> prior to:
>>> 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to
>>> 2PB")
>>> were inaccessible and resulted in bogus sections - see our
>>> discussion
>>> on 'mm: Trigger bug on if a section is not found in __section_nr'.
>>> Doing this check here was your suggestion :)
>>>
>>> It's entirely possible that a similar problem will occur in the
>>> future,
>>> and it's cheap to guard against, which is why I've added this.
>>>
>>
>> If you keep it here, I guess this should be wrapped by a
>> WARN_ON_ONCE().
>>
>> If we move it to common code (e.g., __add_pages() or add_memory()),
>> then
>> probably not. I can see that s390x allows to configure
>> MAX_PHYSMEM_BITS,
>> so the check could actually make sense.
>>
> 
> I couldn't see a nice platform indepedent way to determine the
> allowable address range, but if there is, then I'll move this to the
> generic code instead.
> 

At least on the !ZONE_DEVICE path we have

__add_memory() -> register_memory_resource() ...

return ERR_PTR(-E2BIG);


I was thinking about something like

int add_pages()
{
if ((start + size - 1) >> MAX_PHYSMEM_BITS)
return -E2BIG;  

return arch_add_memory(...)
}

And switching users of arch_add_memory() to add_pages(). However, x86
already has an add_pages() function, so that would need some more thought.

Maybe simply renaming the existing add_pages() to arch_add_pages().

add_pages(): Create virtual mapping
__add_pages(): Don't create virtual mapping

arch_add_memory(): Arch backend for add_pages()
arch_add_pages(): Arch backend for __add_pages()

It would be even more consistent if we would have arch_add_pages() vs.
__arch_add_pages().

-- 

Thanks,

David / dhildenb


Re: Oops (request_key_auth_describe) while running cve-2016-7042 from LTP

2019-09-02 Thread David Howells
Hi Hillf,

Would you like me to put you down as the author of this patch?  If so, I'll
need a Signed-off-by from you.

David
---
commit df882ad6d4e24a3763719c1798ea58e87d56c2d7
Author: Hillf Danton 
Date:   Fri Aug 30 15:54:33 2019 +0100

keys: Fix missing null pointer check in request_key_auth_describe()

If a request_key authentication token key gets revoked, there's a window in
which request_key_auth_describe() can see it with a NULL payload - but it
makes no check for this and something like the following oops may occur:

BUG: Kernel NULL pointer dereference at 0x0038
Faulting instruction address: 0xc04ddf30
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [...] request_key_auth_describe+0x90/0xd0
LR [...] request_key_auth_describe+0x54/0xd0
Call Trace:
[...] request_key_auth_describe+0x54/0xd0 (unreliable)
[...] proc_keys_show+0x308/0x4c0
[...] seq_read+0x3d0/0x540
[...] proc_reg_read+0x90/0x110
[...] __vfs_read+0x3c/0x70
[...] vfs_read+0xb4/0x1b0
[...] ksys_read+0x7c/0x130
[...] system_call+0x5c/0x70

Fix this by checking for a NULL pointer when describing such a key.

Also make the read routine check for a NULL pointer to be on the safe side.

Fixes: 04c567d9313e ("[PATCH] Keys: Fix race between two instantiators of a key")
Reported-by: Sachin Sant 
Signed-off-by: David Howells 
Tested-by: Sachin Sant 

diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index e73ec040e250..ecba39c93fd9 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -66,6 +66,9 @@ static void request_key_auth_describe(const struct key *key,
 {
struct request_key_auth *rka = dereference_key_rcu(key);
 
+   if (!rka)
+   return;
+
seq_puts(m, "key:");
seq_puts(m, key->description);
if (key_is_positive(key))
@@ -83,6 +86,9 @@ static long request_key_auth_read(const struct key *key,
size_t datalen;
long ret;
 
+   if (!rka)
+   return -EKEYREVOKED;
+
datalen = rka->callout_len;
ret = datalen;
 


Re: [PATCH v2 8/9] mips: numa: check the node id consistently for mips ip27

2019-09-02 Thread Yunsheng Lin
On 2019/8/31 23:45, Paul Burton wrote:
> Hi Yunsheng,
> 
> On Sat, Aug 31, 2019 at 01:58:22PM +0800, Yunsheng Lin wrote:
>> According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
>> of proximity domain is optional, as below:
>>
>> This optional object is used to describe proximity domain
>> associations within a machine. _PXM evaluates to an integer
>> that identifies a device as belonging to a Proximity Domain
>> defined in the System Resource Affinity Table (SRAT).
>>
>> Since mips ip27 uses hub_data instead of node_to_cpumask_map,
>> this patch checks node id with the below case before returning
>> &hub_data(node)->h_cpus:
>> 1. if node_id >= MAX_COMPACT_NODES, return cpu_none_mask
>> 2. if node_id < 0, return cpu_online_mask
>> 3. if hub_data(node) is NULL, return cpu_online_mask
>>
>> [1] https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
> 
> Similar to David's comment on the sparc patch, these systems don't use
> ACPI so I don't see from your commit message why this change would be
> relevant.
> 
> This same comment applies to patch 9 too.

Thanks for pointing out.

Is MIPS's NUMA node id also defined by DT?


> 
> Thanks,
> Paul
> 
>>
>> Signed-off-by: Yunsheng Lin 
>> ---
>>  arch/mips/include/asm/mach-ip27/topology.h | 15 ---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
>> b/arch/mips/include/asm/mach-ip27/topology.h
>> index 965f079..914a55a 100644
>> --- a/arch/mips/include/asm/mach-ip27/topology.h
>> +++ b/arch/mips/include/asm/mach-ip27/topology.h
>> @@ -15,9 +15,18 @@ struct cpuinfo_ip27 {
>>  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
>>  
>>  #define cpu_to_node(cpu)	(sn_cpu_info[(cpu)].p_nodeid)
>> -#define cpumask_of_node(node)	((node) == -1 ?		\
>> -				 cpu_all_mask :		\
>> -				 &hub_data(node)->h_cpus)
>> +
>> +static inline const struct cpumask *cpumask_of_node(int node)
>> +{
>> +	if (node >= MAX_COMPACT_NODES)
>> +		return cpu_none_mask;
>> +
>> +	if (node < 0 || !hub_data(node))
>> +		return cpu_online_mask;
>> +
>> +	return &hub_data(node)->h_cpus;
>> +}
>> +
>>  struct pci_bus;
>>  extern int pcibus_to_node(struct pci_bus *);
>>  
>> -- 
>> 2.8.1
>>
> 
> .
> 



Re: [PATCH v2 7/9] sparc64: numa: check the node id consistently for sparc64

2019-09-02 Thread Yunsheng Lin
On 2019/9/1 4:02, David Miller wrote:
> From: Yunsheng Lin 
> Date: Sat, 31 Aug 2019 16:57:04 +0800
> 
>> Did you mean sparc64 system does not has ACPI, the device's node id will
>> not specified by ACPI, so the ACPI is unrelated here?
> 
> Yes, sparc64 never has and never will have ACPI.
> 
> This is also true for several other platforms where you have made this
> change.
> 
> The assumption of your entire patch set is that the semantics of the
> NUMA node ID are somehow completely defined by ACPI semantics.  Which
> is not true.

Thanks for pointing out.

Is the NUMA node id on sparc64 systems defined by DT semantics?

> 
> .
>