On 04/05/2017 at 11:42, Michael Ellerman wrote:
Frederic Barrat <fbar...@linux.vnet.ibm.com> writes:

Introduce a new 'flags' attribute per context and define its first bit
to be a marker requiring all TLBIs for that context to be broadcast
globally. Once that marker is set on a context, it cannot be removed.

Such a marker is useful for memory contexts used by devices behind the
NPU and CAPP/PSL. The NPU and the PSL keep their own
translation cache so they need to see all the TLBIs for those
contexts.

Signed-off-by: Frederic Barrat <fbar...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  9 +++++++++
 arch/powerpc/include/asm/tlb.h           | 10 ++++++++--
 arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3e3811..7b640ab1cbeb 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -78,8 +78,12 @@ struct spinlock;
 /* Maximum possible number of NPUs in a system. */
 #define NV_MAX_NPUS 8

+/* Bit definitions for the context flags */
+#define MM_CONTEXT_GLOBAL_TLBI 1       /* TLBI must be global */

I think I'd prefer MM_GLOBAL_TLBIE; it's shorter, and tlbie is the name
of the instruction, so it's something people can search for.


OK


@@ -164,5 +168,10 @@ extern void radix_init_pseries(void);
 static inline void radix_init_pseries(void) { };
 #endif

+static inline void mm_context_set_global_tlbi(mm_context_t *ctx)
+{
+       set_bit(MM_CONTEXT_GLOBAL_TLBI, &ctx->flags);
+}

set_bit() is atomic but, like test_bit(), it implies no memory barrier,
so both are unordered vs other loads and stores.

So the caller will need to be careful they have a barrier between this
and whatever it is they do that creates mappings that might need to be
invalidated.

Agreed, I missed the barrier. So I need to set the flag, then add a write memory barrier. Then, in the case of cxl, we can attach to the accelerator.
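Something like this is what I have in mind (only a sketch; the attach call at the end is a placeholder, not an existing cxl function):

static void cxl_prepare_context(struct mm_struct *mm)
{
        /* Mark the context: all TLBIs for it must now be global */
        set_bit(MM_CONTEXT_GLOBAL_TLBI, &mm->context.flags);

        /*
         * Order the flag store before whatever the attach does that
         * lets the PSL start caching translations for this context.
         */
        smp_mb();

        /* attach_context_to_accelerator(mm);  placeholder for the cxl attach */
}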


Similarly on the read side we should have a barrier between the store
that makes the PTE invalid and the load of the flag.

That one is more subtle, at least to me, but I think I now see what you mean. Without an extra read barrier, we could be exposed to the following ordering:

CPU1                CPU2                 device
                    read flag=>local
set global flag
write barrier
attach
                                         read PTE
                    update PTE
                    tlbiel               not seen, hence broken


Is that what you meant?
That would mean an extra read barrier for each tlbie.
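In other words, something along these lines in mm_is_thread_local(), with the barrier placement being only a suggestion at this point, not the final patch:

static inline int mm_is_thread_local(struct mm_struct *mm)
{
        int rc;

        rc = cpumask_equal(mm_cpumask(mm),
                           cpumask_of(smp_processor_id()));
#ifdef CONFIG_PPC_BOOK3S_64
        /* Order the preceding PTE update before loading the flag */
        smp_mb();
        rc = rc && !test_bit(MM_CONTEXT_GLOBAL_TLBI, &mm->context.flags);
#endif
        return rc;
}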


Which makes me think cxl_ctx_in_use() is buggy :/, hmm. But it's late so
hopefully I'm wrong :D

Unfortunately, I think you're right. And we're missing the same two barriers: a write barrier when cxl atomically increments the use count before attaching, and a read barrier like the one above.
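Roughly, the pattern would be something like this (the names below only paraphrase what cxl does today, and the barriers are the missing ones being proposed, not existing code):

static atomic_t cxl_use_count = ATOMIC_INIT(0);

static void cxl_ctx_get(void)
{
        atomic_inc(&cxl_use_count);
        /*
         * Write side: make the count visible before the attach lets
         * the PSL start caching translations for the context.
         */
        smp_mb__after_atomic();
}

static bool cxl_ctx_in_use(void)
{
        /*
         * Read side: order the caller's PTE invalidation before
         * reading the count that decides between tlbiel and tlbie.
         */
        smp_mb();
        return atomic_read(&cxl_use_count) != 0;
}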

  Fred


diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 609557569f65..bd18ed083011 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -71,8 +71,14 @@ static inline int mm_is_core_local(struct mm_struct *mm)

 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
-       return cpumask_equal(mm_cpumask(mm),
-                             cpumask_of(smp_processor_id()));
+       int rc;
+
+       rc = cpumask_equal(mm_cpumask(mm),
+                       cpumask_of(smp_processor_id()));
+#ifdef CONFIG_PPC_BOOK3S_64
+       rc = rc && !test_bit(MM_CONTEXT_GLOBAL_TLBI, &mm->context.flags);
+#endif

The ifdef's a bit ugly, but I guess it's not worth putting it in a
static inline.

I'd be interested to see the generated code for this, and for the
reverse, ie. putting the test_bit() first, and doing an early return if
it's true. That way once the bit is set we can just skip the cpumask
comparison.
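Something like this (untested, with the barrier placement carried over from the discussion above):

static inline int mm_is_thread_local(struct mm_struct *mm)
{
#ifdef CONFIG_PPC_BOOK3S_64
        /* Order the preceding PTE update before loading the flag */
        smp_mb();
        if (test_bit(MM_CONTEXT_GLOBAL_TLBI, &mm->context.flags))
                return 0;       /* a device may hold translations: use tlbie */
#endif
        return cpumask_equal(mm_cpumask(mm),
                             cpumask_of(smp_processor_id()));
}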

cheers

