Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID

2017-06-19 Thread Nadav Amit
Nadav Amit  wrote:

>> 
> Just to clarify: I asked since I don’t understand how the interaction with
> PCID-unaware CR3 users goes. Specifically, IIUC, arch_efi_call_virt_teardown()
> can reload CR3 with an old PCID value. No?

Please ignore this email. I realized it is not a problem.

Nadav

Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID

2017-06-19 Thread Nadav Amit
Andy Lutomirski  wrote:

> On Sat, Jun 17, 2017 at 11:26 PM, Nadav Amit  wrote:
>>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski  wrote:
>>> 
>>> PCID is a "process context ID" -- it's what other architectures call
>>> an address space ID.  Every non-global TLB entry is tagged with a
>>> PCID, only TLB entries that match the currently selected PCID are
>>> used, and we can switch PGDs without flushing the TLB.  x86's
>>> PCID is 12 bits.
>>> 
>>> This is an unorthodox approach to using PCID.  x86's PCID is far too
>>> short to uniquely identify a process, and we can't even really
>>> uniquely identify a running process because there are monster
>>> systems with over 4096 CPUs.  To make matters worse, past attempts
>>> to use all 12 PCID bits have resulted in slowdowns instead of
>>> speedups.
>>> 
>>> This patch uses PCID differently.  We use a PCID to identify a
>>> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
>>> binding at all; instead, we give it a fresh PCID each time it's
>>> loaded except in cases where we want to preserve the TLB, in which
>>> case we reuse a recent value.
>>> 
>>> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
>>> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
>>> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
>>> pages, so PCID 0 conflicts won't cause problems.
>> 
>> Is this commit message outdated?
> 
> Yes, it's old.  Will fix.

Just to clarify: I asked since I don’t understand how the interaction with
PCID-unaware CR3 users goes. Specifically, IIUC, arch_efi_call_virt_teardown()
can reload CR3 with an old PCID value. No?
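
For context, the pattern being referred to is, roughly, a wholesale save and
restore of CR3 around the EFI runtime call. The sketch below is from memory
rather than a quote; efi_pgd_phys is a placeholder name, not the real symbol:

	unsigned long prev_cr3;

	/* arch_efi_call_virt_setup()-style entry (sketch) */
	prev_cr3 = read_cr3();		/* may carry a PCID in bits 11:0 */
	write_cr3(efi_pgd_phys);	/* PCID-unaware: PCID bits are zero */
	__flush_tlb_all();

	/* ... EFI runtime service call ... */

	/* arch_efi_call_virt_teardown()-style exit (sketch) */
	write_cr3(prev_cr3);		/* brings back whatever PCID was loaded before */
	__flush_tlb_all();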



Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID

2017-06-19 Thread Andy Lutomirski
On Sat, Jun 17, 2017 at 11:26 PM, Nadav Amit  wrote:
>
>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski  wrote:
>>
>> PCID is a "process context ID" -- it's what other architectures call
>> an address space ID.  Every non-global TLB entry is tagged with a
>> PCID, only TLB entries that match the currently selected PCID are
>> used, and we can switch PGDs without flushing the TLB.  x86's
>> PCID is 12 bits.
>>
>> This is an unorthodox approach to using PCID.  x86's PCID is far too
>> short to uniquely identify a process, and we can't even really
>> uniquely identify a running process because there are monster
>> systems with over 4096 CPUs.  To make matters worse, past attempts
>> to use all 12 PCID bits have resulted in slowdowns instead of
>> speedups.
>>
>> This patch uses PCID differently.  We use a PCID to identify a
>> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
>> binding at all; instead, we give it a fresh PCID each time it's
>> loaded except in cases where we want to preserve the TLB, in which
>> case we reuse a recent value.
>>
>> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
>> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
>> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
>> pages, so PCID 0 conflicts won't cause problems.
>
> Is this commit message outdated?

Yes, it's old.  Will fix.


Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID

2017-06-18 Thread Nadav Amit

> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski  wrote:
> 
> PCID is a "process context ID" -- it's what other architectures call
> an address space ID.  Every non-global TLB entry is tagged with a
> PCID, only TLB entries that match the currently selected PCID are
> used, and we can switch PGDs without flushing the TLB.  x86's
> PCID is 12 bits.
> 
> This is an unorthodox approach to using PCID.  x86's PCID is far too
> short to uniquely identify a process, and we can't even really
> uniquely identify a running process because there are monster
> systems with over 4096 CPUs.  To make matters worse, past attempts
> to use all 12 PCID bits have resulted in slowdowns instead of
> speedups.
> 
> This patch uses PCID differently.  We use a PCID to identify a
> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
> 
> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
> pages, so PCID 0 conflicts won't cause problems.

Is this commit message outdated? NR_DYNAMIC_ASIDS is set to 6.
More importantly, I do not see PCID 0 as reserved:

> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> +			    u16 *new_asid, bool *need_flush)
> +{
> 

[snip]

> +	if (*new_asid >= NR_DYNAMIC_ASIDS) {
> +		*new_asid = 0;
> +		this_cpu_write(cpu_tlbstate.next_asid, 1);
> +	}
> +	*need_flush = true;
> +}


Am I missing something?
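
To make the wrap-around concrete, here is a standalone sketch of the
allocation behaviour (plain userspace C, illustrative only; the increment
stands in for the snipped this_cpu_add_return() line):

	#include <stdio.h>

	#define NR_DYNAMIC_ASIDS 6

	/* Illustrative stand-in for one CPU's cpu_tlbstate.next_asid. */
	static unsigned int next_asid;

	static unsigned int alloc_asid(void)
	{
		unsigned int asid = next_asid++;

		if (asid >= NR_DYNAMIC_ASIDS) {
			/* Wrap: slot 0 goes back into the rotation. */
			asid = 0;
			next_asid = 1;
		}
		return asid;
	}

	int main(void)
	{
		int i;

		/* Prints 0 1 2 3 4 5 0 1: ASID 0 is handed to ordinary mms. */
		for (i = 0; i < 8; i++)
			printf("switch %d -> ASID %u\n", i, alloc_asid());
		return 0;
	}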



[PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID

2017-06-13 Thread Andy Lutomirski
PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.
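
Roughly, the per-cpu selection works like the following sketch (a simplified
paraphrase, not the exact code in this patch):

	static void sketch_choose_asid(struct mm_struct *next, u64 next_tlb_gen,
				       u16 *new_asid, bool *need_flush)
	{
		u16 asid;

		/* Was this mm used recently on this CPU?  If so, keep its slot. */
		for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
			if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
			    next->context.ctx_id)
				continue;

			*new_asid = asid;
			*need_flush = this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
				      next_tlb_gen;
			return;
		}

		/* Not resident here: take the next slot round-robin and flush it. */
		*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
		if (*new_asid >= NR_DYNAMIC_ASIDS) {
			*new_asid = 0;
			this_cpu_write(cpu_tlbstate.next_asid, 1);
		}
		*need_flush = true;
	}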

In particular, we use PCIDs 1-3 for recently-used mms and we reserve
PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
Nothing ever switches to PCID 0 without flushing PCID 0 non-global
pages, so PCID 0 conflicts won't cause problems.

This seems to save about 100ns on context switches between mms.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu_context.h |  3 ++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/include/asm/tlbflush.h| 18 +++-
 arch/x86/mm/init.c |  1 +
 arch/x86/mm/tlb.c  | 82 ++
 5 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index 69a4f1ee86ac..2537ec03c9b7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -299,6 +299,9 @@ static inline unsigned long __get_current_cr3_fast(void)
 {
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
 
+   if (static_cpu_has(X86_FEATURE_PCID))
+   cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());
 
diff --git a/arch/x86/include/asm/processor-flags.h 
b/arch/x86/include/asm/processor-flags.h
index 79aa2f98398d..791b60199aa4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
 /* Mask off the address space ID bits. */
 #define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
 #define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
  */
 #define CR3_ADDR_MASK 0xFFFFFFFFull
 #define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
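
For reference, the CR3 values built from the masks above look roughly like
this (illustrative helper names, not taken from the patch):

	/* Physical address of the PGD in the high bits, ASID in the PCID field. */
	static inline unsigned long sketch_build_cr3(pgd_t *pgd, u16 asid)
	{
		return __pa(pgd) | asid;
	}

	/* Bit 63 set: switch PCID without flushing the new PCID's TLB entries. */
	static inline unsigned long sketch_build_cr3_noflush(pgd_t *pgd, u16 asid)
	{
		return __pa(pgd) | asid | CR3_NOFLUSH;
	}
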
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 57b305e13c4c..a9a5aa6f45f7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define NR_DYNAMIC_ASIDS 6
+
 struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
 * mode even if we've already switched back to swapper_pg_dir.
 */
struct mm_struct *loaded_mm;
+   u16 loaded_mm_asid;
+   u16 next_asid;
 
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {
 
/*
 * This is a list of all contexts that might exist in the TLB.
-* Since we don't yet use PCID, there is only one context.
+* There is one per ASID that we use, and the ASID (what the
+* CPU calls PCID) is the index into ctxts.
 *
 * For each context, ctx_id indicates which mm the TLB's user
 * entries came from.  As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
 * To be clear, this means that it's legal for the TLB code to
 * flush the TLB without updating tlb_gen.  This can happen
 * (for now, at least) due to paravirt remote flushes.
+*
+* NB: context 0 is a bit special, since it's also used by
+* various bits of init code.  This is fine -- code that
+* isn't aware of PCID will end up harmlessly flushing
+* context 0.
 */
-   struct tlb_context ctxs[1];
+   struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d6fa4676af9..9c9570d300ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
 
 
