Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-23 Thread Jürgen Groß

On 23.05.24 18:30, Dave Hansen wrote:

On 5/16/24 06:02, Chen Yu wrote:

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled, virt_spin_lock_key is set to true even on bare-metal.
The qspinlock then degenerates to a test-and-set spinlock, which decreases
performance on bare-metal.

Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
is not set, or when running on bare-metal.


This is missing some background:

The kernel can change spinlock behavior when running as a guest.  But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

... then describe the regression and the fix


diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..ee51c0949ed8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
  
  void __init native_pv_lock_init(void)

  {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
+   if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
!boot_cpu_has(X86_FEATURE_HYPERVISOR))
static_branch_disable(&virt_spin_lock_key);
  }

This gets used at a single site:

 if (pv_enabled())
 goto pv_queue;

 if (virt_spin_lock(lock))
 return;

which is logically:

if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS))
goto ...; // don't look at virt_spin_lock_key

if (virt_spin_lock_key)
return; // On virt, but non-paravirt.  Did Test-and-Set
// spinlock.

So I _think_ Arnd was trying to optimize native_pv_lock_init() away when
it's going to get skipped over anyway by the 'goto'.

But this took me at least 30 minutes of scratching my head and trying to
untangle the whole thing.  It's all far too subtle for my taste, and all
of that to save a few bytes of init text in a configuration that's
probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n).

Let's just keep it simple.  How about the attached patch?


Simple indeed. The attachment is empty. :-p


Juergen



Re: [PATCH 4/5] LoongArch: Add paravirt interface for guest kernel

2024-01-02 Thread Jürgen Groß

On 03.01.24 08:16, Bibo Mao wrote:

This patch adds a paravirt interface for the guest kernel. It checks whether
the system runs in VM mode; if so, it detects the hypervisor type and
returns true if it is a KVM hypervisor, otherwise false. Currently only
the KVM hypervisor is supported, so there is only hypervisor detection
for the KVM type.


I guess you are talking of pv_guest_init() here? Or do you mean
kvm_para_available()?



Signed-off-by: Bibo Mao 
---
  arch/loongarch/Kconfig|  8 
  arch/loongarch/include/asm/kvm_para.h |  7 
  arch/loongarch/include/asm/paravirt.h | 27 
  .../include/asm/paravirt_api_clock.h  |  1 +
  arch/loongarch/kernel/Makefile|  1 +
  arch/loongarch/kernel/paravirt.c  | 41 +++
  arch/loongarch/kernel/setup.c |  2 +
  7 files changed, 87 insertions(+)
  create mode 100644 arch/loongarch/include/asm/paravirt.h
  create mode 100644 arch/loongarch/include/asm/paravirt_api_clock.h
  create mode 100644 arch/loongarch/kernel/paravirt.c

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index ee123820a476..940e5960d297 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -564,6 +564,14 @@ config CPU_HAS_PREFETCH
bool
default y
  
+config PARAVIRT

+   bool "Enable paravirtualization code"
+   help
+     This changes the kernel so it can modify itself when it is run
+     under a hypervisor, potentially improving performance significantly
+     over full virtualization.  However, when run without a hypervisor
+     the kernel is theoretically slower and slightly larger.
+
  config ARCH_SUPPORTS_KEXEC
def_bool y
  
diff --git a/arch/loongarch/include/asm/kvm_para.h b/arch/loongarch/include/asm/kvm_para.h

index 9425d3b7e486..41200e922a82 100644
--- a/arch/loongarch/include/asm/kvm_para.h
+++ b/arch/loongarch/include/asm/kvm_para.h
@@ -2,6 +2,13 @@
  #ifndef _ASM_LOONGARCH_KVM_PARA_H
  #define _ASM_LOONGARCH_KVM_PARA_H
  
+/*

+ * Hypcall code field
+ */
+#define HYPERVISOR_KVM 1
+#define HYPERVISOR_VENDOR_SHIFT8
+#define HYPERCALL_CODE(vendor, code)   ((vendor << HYPERVISOR_VENDOR_SHIFT) + code)
+
  /*
   * LoongArch hypcall return code
   */
diff --git a/arch/loongarch/include/asm/paravirt.h 
b/arch/loongarch/include/asm/paravirt.h
new file mode 100644
index ..b64813592ba0
--- /dev/null
+++ b/arch/loongarch/include/asm/paravirt.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_LOONGARCH_PARAVIRT_H
+#define _ASM_LOONGARCH_PARAVIRT_H
+
+#ifdef CONFIG_PARAVIRT
+#include 
+struct static_key;
+extern struct static_key paravirt_steal_enabled;
+extern struct static_key paravirt_steal_rq_enabled;
+
+u64 dummy_steal_clock(int cpu);
+DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return static_call(pv_steal_clock)(cpu);
+}
+
+int pv_guest_init(void);
+#else
+static inline int pv_guest_init(void)
+{
+   return 0;
+}
+
+#endif // CONFIG_PARAVIRT
+#endif
diff --git a/arch/loongarch/include/asm/paravirt_api_clock.h 
b/arch/loongarch/include/asm/paravirt_api_clock.h
new file mode 100644
index ..65ac7cee0dad
--- /dev/null
+++ b/arch/loongarch/include/asm/paravirt_api_clock.h
@@ -0,0 +1 @@
+#include <asm/paravirt.h>
diff --git a/arch/loongarch/kernel/Makefile b/arch/loongarch/kernel/Makefile
index 3c808c680370..662e6e9de12d 100644
--- a/arch/loongarch/kernel/Makefile
+++ b/arch/loongarch/kernel/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MODULES) += module.o module-sections.o
  obj-$(CONFIG_STACKTRACE)  += stacktrace.o
  
  obj-$(CONFIG_PROC_FS)		+= proc.o

+obj-$(CONFIG_PARAVIRT) += paravirt.o
  
  obj-$(CONFIG_SMP)		+= smp.o
  
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c

new file mode 100644
index ..21d01d05791a
--- /dev/null
+++ b/arch/loongarch/kernel/paravirt.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct static_key paravirt_steal_enabled;
+struct static_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
+DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);


This is the 4th arch with the same definition of native_steal_clock() and
pv_steal_clock. I think we should add a common file kernel/paravirt.c and
move the related functions from the archs into the new file.

If you don't want to do that I can prepare a series.


+
+static bool kvm_para_available(void)
+{
+   static int hypervisor_type;
+   int config;
+
+   if (!hypervisor_type) {
+   config = read_cpucfg(CPUCFG_KVM_SIG);
+   if (!memcmp(&config, KVM_SIGNATURE, 4))
+   hypervisor_type = HYPERVISOR_KVM;
+   }
+
+   return hypervisor_type == HYPERVISOR_KVM;
+}
+
+int 

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-03-24 Thread Jürgen Groß

On 02.03.21 07:28, Juergen Gross wrote:

In order to help identifying problems with IPI handling and remote
function execution add some more data to IPI debugging code.

There have been multiple reports of cpus looping long times (many
seconds) in smp_call_function_many() waiting for another cpu executing
a function like tlb flushing. Most of these reports have been for
cases where the kernel was running as a guest on top of KVM or Xen
(there are rumours of that happening under VMWare, too, and even on
bare metal).

Finding the root cause hasn't been successful yet, even after more than
2 years of chasing this bug by different developers.

Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
diagnostics") tried to address this by adding some debug code and by
issuing another IPI when a hang was detected. This helped mitigating
the problem (the repeated IPI unlocks the hang), but the root cause is
still unknown.

Current available data suggests that either an IPI wasn't sent when it
should have been, or that the IPI didn't result in the target cpu
executing the queued function (due to the IPI not reaching the cpu,
the IPI handler not being called, or the handler not seeing the queued
request).

Try to add more diagnostic data by introducing a global atomic counter
which is being incremented when doing critical operations (before and
after queueing a new request, when sending an IPI, and when dequeueing
a request). The counter value is stored in percpu variables which can
be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source
cpu, target cpu, and event type) is stored in a global variable. When
a new event is to be traced, the data of the last event is stored in
the event related percpu location and the global data is updated with
the new event's data. This allows tracking two events in one data
location: one by the value of the event data (the event before the
current one), and one by the location itself (the current event).

A typical printout with a detected hang will look like this:

csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 53 ns for 
CPU#06 scf_handler_1+0x0/0x50(0xa2a881bb1410).
csd: CSD lock (#1) handling prior 
scf_handler_1+0x0/0x50(0xa2a8813823c0) request.
 csd: cnt(8cc): -> dequeue (src cpu 0 == empty)
 csd: cnt(8cd): ->0006 idle
 csd: cnt(0003668): 0001->0006 queue
 csd: cnt(0003669): 0001->0006 ipi
 csd: cnt(0003e0f): 0007->000a queue
 csd: cnt(0003e10): 0001-> ping
 csd: cnt(0003e71): 0003-> ping
 csd: cnt(0003e72): ->0006 gotipi
 csd: cnt(0003e73): ->0006 handle
 csd: cnt(0003e74): ->0006 dequeue (src cpu 0 == empty)
 csd: cnt(0003e7f): 0004->0006 ping
 csd: cnt(0003e80): 0001-> pinged
 csd: cnt(0003eb2): 0005->0001 noipi
 csd: cnt(0003eb3): 0001->0006 queue
 csd: cnt(0003eb4): 0001->0006 noipi
 csd: cnt now: 0003f00

This example (being an artificial one, produced with a previous version
of this patch without the "hdlend" event), shows that cpu#6 started to
handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI
(sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
from cpu#4 in the queue already).

The idea is to print only relevant entries. Those are all events which
are associated with the hang (so sender side events for the source cpu
of the hanging request, and receiver side events for the target cpu),
and the related events just before those (for adding data needed to
identify a possible race). Printing all available data would be
possible, but this would add large amounts of data printed on larger
configurations.

Signed-off-by: Juergen Gross 
Tested-by: Paul E. McKenney 


Just an update regarding current status with debugging the underlying
issue:

On a customer's machine with a backport of this patch applied we've
seen another case of the hang. In the logs we've found:

smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting 
500046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(  (null)).

smp:csd: CSD lock (#1) unresponsive.
smp:csd: cnt(000): -> queue
smp:csd: cnt(001): ->0006 idle
smp:csd: cnt(0025dba): 0012->0006 queue
smp:csd: cnt(0025dbb): 0012->0006 noipi
smp:csd: cnt(01d1333): 001a->0006 pinged
smp:csd: cnt(01d1334): ->0006 gotipi
smp:csd: cnt(01d1335): ->0006 handle
smp:csd: cnt(01d1336): ->0006 dequeue (src cpu 0 == empty)
smp:csd: cnt(01d1337): ->0006 hdlend (src cpu 0 == early)
smp:csd: cnt(01d16cb): 0012->0005 ipi
smp:csd: cnt(01d16cc): 0012->0006 queue
smp:csd: cnt(01d16cd): 0012->0006 ipi
smp:csd: cnt(01d16f3): 0012->001a ipi
smp:csd: cnt(01d16f4): 0012-> ping
smp:csd: 

Re: [PATCH 2/2] Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

2021-03-23 Thread Jürgen Groß

On 17.03.21 12:04, Roger Pau Monne wrote:

This partially reverts commit 882213990d32fd224340a4533f6318dd152be4b2.

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.


Can you add some words here that XEN_UNPOPULATED_ALLOC depends on
MEMORY_HOTPLUG via ZONE_DEVICE?


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v6 00/12] x86: major paravirt cleanup

2021-03-11 Thread Jürgen Groß

On 11.03.21 13:51, Borislav Petkov wrote:

On Thu, Mar 11, 2021 at 01:50:26PM +0100, Borislav Petkov wrote:

and move the cleanups patches 13 and 14 to the beginning of the set?


Yeah, 14 needs ALTERNATIVE_TERNARY so I guess after patch 5, that is.


I'm putting 13 at the begin of the series and 14 after 5.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v6 05/12] x86/alternative: support ALTERNATIVE_TERNARY

2021-03-10 Thread Jürgen Groß

On 10.03.21 15:27, Borislav Petkov wrote:

On Tue, Mar 09, 2021 at 02:48:06PM +0100, Juergen Gross wrote:

diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index 89889618ae01..4fb844e29d26 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -178,6 +178,9 @@ static inline int alternatives_text_reserved(void *start, 
void *end)
ALTINSTR_REPLACEMENT(newinstr2, 2)  \
".popsection\n"
  
+#define ALTERNATIVE_TERNARY(oldinstr, feature, newinstr1, newinstr2)	\
+   ALTERNATIVE_2(oldinstr, newinstr2, X86_FEATURE_ALWAYS, newinstr1, feature)


Make that:

/*
  * If @feature is set, patch @newinstr_yes, else @newinstr_no
  */
#define ALTERNATIVE_TERNARY(oldinstr, feature, newinstr_yes, newinstr_no) \
 ALTERNATIVE_2(oldinstr, newinstr_no, X86_FEATURE_ALWAYS, newinstr_yes, feature)

and in alternative-asm.h too pls.


Okay.



Regardless, this looks nice! :)


Thanks,


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v6 04/12] x86/alternative: support not-feature

2021-03-10 Thread Jürgen Groß

On 10.03.21 15:15, Borislav Petkov wrote:

On Wed, Mar 10, 2021 at 08:52:40AM +0100, Jürgen Groß wrote:

Did you look at patch 13? :-)


Well, I usually review in increasing patch order. :-P

But make that change here pls because otherwise unnecessary churn.


Okay.


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v6 04/12] x86/alternative: support not-feature

2021-03-09 Thread Jürgen Groß

On 10.03.21 07:07, Borislav Petkov wrote:

On Tue, Mar 09, 2021 at 02:48:05PM +0100, Juergen Gross wrote:

Add support for alternative patching for the case a feature is not
present on the current cpu.

For users of ALTERNATIVE() and friends an inverted feature is specified
by applying the ALT_NOT() macro to it, e.g.:

ALTERNATIVE(old, new, ALT_NOT(feature))

Signed-off-by: Juergen Gross 
---
V5:
- split off from next patch
- reworked to use flag byte (Boris Petkov)
V6:
- rework again to not use flag byte (Boris Petkov)
---
  arch/x86/include/asm/alternative-asm.h |  3 +++
  arch/x86/include/asm/alternative.h |  3 +++
  arch/x86/kernel/alternative.c  | 19 ++-
  3 files changed, 20 insertions(+), 5 deletions(-)


LGTM, minor touchups I'd do ontop:

---

diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index 89889618ae01..fd205cdcfbad 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -55,18 +55,18 @@
".long 999b - .\n\t"  \
".popsection\n\t"
  
+#define ALTINSTR_FLAG_INV	(1 << 15)

+#define ALT_NOT(feat)  ((feat) | ALTINSTR_FLAG_INV)
+
  struct alt_instr {
s32 instr_offset;   /* original instruction */
s32 repl_offset;/* offset to replacement instruction */
u16 cpuid;  /* cpuid bit set for replacement */
-#define ALTINSTR_FLAG_INV (1 << 15)
u8  instrlen;   /* length of original instruction */
u8  replacementlen; /* length of new instruction */
u8  padlen; /* length of build-time padding */
  } __packed;
  
-#define ALT_NOT(feat)	((feat) | ALTINSTR_FLAG_INV)

-


Did you look at patch 13? :-)


  /*
   * Debug flag that can be tested to see whether alternative
   * instructions were patched in already:
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d8e669a1546f..133b549dc091 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -397,9 +397,10 @@ void __init_or_module noinline apply_alternatives(struct 
alt_instr *start,
BUG_ON(feature >= (NCAPINTS + NBUGINTS) * 32);
  
  		/*

-* Drop out if either:
-* - feature not available, but required, or
-* - feature available, but NOT required
+* Patch if either:
+* - feature is present
+* - feature not present but ALTINSTR_FLAG_INV is set to mean,
+*   patch if feature is *NOT* present.


Okay.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v6 02/12] x86/paravirt: switch time pvops functions to use static_call()

2021-03-09 Thread Jürgen Groß

On 09.03.21 19:57, Borislav Petkov wrote:

On Tue, Mar 09, 2021 at 02:48:03PM +0100, Juergen Gross wrote:

@@ -167,6 +168,17 @@ static u64 native_steal_clock(int cpu)
return 0;
  }
  
+DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);

+DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+
+bool paravirt_using_native_sched_clock = true;
+
+void paravirt_set_sched_clock(u64 (*func)(void))
+{
+   static_call_update(pv_sched_clock, func);
+   paravirt_using_native_sched_clock = (func == native_sched_clock);
+}


What's the point of this function if there's a global
paravirt_using_native_sched_clock variable now?


It is combining the two needed actions: update the static call and
set the paravirt_using_native_sched_clock boolean.


Looking at how the bit of information on whether native_sched_clock is
used is needed in tsc.c, it would probably be cleaner if you added a

set_sched_clock_native(void);

or so to tsc.c instead and called that here, making that long variable
name a shorter and static one in tsc.c instead.


I need to transfer a boolean value, so it would need to be

set_sched_clock_native(bool state);

In the end the difference is only marginal IMO.

Just had another idea: I could add a function to static_call.h for
querying the current function. This would avoid the double book keeping
and could probably be used later when switching other pv_ops calls to
static_call, too (e.g. pv_is_native_spin_unlock()).

What do you think?


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v5 11/12] x86/paravirt: switch functions with custom code to ALTERNATIVE

2021-03-08 Thread Jürgen Groß

On 08.03.21 19:30, Borislav Petkov wrote:

On Mon, Mar 08, 2021 at 01:28:43PM +0100, Juergen Gross wrote:

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 36cd71fa097f..04b3067f31b5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -137,7 +137,8 @@ static inline void write_cr0(unsigned long x)
  
  static inline unsigned long read_cr2(void)

  {
-   return PVOP_CALLEE0(unsigned long, mmu.read_cr2);
+   return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2,
+   "mov %%cr2, %%rax;", ~X86_FEATURE_XENPV);


Just some cursory poking first - indepth review later.

Do I see this correctly that the negated feature can be expressed with, to use
this example here:

ALTERNATIVE_TERNARY(mmu.read_cr2, X86_FEATURE_XENPV, "", "mov %%cr2, 
%%rax;");

?


No.

This would leave the Xen-pv case with a nop, while we need it to call
mmu.read_cr2().

In the Xen-pv case there must be _no_ alternative patching in order to
have the paravirt patching do its patching (indirect->direct call).

This is exactly the reason why I need to "not feature".

The only other solution I can think of would be a "split static_call"
handling using ALTERNATIVE_TERNARY():

ALTERNATIVE_TERNARY(initial_static_call(mmu.read_cr2),
X86_FEATURE_XENPV,
final_static_call(mmu.read_cr2),
"mov %%cr2, %%rax;");

with initial_static_call() doing an indirect call, while
final_static_call() would do a direct call.

Not sure we really want that.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v5 02/12] x86/paravirt: switch time pvops functions to use static_call()

2021-03-08 Thread Jürgen Groß

On 08.03.21 18:00, Boris Ostrovsky wrote:


On 3/8/21 7:28 AM, Juergen Gross wrote:

--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -379,11 +379,6 @@ void xen_timer_resume(void)
}
  }
  
-static const struct pv_time_ops xen_time_ops __initconst = {

-   .sched_clock = xen_sched_clock,
-   .steal_clock = xen_steal_clock,
-};
-
  static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
  static u64 xen_clock_value_saved;
  
@@ -528,7 +523,8 @@ static void __init xen_time_init(void)

  void __init xen_init_time_ops(void)
  {
xen_sched_clock_offset = xen_clocksource_read();
-   pv_ops.time = xen_time_ops;
+   static_call_update(pv_steal_clock, xen_steal_clock);
+   paravirt_set_sched_clock(xen_sched_clock);
  
  	x86_init.timers.timer_init = xen_time_init;

x86_init.timers.setup_percpu_clockev = x86_init_noop;
@@ -570,7 +566,8 @@ void __init xen_hvm_init_time_ops(void)
}
  
  	xen_sched_clock_offset = xen_clocksource_read();

-   pv_ops.time = xen_time_ops;
+   static_call_update(pv_steal_clock, xen_steal_clock);
+   paravirt_set_sched_clock(xen_sched_clock);
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;



There is a bunch of stuff that's common between the two cases so it can be 
factored out.


Yes.




  
diff --git a/drivers/xen/time.c b/drivers/xen/time.c

index 108edbcbc040..152dd33bb223 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -7,6 +7,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -175,7 +176,7 @@ void __init xen_time_setup_guest(void)
xen_runstate_remote = !HYPERVISOR_vm_assist(VMASST_CMD_enable,
VMASST_TYPE_runstate_update_flag);
  
-	pv_ops.time.steal_clock = xen_steal_clock;

+   static_call_update(pv_steal_clock, xen_steal_clock);
  



Do we actually need this? We've already set this up in xen_init_time_ops(). 
(But maybe for ARM).


Correct. Arm needs this.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v4 2/3] xen/events: don't unmask an event channel when an eoi is pending

2021-03-08 Thread Jürgen Groß

On 08.03.21 21:33, Boris Ostrovsky wrote:


On 3/6/21 11:18 AM, Juergen Gross wrote:

An event channel should be kept masked when an eoi is pending for it.
When being migrated to another cpu it might be unmasked, though.

In order to avoid this keep three different flags for each event channel
to be able to distinguish "normal" masking/unmasking from eoi related
masking/unmasking and temporary masking. The event channel should only
be able to generate an interrupt if all flags are cleared.

Cc: sta...@vger.kernel.org
Fixes: 54c9de89895e0a36047 ("xen/events: add a new late EOI evtchn framework")
Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 
Reviewed-by: Julien Grall 
---
V2:
- introduce a lock around masking/unmasking
- merge patch 3 into this one (Jan Beulich)
V4:
- don't set eoi masking flag in lateeoi_mask_ack_dynirq()



Reviewed-by: Boris Ostrovsky 


Ross, are you planning to test this?


Just as another data point: With the previous version of the patches
a reboot loop of a guest needed at most 33 reboots to lose network in
my tests (those were IIRC 6 test runs). With this patch version I
stopped the test after about 1300 reboots without having seen any
problems.

Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v3 2/8] xen/events: don't unmask an event channel when an eoi is pending

2021-03-06 Thread Jürgen Groß

On 23.02.21 10:26, Ross Lagerwall wrote:

On 2021-02-19 15:40, Juergen Gross wrote:

An event channel should be kept masked when an eoi is pending for it.
When being migrated to another cpu it might be unmasked, though.

In order to avoid this keep three different flags for each event channel
to be able to distinguish "normal" masking/unmasking from eoi related
masking/unmasking and temporary masking. The event channel should only
be able to generate an interrupt if all flags are cleared.

Cc: sta...@vger.kernel.org
Fixes: 54c9de89895e0a36047 ("xen/events: add a new late EOI evtchn framework")
Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 


I tested this patch series backported to a 4.19 kernel and found that
when doing a reboot loop of Windows with PV drivers, occasionally it will
end up in a state with some event channels pending and masked in dom0
which breaks networking in the guest.

The issue seems to have been introduced with this patch, though at first
glance it appears correct. I haven't yet looked into why it is happening.
Have you seen anything like this with this patch?


I have found the issue. lateeoi_mask_ack_dynirq() must not set the "eoi"
mask reason flag, as this callback is invoked in cases where the handler
will not run afterwards, so xen_irq_lateeoi() would never be called to
unmask the event channel again.

Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v3 2/8] xen/events: don't unmask an event channel when an eoi is pending

2021-03-05 Thread Jürgen Groß

On 23.02.21 10:26, Ross Lagerwall wrote:

On 2021-02-19 15:40, Juergen Gross wrote:

An event channel should be kept masked when an eoi is pending for it.
When being migrated to another cpu it might be unmasked, though.

In order to avoid this keep three different flags for each event channel
to be able to distinguish "normal" masking/unmasking from eoi related
masking/unmasking and temporary masking. The event channel should only
be able to generate an interrupt if all flags are cleared.

Cc: sta...@vger.kernel.org
Fixes: 54c9de89895e0a36047 ("xen/events: add a new late EOI evtchn framework")
Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 


I tested this patch series backported to a 4.19 kernel and found that
when doing a reboot loop of Windows with PV drivers, occasionally it will
end up in a state with some event channels pending and masked in dom0
which breaks networking in the guest.

The issue seems to have been introduced with this patch, though at first
glance it appears correct. I haven't yet looked into why it is happening.
Have you seen anything like this with this patch?


Sorry it took so long, but now I was able to look into this issue.

I have managed to reproduce it with a pv Linux guest. I'm now adding
some debug code to understand what is happening there.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH] efi: x86/xen: fix -Wmissing-prototypes warning

2021-03-03 Thread Jürgen Groß

On 03.03.21 10:36, maqiang wrote:

We get 1 warning when building kernel with W=1:
arch/x86/xen/efi.c:130:13: warning:
  no previous prototype for ‘xen_efi_init’ [-Wmissing-prototypes]
  void __init xen_efi_init(struct boot_params *boot_params)

In fact, this function is declared as a static inline function
in the header file, but is not marked as static inline
in the source file.
So this patch marks this function with 'static inline'.

Signed-off-by: maqiang 
---
  arch/x86/xen/efi.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/xen/efi.c b/arch/x86/xen/efi.c
index 7d7ffb9c826a..cf2e9ff3866d 100644
--- a/arch/x86/xen/efi.c
+++ b/arch/x86/xen/efi.c
@@ -127,7 +127,7 @@ static enum efi_secureboot_mode xen_efi_get_secureboot(void)
return efi_secureboot_mode_enabled;
  }
  
-void __init xen_efi_init(struct boot_params *boot_params)

+static inline void __init xen_efi_init(struct boot_params *boot_params)


This is an absolutely wrong "fix". You are breaking a normal build
as xen_efi_init() will no longer be callable from other sources.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 3/3] kernel/smp: add more data to CSD lock debugging

2021-03-02 Thread Jürgen Groß

On 01.03.21 18:07, Peter Zijlstra wrote:

On Mon, Mar 01, 2021 at 11:13:36AM +0100, Juergen Gross wrote:

In order to help identifying problems with IPI handling and remote
function execution add some more data to IPI debugging code.

There have been multiple reports of cpus looping long times (many
seconds) in smp_call_function_many() waiting for another cpu executing
a function like tlb flushing. Most of these reports have been for
cases where the kernel was running as a guest on top of KVM or Xen
(there are rumours of that happening under VMWare, too, and even on
bare metal).

Finding the root cause hasn't been successful yet, even after more than
2 years of chasing this bug by different developers.

Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
diagnostics") tried to address this by adding some debug code and by
issuing another IPI when a hang was detected. This helped mitigating
the problem (the repeated IPI unlocks the hang), but the root cause is
still unknown.

Current available data suggests that either an IPI wasn't sent when it
should have been, or that the IPI didn't result in the target cpu
executing the queued function (due to the IPI not reaching the cpu,
the IPI handler not being called, or the handler not seeing the queued
request).

Try to add more diagnostic data by introducing a global atomic counter
which is being incremented when doing critical operations (before and
after queueing a new request, when sending an IPI, and when dequeueing
a request). The counter value is stored in percpu variables which can
be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source
cpu, target cpu, and event type) is stored in a global variable. When
a new event is to be traced, the data of the last event is stored in
the event related percpu location and the global data is updated with
the new event's data. This allows tracking two events in one data
location: one by the value of the event data (the event before the
current one), and one by the location itself (the current event).

A typical printout with a detected hang will look like this:

csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for
CPU#06 scf_handler_1+0x0/0x50(0xa2a881bb1410).
csd: CSD lock (#1) handling prior
scf_handler_1+0x0/0x50(0xa2a8813823c0) request.
 csd: cnt(00008cc): 0000->0000 dequeue (src cpu 0 == empty)
 csd: cnt(00008cd): 0000->0006 idle
 csd: cnt(0003668): 0001->0006 queue
 csd: cnt(0003669): 0001->0006 ipi
 csd: cnt(0003e0f): 0007->000a queue
 csd: cnt(0003e10): 0001->0000 ping
 csd: cnt(0003e71): 0003->0000 ping
 csd: cnt(0003e72): 0000->0006 gotipi
 csd: cnt(0003e73): 0000->0006 handle
 csd: cnt(0003e74): 0000->0006 dequeue (src cpu 0 == empty)
 csd: cnt(0003e7f): 0004->0006 ping
 csd: cnt(0003e80): 0001->0000 pinged
 csd: cnt(0003eb2): 0005->0001 noipi
 csd: cnt(0003eb3): 0001->0006 queue
 csd: cnt(0003eb4): 0001->0006 noipi
 csd: cnt now: 0003f00

The idea is to print only relevant entries. Those are all events which
are associated with the hang (so sender side events for the source cpu
of the hanging request, and receiver side events for the target cpu),
and the related events just before those (for adding data needed to
identify a possible race). Printing all available data would be
possible, but this would add large amounts of data printed on larger
configurations.

Signed-off-by: Juergen Gross 
Tested-by: Paul E. McKenney 
---
V2:
- add automatic data deciphering and sorting of entries
- add new trace point for leaving flush_smp_call_function_queue()
- add information when finding an empty call_single_queue


They do not apply on top of these:

   https://lkml.kernel.org/r/20210220231712.2475218-2-na...@vmware.com

:-/


@@ -290,6 +476,19 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, 
csd_data);
  
  void __smp_call_single_queue(int cpu, struct llist_node *node)

  {
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+   if (static_branch_unlikely(&csdlock_debug_extended)) {
+   unsigned int type;
+
+   type = CSD_TYPE(container_of(node, call_single_data_t,
+node.llist));
+   if (type == CSD_TYPE_SYNC || type == CSD_TYPE_ASYNC) {
+   __smp_call_single_queue_debug(cpu, node);
+   return;
+   }
+   }
+#endif


This really ought to be in generic_exec_single(), because there we know
the type matches.


You are right. This is much better.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v1] xen: ACPI: Get rid of ACPICA message printing

2021-03-01 Thread Jürgen Groß

On 01.03.21 17:16, Boris Ostrovsky wrote:


On 3/1/21 9:11 AM, Rafael J. Wysocki wrote:

On Sun, Feb 28, 2021 at 2:49 AM Boris Ostrovsky
 wrote:


On 2/24/21 1:47 PM, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki 

The ACPI_DEBUG_PRINT() macro is used in a few places in
xen-acpi-cpuhotplug.c and xen-acpi-memhotplug.c for printing debug
messages, but that is questionable, because that macro belongs to
ACPICA and it should not be used elsewhere.  In addition,
ACPI_DEBUG_PRINT() requires special enabling to allow it to actually
print the message and the _COMPONENT symbol generally needed for
that is not defined in any of the files in question.

For this reason, replace all of the ACPI_DEBUG_PRINT() instances in
the Xen code with acpi_handle_debug() (with the additional benefit
that the source object can be identified more easily after this
change) and drop the ACPI_MODULE_NAME() definitions that are only
used by the ACPICA message printing macros from that code.

Signed-off-by: Rafael J. Wysocki 
---
  drivers/xen/xen-acpi-cpuhotplug.c |   12 +---
  drivers/xen/xen-acpi-memhotplug.c |   16 +++-


As I was building with this patch I (re-)discovered that since 2013 it depends 
on BROKEN (commit 76fc253723add). Despite commit message there saying that it's 
a temporary patch it seems to me by now that it's more than that.


And clearly no one tried to build these files since at least 2015 because 
the memhotplug file doesn't compile due to commit cfafae940381207.


While this is easily fixable, the question is whether we want to keep these 
files. Obviously no one cares about this functionality.

Well, I would be for dropping them.

Do you want me to send a patch to do that?



Yes, if you don't mind (but let's give this a few days for people to have a 
chance to comment).


I'm fine with removing those files.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 3/3] kernel/smp: add more data to CSD lock debugging

2021-03-01 Thread Jürgen Groß

On 01.03.21 18:07, Peter Zijlstra wrote:

On Mon, Mar 01, 2021 at 11:13:36AM +0100, Juergen Gross wrote:

In order to help identifying problems with IPI handling and remote
function execution add some more data to IPI debugging code.

There have been multiple reports of cpus looping long times (many
seconds) in smp_call_function_many() waiting for another cpu executing
a function like tlb flushing. Most of these reports have been for
cases where the kernel was running as a guest on top of KVM or Xen
(there are rumours of that happening under VMWare, too, and even on
bare metal).

Finding the root cause hasn't been successful yet, even after more than
2 years of chasing this bug by different developers.

Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
diagnostics") tried to address this by adding some debug code and by
issuing another IPI when a hang was detected. This helped mitigate
the problem (the repeated IPI unlocks the hang), but the root cause is
still unknown.

Current available data suggests that either an IPI wasn't sent when it
should have been, or that the IPI didn't result in the target cpu
executing the queued function (due to the IPI not reaching the cpu,
the IPI handler not being called, or the handler not seeing the queued
request).

Try to add more diagnostic data by introducing a global atomic counter
which is being incremented when doing critical operations (before and
after queueing a new request, when sending an IPI, and when dequeueing
a request). The counter value is stored in percpu variables which can
be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source
cpu, target cpu, and event type) is stored in a global variable. When
a new event is to be traced, the data of the last event is stored in
the event related percpu location and the global data is updated with
the new event's data. This allows tracking two events in one data
location: one by the value of the event data (the event before the
current one), and one by the location itself (the current event).

A typical printout with a detected hang will look like this:

csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for
CPU#06 scf_handler_1+0x0/0x50(0xa2a881bb1410).
csd: CSD lock (#1) handling prior
scf_handler_1+0x0/0x50(0xa2a8813823c0) request.
 csd: cnt(00008cc): 0000->0000 dequeue (src cpu 0 == empty)
 csd: cnt(00008cd): 0000->0006 idle
 csd: cnt(0003668): 0001->0006 queue
 csd: cnt(0003669): 0001->0006 ipi
 csd: cnt(0003e0f): 0007->000a queue
 csd: cnt(0003e10): 0001->0000 ping
 csd: cnt(0003e71): 0003->0000 ping
 csd: cnt(0003e72): 0000->0006 gotipi
 csd: cnt(0003e73): 0000->0006 handle
 csd: cnt(0003e74): 0000->0006 dequeue (src cpu 0 == empty)
 csd: cnt(0003e7f): 0004->0006 ping
 csd: cnt(0003e80): 0001->0000 pinged
 csd: cnt(0003eb2): 0005->0001 noipi
 csd: cnt(0003eb3): 0001->0006 queue
 csd: cnt(0003eb4): 0001->0006 noipi
 csd: cnt now: 0003f00

The idea is to print only relevant entries. Those are all events which
are associated with the hang (so sender side events for the source cpu
of the hanging request, and receiver side events for the target cpu),
and the related events just before those (for adding data needed to
identify a possible race). Printing all available data would be
possible, but this would add large amounts of data printed on larger
configurations.
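The counting scheme described above can be sketched in plain C. The names, the fixed CPU count, and the plain arrays standing in for the kernel's percpu machinery are all illustrative assumptions, not the actual patch code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the tracing scheme: a global atomic sequence
 * counter, one "last event" slot per cpu (stands in for percpu data),
 * and a global slot holding the most recent event. */

#define NCPUS 8

struct csd_event {
	unsigned long seq;  /* global sequence counter value */
	int src, dst;       /* source and target cpu */
	int type;           /* event type (queue, ipi, ...) */
};

static atomic_ulong csd_seq;                  /* global atomic counter */
static struct csd_event percpu_event[NCPUS];  /* per-cpu "last event" slots */
static struct csd_event global_event;         /* data of the last event */

/* Record a new event: the previous global event is saved into the percpu
 * slot related to it, then the global slot is overwritten.  Each location
 * thereby tracks two events: its stored value (the event before the
 * current one) and its role as holder of the current event. */
static void csd_record(int src, int dst, int type)
{
	percpu_event[global_event.src] = global_event;
	global_event.seq = atomic_fetch_add(&csd_seq, 1);
	global_event.src = src;
	global_event.dst = dst;
	global_event.type = type;
}
```

With this layout a hang report can dump both `global_event` and the percpu slots to reconstruct an ordered history, as in the printout above.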

Signed-off-by: Juergen Gross 
Tested-by: Paul E. McKenney 
---
V2:
- add automatic data deciphering and sorting of entries
- add new trace point for leaving flush_smp_call_function_queue()
- add information when finding an empty call_single_queue


They do not apply on top of these:

   https://lkml.kernel.org/r/20210220231712.2475218-2-na...@vmware.com


They are already in tip locking/core (Ingo applied them).


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 3/3] kernel/smp: add more data to CSD lock debugging

2021-03-01 Thread Jürgen Groß

On 01.03.21 16:45, Peter Zijlstra wrote:

On Sun, Feb 28, 2021 at 04:07:04PM -0800, Paul E. McKenney wrote:

On Fri, Feb 26, 2021 at 10:05:51PM +0100, Peter Zijlstra wrote:

On Fri, Feb 26, 2021 at 10:12:05AM -0800, Paul E. McKenney wrote:

On Fri, Feb 26, 2021 at 05:38:44PM +0100, Peter Zijlstra wrote:


I hate all of this, but if this will finally catch the actual problem,
we can then revert all this, so sure.


OK, I will bite...  What exactly do you hate about it?


It's ugly and messy. We're basically committing a debug session. Normally
gunk like this is done in private trees, then we find the problem and
fix that and crap like this never sees the light of day.


Is your hatred due to the presence of debug code at all, or a belief that
this code is unlikely to be useful in any subsequent IPI-related bug hunt?


The clutter, mostly I think. There's a lot of debug sprinkled about.


I agree.

In case we are able to identify the root cause of the problem I think
it would be no problem to revert this patch and put a comment into smp.c
naming the commit-id of this patch and what it was good for. This will
enable future bug hunters to make use of the patch without spoiling the
code for others.


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: WARNING: modpost: vmlinux.o(.text+0x1a8edb8): Section mismatch in reference from the function stop_machine() to the function .init.text:intel_rng_hw_init()

2021-02-24 Thread Jürgen Groß

On 24.02.21 15:20, kernel test robot wrote:

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   c03c21ba6f4e95e406a1a7b4c34ef334b977c194
commit: ab234a260b1f625b26cbefa93ca365b0ae66df33 x86/pv: Rework 
arch_local_irq_restore() to not use popf
date:   2 weeks ago
config: x86_64-randconfig-a005-20210223 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 
f14a14dd2564703db02f80c00db8ae492b594f77)
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # install x86_64 cross compiling tool for clang build
 # apt-get install binutils-x86-64-linux-gnu
 # 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab234a260b1f625b26cbefa93ca365b0ae66df33
 git remote add linus 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
 git fetch --no-tags linus master
 git checkout ab234a260b1f625b26cbefa93ca365b0ae66df33
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>, old ones prefixed by <<):


WARNING: modpost: vmlinux.o(.text+0x1a8edb8): Section mismatch in reference 
from the function stop_machine() to the function .init.text:intel_rng_hw_init()

The function stop_machine() references
the function __init intel_rng_hw_init().
This is often because stop_machine lacks a __init
annotation or the annotation of intel_rng_hw_init is wrong.


I'd be very interested to know how the identified patch would be able to
have this effect.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [RFC][PATCH 6/6] objtool,x86: Rewrite retpoline thunk calls

2021-02-20 Thread Jürgen Groß

On 20.02.21 23:32, Peter Zijlstra wrote:

On Sat, Feb 20, 2021 at 06:41:01PM +0100, Borislav Petkov wrote:

  - if we had negative alternatives objtool doesn't need to actually
rewrite code in this case. It could simply emit alternative entries
and call it a day.


I don't mind the negative alt per se - I mind the implementation I saw.
I'm sure we can come up with something nicer, like, for example, struct
alt_instr.flags to denote that this feature is a NOT feature.


So you don't like the ~ or - on cpuid? ISTR we talked about
alt_instr::flags before, but Google isn't playing ball today so I can't
seem to find it.


If you want I can cook up a patch and include it in my paravirt
cleanup series.

Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 8/8] xen/evtchn: use READ/WRITE_ONCE() for accessing ring indices

2021-02-18 Thread Jürgen Groß

On 17.02.21 14:29, Ross Lagerwall wrote:

On 2021-02-11 10:16, Juergen Gross wrote:

To avoid read and write tearing by the compiler, use READ_ONCE()
and WRITE_ONCE() for accessing the ring indices in evtchn.c.

Signed-off-by: Juergen Gross 
---
V2:
- modify all accesses (Julien Grall)
---
  drivers/xen/evtchn.c | 25 -
  1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..620008f89dbe 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -162,6 +162,7 @@ static irqreturn_t evtchn_interrupt(int irq, void *data)
  {
struct user_evtchn *evtchn = data;
struct per_user_data *u = evtchn->user;
+   unsigned int prod, cons;
  
  	WARN(!evtchn->enabled,

 "Interrupt for port %u, but apparently not enabled; per-user %p\n",
@@ -171,10 +172,14 @@ static irqreturn_t evtchn_interrupt(int irq, void *data)
  
   	spin_lock(&u->ring_prod_lock);
  
-	if ((u->ring_prod - u->ring_cons) < u->ring_size) {

-   *evtchn_ring_entry(u, u->ring_prod) = evtchn->port;
+   prod = READ_ONCE(u->ring_prod);
+   cons = READ_ONCE(u->ring_cons);
+
+   if ((prod - cons) < u->ring_size) {
+   *evtchn_ring_entry(u, prod) = evtchn->port;
smp_wmb(); /* Ensure ring contents visible */
-   if (u->ring_cons == u->ring_prod++) {
+   if (cons == prod++) {
+   WRITE_ONCE(u->ring_prod, prod);
			wake_up_interruptible(&u->evtchn_wait);
			kill_fasync(&u->evtchn_async_queue,
SIGIO, POLL_IN);


This doesn't work correctly since now u->ring_prod is only updated if cons == 
prod++.


Right. Thanks for noticing.


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 3/8] xen/events: avoid handling the same event on two cpus at the same time

2021-02-18 Thread Jürgen Groß

On 15.02.21 22:35, Boris Ostrovsky wrote:


On 2/11/21 5:16 AM, Juergen Gross wrote:


@@ -622,6 +623,7 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, 
bool spurious)
}
  
  	info->eoi_time = 0;

+   smp_store_release(&info->is_active, 0);



Can this be done in lateeoi_ack_dynirq()/lateeoi_mask_ack_dynirq(), after we've 
masked the channel? Then it will be consistent with how other chips do it, 
especially with the new helper.


Yes, I think this should work.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 3/8] xen/events: avoid handling the same event on two cpus at the same time

2021-02-14 Thread Jürgen Groß

On 14.02.21 22:34, Julien Grall wrote:

Hi Juergen,

On 11/02/2021 10:16, Juergen Gross wrote:

When changing the cpu affinity of an event it can happen today that
(with some unlucky timing) the same event will be handled on the old
and the new cpu at the same time.

Avoid that by adding an "event active" flag to the per-event data and
call the handler only if this flag isn't set.

Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 
---
V2:
- new patch
---
  drivers/xen/events/events_base.c | 19 +++
  1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/events/events_base.c 
b/drivers/xen/events/events_base.c

index e157e7506830..f7e22330dcef 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -102,6 +102,7 @@ struct irq_info {
  #define EVT_MASK_REASON_EXPLICIT    0x01
  #define EVT_MASK_REASON_TEMPORARY    0x02
  #define EVT_MASK_REASON_EOI_PENDING    0x04
+    u8 is_active;    /* Is event just being handled? */
  unsigned irq;
  evtchn_port_t evtchn;   /* event channel */
  unsigned short cpu; /* cpu bound */
@@ -622,6 +623,7 @@ static void xen_irq_lateeoi_locked(struct irq_info 
*info, bool spurious)

  }
  info->eoi_time = 0;
+    smp_store_release(&info->is_active, 0);
  do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
  }
@@ -809,13 +811,15 @@ static void pirq_query_unmask(int irq)
  static void eoi_pirq(struct irq_data *data)
  {
-    evtchn_port_t evtchn = evtchn_from_irq(data->irq);
+    struct irq_info *info = info_for_irq(data->irq);
+    evtchn_port_t evtchn = info ? info->evtchn : 0;
  struct physdev_eoi eoi = { .irq = pirq_from_irq(data->irq) };
  int rc = 0;
  if (!VALID_EVTCHN(evtchn))
  return;
+    smp_store_release(&info->is_active, 0);


Would you mind to explain why you are using the release semantics?


It is basically releasing a lock. So release semantics seem to be
appropriate.

It is also not clear to me if there are any expected ordering between 
clearing is_active and clearing the pending bit.


No, I don't think there is a specific ordering required. is_active is
just guarding against two simultaneous IRQ handler calls for the same
event being active. Clearing the pending bit is not part of the guarded
section.




  clear_evtchn(evtchn);



The 2 lines here seems to be a common pattern in this patch. So I would 
suggest to create a new helper.


Okay.
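The guard being discussed boils down to winning the 0->1 transition of `is_active` before running the handler, and clearing it with release semantics afterwards. A minimal model with C11 atomics (illustrative names and a CAS in place of the kernel's helpers, not the actual events_base.c code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of the "event active" guard: the handler only runs when this
 * caller wins the 0->1 transition; a concurrent (spurious) delivery on
 * another cpu loses the race and is ignored. */

struct event {
	atomic_uchar is_active;
	int handled;  /* counts actual handler invocations */
};

/* Returns 1 if this caller may handle the event, 0 if it is already
 * being handled elsewhere. */
static int event_try_handle(struct event *ev)
{
	unsigned char expected = 0;

	if (!atomic_compare_exchange_strong(&ev->is_active, &expected, 1))
		return 0;
	ev->handled++;
	return 1;
}

/* Clearing uses release semantics, mirroring the smp_store_release()
 * in the patch: all handler work is visible before the event can be
 * taken again. */
static void event_done(struct event *ev)
{
	atomic_store_explicit(&ev->is_active, 0, memory_order_release);
}
```

A second delivery between `event_try_handle()` and `event_done()` is simply dropped, which is exactly the double-handling window the patch closes.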


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH] arm/xen: Don't probe xenbus as part of an early initcall

2021-02-10 Thread Jürgen Groß

On 10.02.21 18:06, Julien Grall wrote:

From: Julien Grall 

After Commit 3499ba8198cad ("xen: Fix event channel callback via
INTX/GSI"), xenbus_probe() will be called too early on Arm. This will
lead to a guest hang during boot.

If the hang wasn't there, we would have ended up calling
xenbus_probe() twice (the second time is in xenbus_probe_initcall()).

We don't need to initialize xenbus_probe() early for Arm guest.
Therefore, the call in xen_guest_init() is now removed.

After this change, there is no more external caller for xenbus_probe().
So the function is turned into a static one. Interestingly there were two
prototypes for it.

Fixes: 3499ba8198cad ("xen: Fix event channel callback via INTX/GSI")
Reported-by: Ian Jackson 
Signed-off-by: Julien Grall 


Pushed to xen/tip.git for-linus-5.11


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 15:20, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 13:58, Jürgen Groß wrote:

On 08.02.21 14:09, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 12:31, Jürgen Groß wrote:

On 08.02.21 13:16, Julien Grall wrote:



On 08/02/2021 12:14, Jürgen Groß wrote:

On 08.02.21 11:40, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 10:22, Jürgen Groß wrote:

On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matters here. The idea 
is to re-use what's already existing rather than trying to 
re-invent the wheel with an extra lock (or whatever we can come 
up).


The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would 
help.


Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:

if ( irq in progress )
{
   set IRQS_PENDING
   return;
}

do
{
   clear IRQS_PENDING
   handle_irq()
} while (IRQS_PENDING is set)

IRQ handling flow like handle_fasteoi_irq() looks like:

if ( irq in progress )
   return;

handle_irq()

The latter flow would catch "spurious" interrupt and ignore them. 
So it would handle nicely the race when changing the event affinity.


Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
issued, thus having the same problem again? 


Sorry I can't parse this.


handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is
this condition being reset again in order to be able to process another
IRQ?

It is reset after the handler has been called. See handle_irq_event().


Right. And for us this is too early, as we want the next IRQ being
handled only after we have called xen_irq_lateeoi().


It is not really the next IRQ here. It is more a spurious IRQ because we 
don't clear & mask the event right away. Instead, it is done later in 
the handling.







I believe this will be the case before our "lateeoi" handling is
becoming active (more precise: when our IRQ handler is returning to
handle_fasteoi_irq()), resulting in the possibility of the same race we
are experiencing now.


I am a bit confused what you mean by "lateeoi" handling is becoming 
active. Can you clarify?


See above: the next call of the handler should be allowed only after
xen_irq_lateeoi() for the IRQ has been called.

If the handler is being called earlier we have the race resulting
in the WARN() splats.


I feel it is difficult to understand a race with just words. Can you provide 
a scenario (similar to the one I originally provided) with two vCPUs and 
show how this can happen?


vCPU0                 | vCPU1
                      |
                      | Call xen_rebind_evtchn_to_cpu()
receive event X       |
                      | mask event X
                      | bind to vCPU1
                      | unmask event X
                      |
                      | receive event X
                      |
                      | handle_fasteoi_irq(X)
                      |  -> handle_irq_event()
                      |   -> set IRQD_IN_PROGRESS
                      |   -> evtchn_interrupt()
                      |      -> evtchn->enabled = false
                      |   -> clear IRQD_IN_PROGRESS
handle_fasteoi_irq(X) |
 -> evtchn_interrupt()|
    -> WARN()         |
                      | xen_irq_lateeoi(X)





Note that there are other IRQ flows existing. We should have a look at 
them before trying to fix things ourselves.


Fine with me, but it either needs to fit all use cases (interdomain,
IPI, real interrupts) or we need to have a per-type IRQ flow.


AFAICT, we already used different flow based on the use cases. Before 
2011, we used to use the fasteoi one but this was changed by the 
following commit:


Yes, I know that.



I think we should fix the issue locally first, then we can start to do
a thorough rework planning. It's not as if the needed changes with the
current flow would be so huge, and I'd really like to have a solution
rather sooner than later. Changing the IRQ flow might have other side
effects which need to be excluded by thorough testing.

I agree that we need a solution ASAP. But I am a bit worried about:
   1) Add another lock in that event handling path.


Regarding complexity: it is very simple (just around masking/unmasking
of the event channel). Contention is very unlikely.

   2) Add more complexity in the event handling (it is already fairly 
difficult to reason about the locking/race)


Let see what the local fix look like.


Yes.



Although, the other issue I can see so far is handle_irq_for_port() 
will update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. 
But it is not clear this is what you mean by "becoming active".


As long as a single event can't be handled on multiple cpus at the same
time, there is no locking needed.


Well, it can happen in the current code (see my original scenario). If 
your idea fixes it then fine.


I hope so.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 14:09, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 12:31, Jürgen Groß wrote:

On 08.02.21 13:16, Julien Grall wrote:



On 08/02/2021 12:14, Jürgen Groß wrote:

On 08.02.21 11:40, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 10:22, Jürgen Groß wrote:

On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matters here. The idea 
is to re-use what's already existing rather than trying to 
re-invent the wheel with an extra lock (or whatever we can come up).


The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would help.


Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:

if ( irq in progress )
{
   set IRQS_PENDING
   return;
}

do
{
   clear IRQS_PENDING
   handle_irq()
} while (IRQS_PENDING is set)

IRQ handling flow like handle_fasteoi_irq() looks like:

if ( irq in progress )
   return;

handle_irq()

The latter flow would catch "spurious" interrupt and ignore them. 
So it would handle nicely the race when changing the event affinity.


Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
issued, thus having the same problem again? 


Sorry I can't parse this.


handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is
this condition being reset again in order to be able to process another
IRQ?

It is reset after the handler has been called. See handle_irq_event().


Right. And for us this is too early, as we want the next IRQ being
handled only after we have called xen_irq_lateeoi().




I believe this will be the case before our "lateeoi" handling is
becoming active (more precise: when our IRQ handler is returning to
handle_fasteoi_irq()), resulting in the possibility of the same race we
are experiencing now.


I am a bit confused what you mean by "lateeoi" handling is becoming 
active. Can you clarify?


See above: the next call of the handler should be allowed only after
xen_irq_lateeoi() for the IRQ has been called.

If the handler is being called earlier we have the race resulting
in the WARN() splats.

Note that there are other IRQ flows existing. We should have a look at 
them before trying to fix things ourselves.


Fine with me, but it either needs to fit all use cases (interdomain,
IPI, real interrupts) or we need to have a per-type IRQ flow.

I think we should fix the issue locally first, then we can start to do
a thorough rework planning. It's not as if the needed changes with the
current flow would be so huge, and I'd really like to have a solution
rather sooner than later. Changing the IRQ flow might have other side
effects which need to be excluded by thorough testing.

Although, the other issue I can see so far is handle_irq_for_port() will 
update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it 
is not clear this is what you mean by "becoming active".


As long as a single event can't be handled on multiple cpus at the same
time, there is no locking needed.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 13:16, Julien Grall wrote:



On 08/02/2021 12:14, Jürgen Groß wrote:

On 08.02.21 11:40, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 10:22, Jürgen Groß wrote:

On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matters here. The idea is 
to re-use what's already existing rather than trying to re-invent 
the wheel with an extra lock (or whatever we can come up).


The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would help.


Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:

if ( irq in progress )
{
   set IRQS_PENDING
   return;
}

do
{
   clear IRQS_PENDING
   handle_irq()
} while (IRQS_PENDING is set)

IRQ handling flow like handle_fasteoi_irq() looks like:

if ( irq in progress )
   return;

handle_irq()

The latter flow would catch "spurious" interrupt and ignore them. So 
it would handle nicely the race when changing the event affinity.


Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
issued, thus having the same problem again? 


Sorry I can't parse this.


handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is
this condition being reset again in order to be able to process another
IRQ? I believe this will be the case before our "lateeoi" handling is
becoming active (more precise: when our IRQ handler is returning to
handle_fasteoi_irq()), resulting in the possibility of the same race we
are experiencing now.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 7/7] xen/evtchn: read producer index only once

2021-02-08 Thread Jürgen Groß

On 08.02.21 13:23, Jan Beulich wrote:

On 08.02.2021 13:15, Jürgen Groß wrote:

On 08.02.21 12:54, Jan Beulich wrote:

On 08.02.2021 11:59, Jürgen Groß wrote:

On 08.02.21 11:51, Jan Beulich wrote:

On 08.02.2021 11:41, Jürgen Groß wrote:

On 08.02.21 10:48, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

In evtchn_read() use READ_ONCE() for reading the producer index in
order to avoid the compiler generating multiple accesses.

Signed-off-by: Juergen Gross 
---
 drivers/xen/evtchn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..f6b199b597bf 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -211,7 +211,7 @@ static ssize_t evtchn_read(struct file *file, char __user 
*buf,
goto unlock_out;
 
 		c = u->ring_cons;

-   p = u->ring_prod;
+   p = READ_ONCE(u->ring_prod);
if (c != p)
break;


Why only here and not also in

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

or in evtchn_poll()? I understand it's not needed when
ring_prod_lock is held, but that's not the case in the two
afaics named places. Plus isn't the same then true for
ring_cons and ring_cons_mutex, i.e. aren't the two named
places plus evtchn_interrupt() also in need of READ_ONCE()
for ring_cons?


The problem solved here is the further processing using "p" multiple
times. p must not be silently replaced with u->ring_prod by the
compiler, so I probably should reword the commit message to say:

... in order to not allow the compiler to refetch p.


I still wouldn't understand the change (and the lack of
further changes) then: The first further use of p is
outside the loop, alongside one of c. IOW why would c
then not need treating the same as p?


Its value wouldn't change, as ring_cons is being modified only at
the bottom of this function, and nowhere else (apart from the reset
case, but this can't run concurrently due to ring_cons_mutex).


I also still don't see the difference between latching a
value into a local variable vs a "freestanding" access -
neither are guaranteed to result in exactly one memory
access afaict.


READ_ONCE() is using a pointer to volatile, so any refetching by
the compiler would be a bug.


Of course, but this wasn't my point. I was contrasting

c = u->ring_cons;
p = u->ring_prod;

which you change with

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

which you leave alone.


Can you point out which problem might arise from that?


Not any particular active one. Yet enhancing some accesses
but not others seems to me like a recipe for new problems
down the road.


I already reasoned that the usage of READ_ONCE() is due to storing the
value in a local variable which needs to be kept constant during the
following processing (no refetches by the compiler). This reasoning
very clearly doesn't apply to the other accesses.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 7/7] xen/evtchn: read producer index only once

2021-02-08 Thread Jürgen Groß

On 08.02.21 12:54, Jan Beulich wrote:

On 08.02.2021 11:59, Jürgen Groß wrote:

On 08.02.21 11:51, Jan Beulich wrote:

On 08.02.2021 11:41, Jürgen Groß wrote:

On 08.02.21 10:48, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

In evtchn_read() use READ_ONCE() for reading the producer index in
order to avoid the compiler generating multiple accesses.

Signed-off-by: Juergen Gross 
---
drivers/xen/evtchn.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..f6b199b597bf 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -211,7 +211,7 @@ static ssize_t evtchn_read(struct file *file, char __user *buf,
 			goto unlock_out;
 
 		c = u->ring_cons;
-		p = u->ring_prod;
+		p = READ_ONCE(u->ring_prod);
 		if (c != p)
 			break;


Why only here and not also in

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

or in evtchn_poll()? I understand it's not needed when
ring_prod_lock is held, but that's not the case in the two
afaics named places. Plus isn't the same then true for
ring_cons and ring_cons_mutex, i.e. aren't the two named
places plus evtchn_interrupt() also in need of READ_ONCE()
for ring_cons?


The problem solved here is the further processing using "p" multiple
times. p must not be silently replaced with u->ring_prod by the
compiler, so I probably should reword the commit message to say:

... in order to not allow the compiler to refetch p.


I still wouldn't understand the change (and the lack of
further changes) then: The first further use of p is
outside the loop, alongside one of c. IOW why would c
then not need treating the same as p?


Its value wouldn't change, as ring_cons is being modified only at
the bottom of this function, and nowhere else (apart from the reset
case, but this can't run concurrently due to ring_cons_mutex).


I also still don't see the difference between latching a
value into a local variable vs a "freestanding" access -
neither are guaranteed to result in exactly one memory
access afaict.


READ_ONCE() is using a pointer to volatile, so any refetching by
the compiler would be a bug.


Of course, but this wasn't my point. I was contrasting

c = u->ring_cons;
p = u->ring_prod;

which you change with

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

which you leave alone.


Can you point out which problem might arise from that?


Juergen




Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 11:40, Julien Grall wrote:

Hi Juergen,

On 08/02/2021 10:22, Jürgen Groß wrote:

On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matters here. The idea is to 
re-use what's already existing rather than trying to re-invent the 
wheel with an extra lock (or whatever we can come up with).


The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would help.


Roughly our current IRQ handling flow (handle_edge_irq()) looks like:

if ( irq in progress )
{
   set IRQS_PENDING
   return;
}

do
{
   clear IRQS_PENDING
   handle_irq()
} while (IRQS_PENDING is set)

IRQ handling flow like handle_fasteoi_irq() looks like:

if ( irq in progress )
   return;

handle_irq()

The latter flow would catch "spurious" interrupts and ignore them. So it 
would nicely handle the race when changing the event affinity.


Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
issued, thus having the same problem again? And I think we want to keep
the lateeoi behavior in order to be able to control event storms.


Juergen




Re: [PATCH 7/7] xen/evtchn: read producer index only once

2021-02-08 Thread Jürgen Groß

On 08.02.21 12:40, Julien Grall wrote:



On 06/02/2021 10:49, Juergen Gross wrote:

In evtchn_read() use READ_ONCE() for reading the producer index in
order to avoid the compiler generating multiple accesses.

Signed-off-by: Juergen Gross 
---
  drivers/xen/evtchn.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..f6b199b597bf 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -211,7 +211,7 @@ static ssize_t evtchn_read(struct file *file, char __user *buf,
 			goto unlock_out;
 
 		c = u->ring_cons;
-		p = u->ring_prod;
+		p = READ_ONCE(u->ring_prod);
For consistency, don't you also need the write side in 
evtchn_interrupt() to use WRITE_ONCE()?


Only in case I'd consider the compiler needing multiple memory
accesses for doing the update (see my reply to Jan's comment on this
patch).


Juergen




Re: [PATCH 7/7] xen/evtchn: read producer index only once

2021-02-08 Thread Jürgen Groß

On 08.02.21 11:51, Jan Beulich wrote:

On 08.02.2021 11:41, Jürgen Groß wrote:

On 08.02.21 10:48, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

In evtchn_read() use READ_ONCE() for reading the producer index in
order to avoid the compiler generating multiple accesses.

Signed-off-by: Juergen Gross 
---
   drivers/xen/evtchn.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..f6b199b597bf 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -211,7 +211,7 @@ static ssize_t evtchn_read(struct file *file, char __user *buf,
 			goto unlock_out;
 
 		c = u->ring_cons;
-		p = u->ring_prod;
+		p = READ_ONCE(u->ring_prod);
 		if (c != p)
 			break;


Why only here and not also in

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

or in evtchn_poll()? I understand it's not needed when
ring_prod_lock is held, but that's not the case in the two
afaics named places. Plus isn't the same then true for
ring_cons and ring_cons_mutex, i.e. aren't the two named
places plus evtchn_interrupt() also in need of READ_ONCE()
for ring_cons?


The problem solved here is the further processing using "p" multiple
times. p must not be silently replaced with u->ring_prod by the
compiler, so I probably should reword the commit message to say:

... in order to not allow the compiler to refetch p.


I still wouldn't understand the change (and the lack of
further changes) then: The first further use of p is
outside the loop, alongside one of c. IOW why would c
then not need treating the same as p?


Its value wouldn't change, as ring_cons is being modified only at
the bottom of this function, and nowhere else (apart from the reset
case, but this can't run concurrently due to ring_cons_mutex).


I also still don't see the difference between latching a
value into a local variable vs a "freestanding" access -
neither are guaranteed to result in exactly one memory
access afaict.


READ_ONCE() is using a pointer to volatile, so any refetching by
the compiler would be a bug.


And of course there's also our beloved topic of access
tearing here: READ_ONCE() also excludes that (at least as
per its intentions aiui).


Yes, but I don't see an urgent need to fix that, as there would
be thousands of accesses in the kernel needing a fix. A compiler
tearing a naturally aligned access into multiple memory accesses
would be rejected as buggy from the kernel community IMO.


Juergen




Re: [PATCH 7/7] xen/evtchn: read producer index only once

2021-02-08 Thread Jürgen Groß

On 08.02.21 10:48, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

In evtchn_read() use READ_ONCE() for reading the producer index in
order to avoid the compiler generating multiple accesses.

Signed-off-by: Juergen Gross 
---
  drivers/xen/evtchn.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 421382c73d88..f6b199b597bf 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -211,7 +211,7 @@ static ssize_t evtchn_read(struct file *file, char __user *buf,
 			goto unlock_out;
 
 		c = u->ring_cons;
-		p = u->ring_prod;
+		p = READ_ONCE(u->ring_prod);
 		if (c != p)
 			break;


Why only here and not also in

rc = wait_event_interruptible(u->evtchn_wait,
  u->ring_cons != u->ring_prod);

or in evtchn_poll()? I understand it's not needed when
ring_prod_lock is held, but that's not the case in the two
afaics named places. Plus isn't the same then true for
ring_cons and ring_cons_mutex, i.e. aren't the two named
places plus evtchn_interrupt() also in need of READ_ONCE()
for ring_cons?


The problem solved here is the further processing using "p" multiple
times. p must not be silently replaced with u->ring_prod by the
compiler, so I probably should reword the commit message to say:

... in order to not allow the compiler to refetch p.


Juergen




Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Jürgen Groß

On 08.02.21 11:23, Andrew Cooper wrote:

On 08/02/2021 09:50, Jan Beulich wrote:

On 08.02.2021 10:44, Andrew Cooper wrote:

On 06/02/2021 10:49, Juergen Gross wrote:

The ring buffer for user events is used in the local system only, so
smp barriers are fine for ensuring consistency.

Reported-by: Andrew Cooper 
Signed-off-by: Juergen Gross 

These need to be virt_* to not break in UP builds (on non-x86).

Initially I thought so, too, but isn't the sole vCPU of such a
VM getting re-scheduled to a different pCPU in the hypervisor
an implied barrier anyway?


Yes, but that isn't relevant to why UP builds break.

smp_*() degrade to compiler barriers in UP builds, and while that's
mostly fine for x86 read/write, it's not fine for ARM barriers.

virt_*() exist specifically to be smp_*() which don't degrade to broken
in UP builds.


But the barrier is really only necessary to serialize accesses within
the guest against each other. There is no party outside the guest involved.

In case you are right this would mean that UP guests are all broken on
Arm.


Juergen




Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 10:54, Julien Grall wrote:



On 08/02/2021 09:41, Jürgen Groß wrote:

On 08.02.21 10:11, Julien Grall wrote:

Hi Juergen,

On 07/02/2021 12:58, Jürgen Groß wrote:

On 06.02.21 19:46, Julien Grall wrote:

Hi Juergen,

On 06/02/2021 10:49, Juergen Gross wrote:

The first three patches are fixes for XSA-332. They avoid WARN splats
and a performance issue with interdomain events.


Thanks for helping to figure out the problem. Unfortunately, I still 
reliably see the WARN splat with the latest Linux master 
(1e0d27fce010) + your first 3 patches.


I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L 
events ABI.


After some debugging, I think I have an idea of what went wrong. The 
problem happens when the event is initially bound from vCPU0 to a 
different vCPU.


 From the comment in xen_rebind_evtchn_to_cpu(), we are masking the 
event to prevent it being delivered on an unexpected vCPU. However, 
I believe the following can happen:


vCPU0    | vCPU1
 |
 | Call xen_rebind_evtchn_to_cpu()
receive event X    |
 | mask event X
 | bind to vCPU1
    | unmask event X
 |
 | receive event X
 |
 | handle_edge_irq(X)
handle_edge_irq(X)    |  -> handle_irq_event()
 |   -> set IRQD_IN_PROGRESS
  -> set IRQS_PENDING    |
 |   -> evtchn_interrupt()
 |   -> clear IRQD_IN_PROGRESS
 |  -> IRQS_PENDING is set
 |  -> handle_irq_event()
 |   -> evtchn_interrupt()
 | -> WARN()
 |

All the lateeoi handlers expect ONESHOT semantics and 
evtchn_interrupt() doesn't tolerate any deviation.


I think the problem was introduced by 7f874a0447a9 ("xen/events: 
fix lateeoi irq acknowledgment") because the interrupt was disabled 
previously. Therefore we wouldn't do another iteration in 
handle_edge_irq().


I think you picked the wrong commit for blaming, as this is just
the last patch of the three patches you were testing.


I actually found the right commit for blaming but I copied the 
information from the wrong shell :/. The bug was introduced by:


c44b849cee8c ("xen/events: switch user event channels to lateeoi model")



Aside from the handlers, I think it may impact the deferred EOI mitigation 
because in theory a 3rd vCPU could join the party (let's say vCPU 
A migrates the event from vCPU B to vCPU C). So info->{eoi_cpu, 
irq_epoch, eoi_time} could possibly get mangled?


For a fix, we may want to consider holding evtchn_rwlock with 
write permission. Although I am not 100% sure this is going to 
prevent everything.


It will make things worse, as it would violate the locking hierarchy
(xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).


Ah, right.



On a first glance I think we'll need a 3rd masking state ("temporarily
masked") in the second patch in order to avoid a race with lateeoi.

In order to avoid the race you outlined above we need an "event is being
handled" indicator checked via test_and_set() semantics in
handle_irq_for_port() and reset only when calling clear_evtchn().


It feels like we are trying to workaround the IRQ flow we are using 
(i.e. handle_edge_irq()).


I'm not really sure this is the main problem here. According to your
analysis the main problem is occurring when handling the event, not when
handling the IRQ: the event is being received on two vcpus.


I don't think we can easily divide the two because we rely on the IRQ 
framework to handle the lifecycle of the event. So...




Our problem isn't due to the IRQ still being pending, but due to it being
raised again, which should happen for a one-shot IRQ the same way.


... I don't really see how the difference matters here. The idea is to 
re-use what's already existing rather than trying to re-invent the wheel 
with an extra lock (or whatever we can come up with).


The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would help.


Juergen




Re: [PATCH 2/7] xen/events: don't unmask an event channel when an eoi is pending

2021-02-08 Thread Jürgen Groß

On 08.02.21 11:06, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

@@ -1798,6 +1818,29 @@ static void mask_ack_dynirq(struct irq_data *data)
 	ack_dynirq(data);
 }
 
+static void lateeoi_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
+static void lateeoi_mask_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = true;
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
 static int retrigger_dynirq(struct irq_data *data)
 {
 	evtchn_port_t evtchn = evtchn_from_irq(data->irq);
@@ -2023,8 +2066,8 @@ static struct irq_chip xen_lateeoi_chip __read_mostly = {
 	.irq_mask	= disable_dynirq,
 	.irq_unmask	= enable_dynirq,
 
-	.irq_ack		= mask_ack_dynirq,
-	.irq_mask_ack		= mask_ack_dynirq,
+	.irq_ack		= lateeoi_ack_dynirq,
+	.irq_mask_ack		= lateeoi_mask_ack_dynirq,
 
 	.irq_set_affinity	= set_affinity_irq,
 	.irq_retrigger		= retrigger_dynirq,



Unlike the prior handler the two new ones don't call ack_dynirq()
anymore, and the description doesn't give a hint towards this
difference. As a consequence, clear_evtchn() also doesn't get
called anymore - patch 3 adds the calls, but claims an older
commit to have been at fault. _If_ ack_dynirq() indeed isn't to
be called here, shouldn't the clear_evtchn() calls get added
right here?


There was clearly too much time between writing this patch and looking
at its performance impact. :-(

Somehow I managed to overlook that I just introduced the bug here. This
OTOH explains why there are not tons of complaints with the current
implementation. :-)

Will merge patch 3 into this one.


Juergen




Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Jürgen Groß

On 08.02.21 10:38, Jan Beulich wrote:

On 06.02.2021 11:49, Juergen Gross wrote:

The ring buffer for user events is used in the local system only, so
smp barriers are fine for ensuring consistency.

Reported-by: Andrew Cooper 
Signed-off-by: Juergen Gross 


Reviewed-by: Jan Beulich 

Albeit I think "local system" is at least ambiguous (physical
machine? VM?). How about something like "is local to the given
kernel instance"?


Yes.


Juergen





Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-08 Thread Jürgen Groß

On 08.02.21 10:11, Julien Grall wrote:

Hi Juergen,

On 07/02/2021 12:58, Jürgen Groß wrote:

On 06.02.21 19:46, Julien Grall wrote:

Hi Juergen,

On 06/02/2021 10:49, Juergen Gross wrote:

The first three patches are fixes for XSA-332. They avoid WARN splats
and a performance issue with interdomain events.


Thanks for helping to figure out the problem. Unfortunately, I still 
reliably see the WARN splat with the latest Linux master 
(1e0d27fce010) + your first 3 patches.


I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L 
events ABI.


After some debugging, I think I have an idea of what went wrong. The 
problem happens when the event is initially bound from vCPU0 to a 
different vCPU.


 From the comment in xen_rebind_evtchn_to_cpu(), we are masking the 
event to prevent it being delivered on an unexpected vCPU. However, I 
believe the following can happen:


vCPU0    | vCPU1
 |
 | Call xen_rebind_evtchn_to_cpu()
receive event X    |
 | mask event X
 | bind to vCPU1
    | unmask event X
 |
 | receive event X
 |
 | handle_edge_irq(X)
handle_edge_irq(X)    |  -> handle_irq_event()
 |   -> set IRQD_IN_PROGRESS
  -> set IRQS_PENDING    |
 |   -> evtchn_interrupt()
 |   -> clear IRQD_IN_PROGRESS
 |  -> IRQS_PENDING is set
 |  -> handle_irq_event()
 |   -> evtchn_interrupt()
 | -> WARN()
 |

All the lateeoi handlers expect ONESHOT semantics and 
evtchn_interrupt() doesn't tolerate any deviation.


I think the problem was introduced by 7f874a0447a9 ("xen/events: fix 
lateeoi irq acknowledgment") because the interrupt was disabled 
previously. Therefore we wouldn't do another iteration in 
handle_edge_irq().


I think you picked the wrong commit for blaming, as this is just
the last patch of the three patches you were testing.


I actually found the right commit for blaming but I copied the 
information from the wrong shell :/. The bug was introduced by:


c44b849cee8c ("xen/events: switch user event channels to lateeoi model")



Aside from the handlers, I think it may impact the deferred EOI mitigation 
because in theory a 3rd vCPU could join the party (let's say vCPU A 
migrates the event from vCPU B to vCPU C). So info->{eoi_cpu, 
irq_epoch, eoi_time} could possibly get mangled?


For a fix, we may want to consider holding evtchn_rwlock with 
write permission. Although I am not 100% sure this is going to 
prevent everything.


It will make things worse, as it would violate the locking hierarchy
(xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).


Ah, right.



On a first glance I think we'll need a 3rd masking state ("temporarily
masked") in the second patch in order to avoid a race with lateeoi.

In order to avoid the race you outlined above we need an "event is being
handled" indicator checked via test_and_set() semantics in
handle_irq_for_port() and reset only when calling clear_evtchn().


It feels like we are trying to workaround the IRQ flow we are using 
(i.e. handle_edge_irq()).


I'm not really sure this is the main problem here. According to your
analysis the main problem is occurring when handling the event, not when
handling the IRQ: the event is being received on two vcpus.

Our problem isn't due to the IRQ still being pending, but due to it being
raised again, which should happen for a one-shot IRQ the same way.

But maybe I'm misunderstanding your idea.


Juergen




Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

2021-02-07 Thread Jürgen Groß

On 06.02.21 19:46, Julien Grall wrote:

Hi Juergen,

On 06/02/2021 10:49, Juergen Gross wrote:

The first three patches are fixes for XSA-332. They avoid WARN splats
and a performance issue with interdomain events.


Thanks for helping to figure out the problem. Unfortunately, I still 
reliably see the WARN splat with the latest Linux master (1e0d27fce010) + 
your first 3 patches.


I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L 
events ABI.


After some debugging, I think I have an idea of what went wrong. The 
problem happens when the event is initially bound from vCPU0 to a 
different vCPU.


 From the comment in xen_rebind_evtchn_to_cpu(), we are masking the 
event to prevent it being delivered on an unexpected vCPU. However, I 
believe the following can happen:


vCPU0    | vCPU1
     |
     | Call xen_rebind_evtchn_to_cpu()
receive event X    |
     | mask event X
     | bind to vCPU1
    | unmask event X
     |
     | receive event X
     |
     | handle_edge_irq(X)
handle_edge_irq(X)    |  -> handle_irq_event()
     |   -> set IRQD_IN_PROGRESS
  -> set IRQS_PENDING    |
     |   -> evtchn_interrupt()
     |   -> clear IRQD_IN_PROGRESS
     |  -> IRQS_PENDING is set
     |  -> handle_irq_event()
     |   -> evtchn_interrupt()
     | -> WARN()
     |

All the lateeoi handlers expect ONESHOT semantics and 
evtchn_interrupt() doesn't tolerate any deviation.


I think the problem was introduced by 7f874a0447a9 ("xen/events: fix 
lateeoi irq acknowledgment") because the interrupt was disabled 
previously. Therefore we wouldn't do another iteration in 
handle_edge_irq().


I think you picked the wrong commit for blaming, as this is just
the last patch of the three patches you were testing.

Aside from the handlers, I think it may impact the deferred EOI mitigation 
because in theory a 3rd vCPU could join the party (let's say vCPU A 
migrates the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, 
eoi_time} could possibly get mangled?


For a fix, we may want to consider holding evtchn_rwlock with write 
permission. Although I am not 100% sure this is going to prevent 
everything.


It will make things worse, as it would violate the locking hierarchy
(xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).

On a first glance I think we'll need a 3rd masking state ("temporarily
masked") in the second patch in order to avoid a race with lateeoi.

In order to avoid the race you outlined above we need an "event is being
handled" indicator checked via test_and_set() semantics in
handle_irq_for_port() and reset only when calling clear_evtchn().


Does my write-up make sense to you?


Yes. What about my reply? ;-)


Juergen




Re: [PATCH 1/7] xen/events: reset affinity of 2-level event initially

2021-02-06 Thread Jürgen Groß

On 06.02.21 12:20, Julien Grall wrote:

Hi Juergen,

On 06/02/2021 10:49, Juergen Gross wrote:

When creating a new event channel with 2-level events the affinity
needs to be reset initially in order to avoid using an old affinity
from earlier usage of the event channel port.

The same applies to the affinity when onlining a vcpu: all old
affinity settings for this vcpu must be reset. As percpu events get
initialized before the percpu event channel hook is called,
resetting of the affinities happens after offlining a vcpu (this is
working, as initial percpu memory is zeroed out).

Cc: sta...@vger.kernel.org
Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 
---
  drivers/xen/events/events_2l.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c
index da87f3a1e351..23217940144a 100644
--- a/drivers/xen/events/events_2l.c
+++ b/drivers/xen/events/events_2l.c
@@ -47,6 +47,16 @@ static unsigned evtchn_2l_max_channels(void)
 	return EVTCHN_2L_NR_CHANNELS;
 }
 
+static int evtchn_2l_setup(evtchn_port_t evtchn)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));


The bit corresponding to the event channel can only be set on a single 
CPU. Could we avoid the loop and instead clear the bit while closing the 
port?


This would need another callback.


Juergen





Re: xen/evtchn: Interrupt for port 34, but apparently not enabled; per-user 00000000a86a4c1b on 5.10

2021-02-04 Thread Jürgen Groß

On 14.12.20 22:25, Julien Grall wrote:

Hi Juergen,

When testing Linux 5.10 dom0, I could reliably hit the following warning 
with using event 2L ABI:


[  589.591737] Interrupt for port 34, but apparently not enabled; 
per-user a86a4c1b
[  589.593259] WARNING: CPU: 0 PID:  at 
/home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:170 
evtchn_interrupt+0xeb/0x100

[  589.595514] Modules linked in:
[  589.596145] CPU: 0 PID:  Comm: qemu-system-i38 Tainted: G 
W 5.10.0+ #180
[  589.597708] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014

[  589.599782] RIP: e030:evtchn_interrupt+0xeb/0x100
[  589.600698] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 00 
e8 d9 10 ca ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 31 3d 82 e8 65 29 a0 
ff <0f> 0b e9 42 ff ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f

[  589.604087] RSP: e02b:c90040003e70 EFLAGS: 00010086
[  589.605102] RAX:  RBX: 888102091800 RCX: 
0027
[  589.606445] RDX:  RSI: 88817fe19150 RDI: 
88817fe19158
[  589.607790] RBP: 88810f5ab980 R08: 0001 R09: 
00328980
[  589.609134] R10:  R11: c90040003c70 R12: 
888107fd3c00
[  589.610484] R13: c90040003ed4 R14:  R15: 
88810f5ffd80
[  589.611828] FS:  7f960c4b8ac0() GS:88817fe0() 
knlGS:

[  589.613348] CS:  1e030 DS:  ES:  CR0: 80050033
[  589.614525] CR2: 7f17ee72e000 CR3: 00010f5b6000 CR4: 
00050660

[  589.615874] Call Trace:
[  589.616402]  
[  589.616855]  __handle_irq_event_percpu+0x4e/0x2c0
[  589.617784]  handle_irq_event_percpu+0x30/0x80
[  589.618660]  handle_irq_event+0x3a/0x60
[  589.619428]  handle_edge_irq+0x9b/0x1f0
[  589.620209]  generic_handle_irq+0x4f/0x60
[  589.621008]  evtchn_2l_handle_events+0x160/0x280
[  589.621913]  __xen_evtchn_do_upcall+0x66/0xb0
[  589.622767]  __xen_pv_evtchn_do_upcall+0x11/0x20
[  589.623665]  asm_call_irq_on_stack+0x12/0x20
[  589.624511]  
[  589.624978]  xen_pv_evtchn_do_upcall+0x77/0xf0
[  589.625848]  exc_xen_hypervisor_callback+0x8/0x10

This can be reproduced when creating/destroying guest in a loop. 
Although, I have struggled to reproduce it on a vanilla Xen.


After several hours of debugging, I think I have found the root cause.

While we only expect the unmask to happen when the event channel is 
EOIed, there is an unmask happening as part of handle_edge_irq() because 
the interrupt was seen as pending by another vCPU (IRQS_PENDING is set).


It turns out that the event channel is set for multiple vCPUs in 
cpu_evtchn_mask. This is happening because the affinity is not cleared 
when freeing an event channel.


The implementation of evtchn_2l_handle_events() will look for all the 
active interrupts for the current vCPU and later on clear the pending 
bit (via the ack() callback). IOW, I believe, this is not an atomic 
operation.


Even if Xen will notify the event to a single vCPU, evtchn_pending_sel 
may still be set on the other vCPU (thanks to a different event 
channel). Therefore, there is a chance that two vCPUs will try to handle 
the same interrupt.


The IRQ handler handle_edge_irq() is able to deal with that and will 
mask/unmask the interrupt. This will mess up the lateeoi logic 
(although, I managed to reproduce it once without XSA-332).


My initial idea to fix the problem was to switch the affinity from CPU X 
to CPU0 when the event channel is freed.


However, I am not sure this is enough because I haven't found anything 
yet preventing a race between evtchn_2l_handle_events() and 
evtchn_2l_bind_vcpu().


So maybe we want to introduce refcounting (if there is nothing 
provided by the IRQ framework) and only unmask when the counter drops to 0.


Any opinions?


With the two attached patches testing on my side survived more than 2
hours of constant guest reboots and destroy/create loops. Without the
patches the WARN()s came up after less than one minute.

Can you please give it a try?


Juergen

From 908940c92fb916146a7ce24bc41a18125967c54a Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Wed, 3 Feb 2021 16:24:42 +0100
Subject: [PATCH 1/2] xen/events: reset affinity of 2-level event initially

When creating a new event channel with 2-level events the affinity
needs to be reset initially in order to avoid using an old affinity
from earlier usage of the event channel port.

The same applies to the affinity when onlining a vcpu: all old
affinity settings for this vcpu must be reset. As percpu events get
initialized before the percpu event channel hook is called,
resetting of the affinities happens after offlining a vcpu (this is
working, as initial percpu memory is zeroed out).

Cc: sta...@vger.kernel.org
Reported-by: Julien Grall 
Signed-off-by: Juergen Gross 
---
 drivers/xen/events/events_2l.c | 20 
 

Re: [PATCH] drivers: net: xen-netfront: Simplify the calculation of variables

2021-02-04 Thread Jürgen Groß

On 02.02.21 11:17, Jiapeng Chong wrote:

Fix the following coccicheck warnings:

./drivers/net/xen-netfront.c:1816:52-54: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] xen/netback: avoid race in xenvif_rx_ring_slots_available()

2021-02-03 Thread Jürgen Groß

On 04.02.21 00:48, Jakub Kicinski wrote:

On Tue,  2 Feb 2021 08:09:38 +0100 Juergen Gross wrote:

Since commit 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
xenvif_rx_ring_slots_available() is no longer called only from the rx
queue kernel thread, so it needs to access the rx queue with the
associated queue lock held.

Reported-by: Igor Druzhinin 
Fixes: 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
Cc: sta...@vger.kernel.org
Signed-off-by: Juergen Gross 


Should we route this change via networking trees? I see the bug did not
go through networking :)



I'm fine with either networking or the Xen tree. It should be included
in 5.11, though. So if you are willing to take it, please do so.


Juergen




Re: [PATCH] xen/netback: avoid race in xenvif_rx_ring_slots_available()

2021-02-02 Thread Jürgen Groß

On 02.02.21 16:26, Igor Druzhinin wrote:

On 02/02/2021 07:09, Juergen Gross wrote:

Since commit 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
xenvif_rx_ring_slots_available() is no longer called only from the rx
queue kernel thread, so it needs to access the rx queue with the
associated queue lock held.

Reported-by: Igor Druzhinin 
Fixes: 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
Cc: sta...@vger.kernel.org
Signed-off-by: Juergen Gross 


Appreciate a quick fix! Is this the only place that sort of race could
happen now?


I checked and didn't find any other similar problem.


Juergen





Re: [PATCH v2] xen-blkback: fix compatibility bug with single page rings

2021-01-28 Thread Jürgen Groß

On 29.01.21 07:20, Dongli Zhang wrote:



On 1/28/21 5:04 AM, Paul Durrant wrote:

From: Paul Durrant 

Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
behaviour of xen-blkback when connecting to a frontend was:

- read 'ring-page-order'
- if not present then expect a single page ring specified by 'ring-ref'
- else expect a ring specified by 'ring-refX' where X is between 0 and
   1 << ring-page-order

This was correct behaviour, but was broken by the aforementioned commit to
become:

- read 'ring-page-order'
- if not present then expect a single page ring (i.e. ring-page-order = 0)
- expect a ring specified by 'ring-refX' where X is between 0 and
   1 << ring-page-order
- if that didn't work then see if there's a single page ring specified by
   'ring-ref'

This incorrect behaviour works most of the time but fails when a frontend
that sets 'ring-page-order' is unloaded and replaced by one that does not
because, instead of reading 'ring-ref', xen-blkback will read the stale
'ring-ref0' left around by the previous frontend and will try to map the
wrong grant reference.

This patch restores the original behaviour.

Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront")
Signed-off-by: Paul Durrant 
---
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: Dongli Zhang 

v2:
  - Remove now-spurious error path special-case when nr_grefs == 1
---
  drivers/block/xen-blkback/common.h |  1 +
  drivers/block/xen-blkback/xenbus.c | 38 +-
  2 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index b0c71d3a81a0..524a79f10de6 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -313,6 +313,7 @@ struct xen_blkif {
  
  	struct work_struct	free_work;

unsigned intnr_ring_pages;
+   boolmulti_ref;


Is it really necessary to introduce 'multi_ref' here, or can we just
re-use 'nr_ring_pages'?

According to blkfront code, 'ring-page-order' is set only when it is not zero,
that is, only when (info->nr_ring_pages > 1).


Did you look into all other OS's (Windows, OpenBSD, FreeBSD, NetBSD,
Solaris, Netware, other proprietary systems) implementations to verify
that claim?

I don't think so. So better safe than sorry.


Juergen




Re: Problems starting Xen domU after latest stable update

2021-01-28 Thread Jürgen Groß

On 29.01.21 01:51, Marek Marczykowski-Górecki wrote:

On Thu, Jan 28, 2021 at 07:03:00PM -0500, Boris Ostrovsky wrote:


On 1/28/21 6:52 PM, Michael D Labriola wrote:

Hey, everyone.  I've run into problems starting up my Xen domUs as of
the latest batch of stable updates.  Whenever I try to create one, I
get a bunch of block device errors like this:

libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to add device with path /local/domain/0/backend/vbd/4/51712
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to add device with path /local/domain/0/backend/vbd/4/51728
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to add device with path /local/domain/0/backend/vbd/4/51744
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to add device with path /local/domain/0/backend/vbd/4/51760
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to add device with path /local/domain/0/backend/vbd/4/51776
libxl: error: libxl_create.c:1452:domcreate_launch_dm: Domain 4:unable to add disk devices
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to remove device with path /local/domain/0/backend/vbd/4/51712
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to remove device with path /local/domain/0/backend/vbd/4/51728
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to remove device with path /local/domain/0/backend/vbd/4/51744
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to remove device with path /local/domain/0/backend/vbd/4/51760
libxl: error: libxl_device.c:1105:device_backend_callback: Domain 4:unable to remove device with path /local/domain/0/backend/vbd/4/51776
libxl: error: libxl_domain.c:1290:devices_destroy_cb: Domain 4:libxl__devices_destroy failed
libxl: error: libxl_domain.c:1177:libxl__destroy_domid: Domain 4:Non-existant domain
libxl: error: libxl_domain.c:1131:domain_destroy_callback: Domain 4:Unable to destroy guest
libxl: error: libxl_domain.c:1058:domain_destroy_cb: Domain 4:Destruction of domain failed

I'm using Xen 4.13.1 on the box I've been testing with.

I bisected down to this commit, and reverting it does indeed fix my
problem.  Well, this commit upstream and it's cherry-picked variants
on linux-5.4.y and linux-5.10.y.



You most likely need 5f46400f7a6a4fad635d5a79e2aa5a04a30ffea1. It hit Linus 
tree a few hours ago.


I can confirm this fixes the same issue for me (too?), thanks!
Shouldn't this patch have Cc: stable?


No, I don't think so.

The issue being fixed by the patch has been introduced only in 5.11
and the fixing patch references the buggy patch via a Fixes: tag.

If the buggy patch has been put into stable this Fixes: tag should
result in the fix being put into the same stable branches as well.


Juergen





Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Jürgen Groß

On 27.01.21 12:23, Andrew Cooper wrote:

On 27/01/2021 10:26, Jürgen Groß wrote:

On 27.01.21 10:43, Andrew Cooper wrote:

On 27/01/2021 09:38, Juergen Gross wrote:

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4409306364dc..ca5ac10fcbf7 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
   exc_debug(regs);
   }
   +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
+{
+    /* This should never happen and there is no way to handle it. */
+    panic("Unknown trap in Xen PV mode.");


Looks much better.  How about including regs->entry_vector here, just to
short circuit the inevitable swearing which will accompany encountering
this panic() ?


You are aware the regs parameter is struct pt_regs *, not the Xen
struct cpu_user_regs *?


Yes, but I was assuming that they both contained the same information.



So I have no idea how I should get this information without creating
a per-vector handler.


Oh - that's dull.

Fine then.  Reviewed-by: Andrew Cooper 



I think I'll switch the panic() to printk(); BUG(); in order to have
more diagnostic data. Can I keep your R-b: with that modification?


Juergen




Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Jürgen Groß

On 27.01.21 10:43, Andrew Cooper wrote:

On 27/01/2021 09:38, Juergen Gross wrote:

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4409306364dc..ca5ac10fcbf7 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
exc_debug(regs);
  }
  
+DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)

+{
+   /* This should never happen and there is no way to handle it. */
+   panic("Unknown trap in Xen PV mode.");


Looks much better.  How about including regs->entry_vector here, just to
short circuit the inevitable swearing which will accompany encountering
this panic() ?


You are aware the regs parameter is struct pt_regs *, not the Xen
struct cpu_user_regs *?

So I have no idea how I should get this information without creating
a per-vector handler.


Juergen




Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-27 Thread Jürgen Groß

On 19.01.21 11:57, Roger Pau Monne wrote:

This is in line with the specification described in blkif.h:

  * discard-granularity: should be set to the physical block size if
node is not present.
  * discard-alignment, discard-secure: should be set to 0 if node not
present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 


Committed to xen/tip.git for-linus-5.11


Juergen




Re: [PATCH] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-26 Thread Jürgen Groß

On 25.01.21 18:26, Andrew Cooper wrote:

On 25/01/2021 14:00, Juergen Gross wrote:

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4409306364dc..82948251f57b 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -583,6 +583,14 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
exc_debug(regs);
  }
  
+#ifdef CONFIG_AMD_MEM_ENCRYPT

+DEFINE_IDTENTRY_RAW(xenpv_exc_vmm_communication)
+{
+   /* This should never happen and there is no way to handle it. */
+   panic("X86_TRAP_VC in Xen PV mode.");


Honestly, exactly the same is true of #VE, #HV and #SX.

What we do in the hypervisor is wire up one handler for all unknown
exceptions (to avoid potential future #DF issues) leading to a panic.
Wouldn't it be better to do this unconditionally, especially as #GP/#NP
doesn't work for PV guests for unregistered callbacks, rather than
fixing up piecewise like this?


I agree it would be better to have a "catch all unknown" handler.

I'll have a try how this would look like.


Juergen





Re: [PATCH v2 14/20] x86/xen/pvh: Annotate indirect branch as safe

2021-01-21 Thread Jürgen Groß

On 21.01.21 22:29, Josh Poimboeuf wrote:

This indirect jump is harmless; annotate it to keep objtool's retpoline
validation happy.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Signed-off-by: Josh Poimboeuf 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v3 07/15] x86/alternative: support "not feature" and ALTERNATIVE_TERNARY

2021-01-19 Thread Jürgen Groß

On 19.01.21 13:06, Borislav Petkov wrote:

On Tue, Jan 19, 2021 at 12:35:42PM +0100, Jürgen Groß wrote:

In fact this should rather be named "X86_FEATURE_TRUE", as this is its
semantics.

And I think I can define it to the value 0x instead of using a
"real" bit for it.


A real bit is cheap - a special value to pay attention to in the future
not so much. Also we do have X86_FEATURE_ALWAYS already which has a
similar purpose...



Oh, well hidden. :-)

I'll use that one.


Juergen




Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()

2021-01-19 Thread Jürgen Groß

On 17.12.20 18:31, Michael Kelley wrote:

From: Juergen Gross  Sent: Thursday, December 17, 2020 1:31 AM


The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires to split out the time interfaces
into a new header file.

Signed-off-by: Juergen Gross 
---
  arch/x86/Kconfig  |  1 +
  arch/x86/include/asm/mshyperv.h   | 11 
  arch/x86/include/asm/paravirt.h   | 14 --
  arch/x86/include/asm/paravirt_time.h  | 38 +++
  arch/x86/include/asm/paravirt_types.h |  6 -
  arch/x86/kernel/cpu/vmware.c  |  5 ++--
  arch/x86/kernel/kvm.c |  3 ++-
  arch/x86/kernel/kvmclock.c|  3 ++-
  arch/x86/kernel/paravirt.c| 16 ---
  arch/x86/kernel/tsc.c |  3 ++-
  arch/x86/xen/time.c   | 12 -
  drivers/clocksource/hyperv_timer.c|  5 ++--
  drivers/xen/time.c|  3 ++-
  kernel/sched/sched.h  |  1 +
  14 files changed, 71 insertions(+), 50 deletions(-)
  create mode 100644 arch/x86/include/asm/paravirt_time.h



[snip]
  

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..45942d420626 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -56,17 +56,6 @@ typedef int (*hyperv_fill_flush_list_func)(
  #define hv_get_raw_timer() rdtsc_ordered()
  #define hv_get_vector() HYPERVISOR_CALLBACK_VECTOR

-/*
- * Reference to pv_ops must be inline so objtool
- * detection of noinstr violations can work correctly.
- */
-static __always_inline void hv_setup_sched_clock(void *sched_clock)
-{
-#ifdef CONFIG_PARAVIRT
-   pv_ops.time.sched_clock = sched_clock;
-#endif
-}
-
  void hyperv_vector_handler(struct pt_regs *regs);

  static inline void hv_enable_stimer0_percpu_irq(int irq) {}


[snip]


diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index ba04cb381cd3..1ed79993fc50 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -21,6 +21,7 @@
  #include 
  #include 
  #include 
+#include 

  static struct clock_event_device __percpu *hv_clock_event;
  static u64 hv_sched_clock_offset __ro_after_init;
@@ -445,7 +446,7 @@ static bool __init hv_init_tsc_clocksource(void)
	clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);

hv_sched_clock_offset = hv_read_reference_counter();
-   hv_setup_sched_clock(read_hv_sched_clock_tsc);
+   paravirt_set_sched_clock(read_hv_sched_clock_tsc);

return true;
  }
@@ -470,6 +471,6 @@ void __init hv_init_clocksource(void)
	clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);

hv_sched_clock_offset = hv_read_reference_counter();
-   hv_setup_sched_clock(read_hv_sched_clock_msr);
+   static_call_update(pv_sched_clock, read_hv_sched_clock_msr);
  }
  EXPORT_SYMBOL_GPL(hv_init_clocksource);


These Hyper-V changes are problematic as we want to keep hyperv_timer.c
architecture independent.  While only the code for x86/x64 is currently
accepted upstream, code for ARM64 support is in progress.   So we need
to use hv_setup_sched_clock() in hyperv_timer.c, and have the per-arch
implementation in mshyperv.h.


Okay, will switch back to old setup.


Juergen




Re: [PATCH v3 07/15] x86/alternative: support "not feature" and ALTERNATIVE_TERNARY

2021-01-19 Thread Jürgen Groß

On 07.01.21 20:08, Borislav Petkov wrote:

On Thu, Dec 17, 2020 at 10:31:25AM +0100, Juergen Gross wrote:

Instead of only supporting to modify instructions when a specific
feature is set, support doing so for the case a feature is not set.

As today a feature is specified using a 16 bit quantity and the highest
feature number in use is around 600, using a negated feature number for
specifying the inverted case seems to be appropriate.

   ALTERNATIVE "default_instr", "patched_instr", ~FEATURE_NR

will start with "default_instr" and patch that with "patched_instr" in
case FEATURE_NR is not set.

Using that add ALTERNATIVE_TERNARY:

   ALTERNATIVE_TERNARY "default_instr", FEATURE_NR,
   "feature_on_instr", "feature_off_instr"

which will start with "default_instr" and at patch time will, depending
on FEATURE_NR being set or not, patch that with either
"feature_on_instr" or "feature_off_instr".


How about an even simpler one (only build-tested):

---
diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 464034db299f..d52b423d3cab 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -109,6 +109,9 @@
.popsection
  .endm
  
+#define ALTERNATIVE_TERNARY(oldinstr, feature, newinstr1, newinstr2)	\

+	ALTERNATIVE_2 oldinstr, newinstr1, feature, newinstr2, X86_FEATURE_TERNARY
+
  #endif  /*  __ASSEMBLY__  */
  
  #endif /* _ASM_X86_ALTERNATIVE_ASM_H */

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 13adca37c99a..f170cbe89539 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -175,6 +175,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2)\
".popsection\n"
  
+#define ALTERNATIVE_TERNARY(oldinstr, feature, newinstr1, newinstr2)	\

+	ALTERNATIVE_2(oldinstr, newinstr1, feature, newinstr2, X86_FEATURE_TERNARY)
+
  #define ALTERNATIVE_3(oldinsn, newinsn1, feat1, newinsn2, feat2, newinsn3, 
feat3) \
OLDINSTR_3(oldinsn, 1, 2, 3)
\
".pushsection .altinstructions,\"a\"\n" 
\
@@ -206,6 +209,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
  #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
	asm_inline volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
  
+#define alternative_ternary(oldinstr, feature, newinstr1, newinstr2)	\

+	asm_inline volatile(ALTERNATIVE_TERNARY(oldinstr, feature, newinstr1, newinstr2) ::: "memory")
+
  /*
   * Alternative inline assembly with input.
   *
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..cc634db0b91f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -108,7 +108,7 @@
 #define X86_FEATURE_EXTD_APICID	( 3*32+26) /* Extended APICID (8 bits) */
 #define X86_FEATURE_AMD_DCM		( 3*32+27) /* AMD multi-node processor */
 #define X86_FEATURE_APERFMPERF		( 3*32+28) /* P-State hardware coordination feedback capability (APERF/MPERF MSRs) */
-/* free					( 3*32+29) */
+#define X86_FEATURE_TERNARY		( 3*32+29) /* "" Synthetic bit for ALTERNATIVE_TERNARY() */


In fact this should rather be named "X86_FEATURE_TRUE", as this is its
semantics.

And I think I can define it to the value 0x instead of using a
"real" bit for it.


Juergen




Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()

2021-01-19 Thread Jürgen Groß

On 06.01.21 11:03, Borislav Petkov wrote:

On Thu, Dec 17, 2020 at 10:31:24AM +0100, Juergen Gross wrote:

The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires to split out the time interfaces
into a new header file.


I guess you can add Peter's patch to your set, no?

https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org


With a slight modification, yes. :-)


Juergen




Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-19 Thread Jürgen Groß

On 19.01.21 11:57, Roger Pau Monne wrote:

This is in line with the specification described in blkif.h:

  * discard-granularity: should be set to the physical block size if
node is not present.
  * discard-alignment, discard-secure: should be set to 0 if node not
present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 


Reviewed-by: Juergen Gross 


Juergen




Re: [RFC PATCH 3/8] static_call: Pull some static_call declarations to the type headers

2021-01-19 Thread Jürgen Groß

On 18.01.21 15:12, Frederic Weisbecker wrote:

From: Peter Zijlstra 

Some static call declarations are going to be needed on low level header
files. Move the necessary material to the dedicated static call types
header to avoid inclusion dependency hell.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Thomas Gleixner 
Cc: Mel Gorman 
Cc: Ingo Molnar 
Cc: Michal Hocko 
Cc: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 


Could the patches please be reordered to have this one first in the
series? I could make use of it in my paravirt simplification series.

I'll include this patch (in slightly modified form. i.e. without the
__static_call_return0() parts) in my series in order to make it
build.

My tests suggest you should also update
tools/include/linux/static_call_types.h


Juergen




Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-19 Thread Jürgen Groß

On 19.01.21 11:06, Roger Pau Monné wrote:

On Tue, Jan 19, 2021 at 08:43:01AM +0100, Jürgen Groß wrote:

On 18.01.21 16:15, Roger Pau Monne wrote:

Don't require the discard-alignment xenstore node to be present in
order to correctly setup the feature. This can happen with versions of
QEMU that only write the discard-granularity but not the
discard-alignment node.

Assume discard-alignment is 0 if not present. While there also fix the
logic to not enable the discard feature if discard-granularity is not
present.

Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
   drivers/block/xen-blkfront.c | 25 +
   1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..5a93f7cc2939 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info *info)
   static void blkfront_setup_discard(struct blkfront_info *info)
   {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
+   unsigned int discard_granularity = 0;
+   unsigned int discard_alignment = 0;
+   unsigned int discard_secure = 0;
-   info->feature_discard = 1;
-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+   xenbus_gather(XBT_NIL, info->xbdev->otherend,
"discard-granularity", "%u", _granularity,
"discard-alignment", "%u", _alignment,
+   "discard-secure", "%u", _secure,
NULL);


This would mean that "discard-secure" will be evaluated only if the
other two items are set in Xenstore. From blkif.h I can't see this is
required, and your patch is modifying today's behavior in this regard.

You might want to have three xenbus_read_unsigned() calls instead.


You are right, discard-secure should be fetched regardless of whether
discard-alignment exists.

I can fetch discard-granularity and discard-alignment using
xenbus_gather and keep discard-secure using xenbus_read_unsigned. Let
me send a new version.


I'm still not convinced this is correct. blkif.h doesn't mention that
discard-alignment will be valid only with discard-granularity being
present.


Juergen




Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-18 Thread Jürgen Groß

On 18.01.21 16:15, Roger Pau Monne wrote:

Don't require the discard-alignment xenstore node to be present in
order to correctly setup the feature. This can happen with versions of
QEMU that only write the discard-granularity but not the
discard-alignment node.

Assume discard-alignment is 0 if not present. While there also fix the
logic to not enable the discard feature if discard-granularity is not
present.

Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
  drivers/block/xen-blkfront.c | 25 +
  1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..5a93f7cc2939 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info *info)
  
  static void blkfront_setup_discard(struct blkfront_info *info)

  {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
+   unsigned int discard_granularity = 0;
+   unsigned int discard_alignment = 0;
+   unsigned int discard_secure = 0;
  
-	info->feature_discard = 1;

-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+   xenbus_gather(XBT_NIL, info->xbdev->otherend,
"discard-granularity", "%u", _granularity,
"discard-alignment", "%u", _alignment,
+   "discard-secure", "%u", _secure,
NULL);


This would mean that "discard-secure" will be evaluated only if the
other two items are set in Xenstore. From blkif.h I can't see this is
required, and your patch is modifying today's behavior in this regard.

You might want to have three xenbus_read_unsigned() calls instead.


Juergen




Re: [PATCH -next] x86/xen: fix 'nopvspin' build error

2021-01-17 Thread Jürgen Groß

On 15.01.21 20:11, Randy Dunlap wrote:

Fix build error in x86/xen/ when PARAVIRT_SPINLOCKS is not enabled.

Fixes this build error:

../arch/x86/xen/smp_hvm.c: In function ‘xen_hvm_smp_init’:
../arch/x86/xen/smp_hvm.c:77:3: error: ‘nopvspin’ undeclared (first use in this function)
nopvspin = true;

Fixes: 3d7746bea925 ("x86/xen: Fix xen_hvm_smp_init() when vector callback not available")
Signed-off-by: Randy Dunlap 
Cc: David Woodhouse 
Cc: Juergen Gross 


Reviewed-by: Juergen Gross 


Juergen





Re: [PATCH 15/21] x86/xen/pvh: Convert indirect jump to retpoline

2021-01-14 Thread Jürgen Groß

On 14.01.21 20:40, Josh Poimboeuf wrote:

It's kernel policy to not have (unannotated) indirect jumps because of
Spectre v2.  This one's probably harmless, but better safe than sorry.
Convert it to a retpoline.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Signed-off-by: Josh Poimboeuf 
---
  arch/x86/platform/pvh/head.S | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index 43b4d864817e..d87cebd08d32 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -16,6 +16,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  	__HEAD

@@ -105,7 +106,7 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
/* startup_64 expects boot_params in %rsi. */
mov $_pa(pvh_bootparams), %rsi
mov $_pa(startup_64), %rax
-   jmp *%rax
+   JMP_NOSPEC rax


I'd rather have it annotated only.

Using ALTERNATIVE in very early boot code is just adding needless
clutter, as the retpoline variant won't ever be active.


Juergen




Re: [PATCH 14/21] x86/xen: Support objtool vmlinux.o validation in xen-head.S

2021-01-14 Thread Jürgen Groß

On 14.01.21 20:40, Josh Poimboeuf wrote:

The Xen hypercall page is filled with zeros, causing objtool to fall
through all the empty hypercall functions until it reaches a real
function, resulting in a stack state mismatch.

The build-time contents of the hypercall page don't matter, since it
gets mapped to the hypervisor.


This sentence is technically wrong: the contents don't matter, as the
page will be rewritten by the hypervisor.


Juergen





Re: [PATCH v2] xen/privcmd: allow fetching resource sizes

2021-01-13 Thread Jürgen Groß

On 12.01.21 12:53, Roger Pau Monne wrote:

Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

This is missing from the initial implementation, and causes issues
when mapping resources that don't have fixed or known sizes.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org # >= 4.18


Pushed to xen/tip.git for-linus-5.11


Juergen




Re: [PATCH v2] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Jürgen Groß

On 12.01.21 12:53, Roger Pau Monne wrote:

Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

This is missing from the initial implementation, and causes issues
when mapping resources that don't have fixed or known sizes.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org # >= 4.18


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Jürgen Groß

On 12.01.21 11:03, Roger Pau Monné wrote:

On Tue, Jan 12, 2021 at 06:57:30AM +0100, Jürgen Groß wrote:

On 11.01.21 16:29, Roger Pau Monne wrote:

Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

Fixes: 3ad0876554caf ('xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE')


I don't think this addition is a reason to add a "Fixes:" tag. This is
clearly new functionality.


It could be argued that not allowing the resource size to be queried was
a shortcoming of the original implementation, but a backport request to
stable would be more appropriate than a Fixes tag, I think. I will drop
it in the next version and add a backport request if you agree.


Yes, please.


Juergen




Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Jürgen Groß

On 12.01.21 06:50, Jürgen Groß wrote:

On 11.01.21 23:39, Andrew Cooper wrote:

On 11/01/2021 22:09, boris.ostrov...@oracle.com wrote:

On 1/11/21 10:29 AM, Roger Pau Monne wrote:

+    xdata.domid = kdata.dom;
+    xdata.type = kdata.type;
+    xdata.id = kdata.id;
+
+    if (!kdata.addr && !kdata.num) {


I think we should not allow only one of them to be zero. If it's only
kdata.num then we will end up with the pfns array set to ZERO_SIZE_PTR
(which is 0x10). We seem to be OK in that we are not dereferencing pfns
(either in the kernel or in the hypervisor) if the number of frames is
zero, but IMO we shouldn't be tempting fate.



(And if it's only kdata.addr then we will get a vma but I am not sure 
it will do what we want.)


Passing addr == 0 without num being 0 is already an error in Xen, and
passing num == 0 without addr being 0 is bogus and will be an error by
the time I'm finished fixing this.

FWIW, the common usecase for non-trivial examples will be:

xenforeignmem_resource_size(domid, type, id, &size);
xenforeignmem_map_resource(domid, type, id, NULL, size, ...);

which translates into:

ioctl(MAP_RESOURCE, NULL, 0) => size
mmap(NULL, size, ...) => ptr
ioctl(MAP_RESOURCE, ptr, size)

from the kernel's point of view, and two hypercalls from Xen's point of
view.  The NULLs above are expected to be the common case for letting
the kernel choose the vma, but ought to be filled in by the time the
second ioctl() occurs.

See
https://lore.kernel.org/xen-devel/20200922182444.12350-1-andrew.coop...@citrix.com/T/#u 


for all the gory details.


I don't think the kernel should rely on the hypervisor to return
an error in case addr != 0 and num == 0.

The driver should return -EINVAL in that case IMO.


And additionally I think the kernel should check num to be not too
large (in the interface it is u64, while intermediate values are
stored in unsigned int), limiting it to something below INT_MAX
seems to be sensible.
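The two checks being asked for here can be sketched as a standalone helper. This is an illustrative sketch for the discussion, not the actual privcmd driver code; the function name and the `XEN_EINVAL` stand-in are made up:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define XEN_EINVAL 22	/* stand-in for the kernel's -EINVAL */

/*
 * Illustrative sketch of the validation discussed above: addr and num
 * must either both be zero (a pure size query) or both be non-zero (a
 * real mapping request), and num is capped below INT_MAX because
 * intermediate values are stored in unsigned int.
 */
static int validate_mmap_resource(uint64_t addr, uint64_t num)
{
	if ((addr == 0) != (num == 0))	/* exactly one of them zero: bogus */
		return -XEN_EINVAL;
	if (num > INT_MAX)		/* would truncate in unsigned int */
		return -XEN_EINVAL;
	return 0;
}
```

With this shape, the driver rejects the bogus combinations itself rather than relying on the hypervisor to do so.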


Juergen




Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Jürgen Groß

On 11.01.21 16:29, Roger Pau Monne wrote:

Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

Fixes: 3ad0876554caf ('xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE')


I don't think this addition is a reason to add a "Fixes:" tag. This is
clearly new functionality.


Juergen




Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Jürgen Groß

On 11.01.21 23:39, Andrew Cooper wrote:

On 11/01/2021 22:09, boris.ostrov...@oracle.com wrote:

On 1/11/21 10:29 AM, Roger Pau Monne wrote:
  
+	xdata.domid = kdata.dom;

+   xdata.type = kdata.type;
+   xdata.id = kdata.id;
+
+   if (!kdata.addr && !kdata.num) {


I think we should not allow only one of them to be zero. If it's only
kdata.num then we will end up with the pfns array set to ZERO_SIZE_PTR
(which is 0x10). We seem to be OK in that we are not dereferencing pfns
(either in the kernel or in the hypervisor) if the number of frames is
zero, but IMO we shouldn't be tempting fate.


(And if it's only kdata.addr then we will get a vma but I am not sure it will 
do what we want.)


Passing addr == 0 without num being 0 is already an error in Xen, and
passing num == 0 without addr being 0 is bogus and will be an error by
the time I'm finished fixing this.

FWIW, the common usecase for non-trivial examples will be:

xenforeignmem_resource_size(domid, type, id, &size);
xenforeignmem_map_resource(domid, type, id, NULL, size, ...);

which translates into:

ioctl(MAP_RESOURCE, NULL, 0) => size
mmap(NULL, size, ...) => ptr
ioctl(MAP_RESOURCE, ptr, size)

from the kernel's point of view, and two hypercalls from Xen's point of
view.  The NULLs above are expected to be the common case for letting
the kernel choose the vma, but ought to be filled in by the time the
second ioctl() occurs.

See
https://lore.kernel.org/xen-devel/20200922182444.12350-1-andrew.coop...@citrix.com/T/#u
for all the gory details.


I don't think the kernel should rely on the hypervisor to return
an error in case addr != 0 and num == 0.

The driver should return -EINVAL in that case IMO.


Juergen




Re: [PATCH] xen: Kconfig: remove X86_64 depends from XEN_512GB

2020-12-18 Thread Jürgen Groß

On 16.12.20 15:08, Jason Andryuk wrote:

commit bfda93aee0ec ("xen: Kconfig: nest Xen guest options")
accidentally re-added X86_64 as a dependency to XEN_512GB.  It was
originally removed in commit a13f2ef168cb ("x86/xen: remove 32-bit Xen
PV guest support").  Remove it again.

Fixes: bfda93aee0ec ("xen: Kconfig: nest Xen guest options")
Reported-by: Boris Ostrovsky 
Signed-off-by: Jason Andryuk 


Applied to xen/tip.git for-linus-5.11


Juergen




Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-17 Thread Jürgen Groß

On 17.12.20 19:25, Andrew Cooper wrote:

On 16/12/2020 08:21, Jürgen Groß wrote:

On 15.12.20 21:59, Andrew Cooper wrote:

On 15/12/2020 11:10, Juergen Gross wrote:

In case a process waits for any Xenstore action in the xenbus driver
it should be interruptible by signals.

Signed-off-by: Juergen Gross 
---
V2:
- don't special case SIGKILL as libxenstore is handling -EINTR fine
---
   drivers/xen/xenbus/xenbus_xs.c | 9 ++++++++-
   1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c
b/drivers/xen/xenbus/xenbus_xs.c
index 3a06eb699f33..17c8f8a155fd 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
     static void *read_reply(struct xb_req_data *req)
   {
+    int ret;
+
   do {
-    wait_event(req->wq, test_reply(req));
+    ret = wait_event_interruptible(req->wq, test_reply(req));
+
+    if (ret == -ERESTARTSYS && signal_pending(current)) {
+    req->msg.type = XS_ERROR;
+    return ERR_PTR(-EINTR);
+    }


So now I can talk fully about the situations which lead to this, I think
there is a bit more complexity.

It turns out there are a number of issues related to running a Xen
system with no xenstored.

1) If a xenstore-write occurs during startup before init-xenstore-domain
runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
tell the dom0 kernel that xenstored is in dom1.  This effectively
deadlocks the system.


This should be easy to solve: any request to /dev/xen/xenbus should
block upfront in case xenstored isn't up yet (could e.g. wait
interruptible until xenstored_ready is non-zero).


I'm not sure that that would fix the problem.  The problem is that
setting the ring details via /dev/xen/xenbus_backend blocks, which
prevents us launching the xenstored stubdomain, which prevents the
earlier xenbus write being completed.


But _why_ is it blocking? Digging through the code I think it blocks
is xs_suspend() due to the normal xenstore request being pending. If
that request doesn't reach the state to cause blocking in xs_suspend()
all is fine.


So long as /dev/xen/xenbus_backend doesn't block, there's no problem
with other /dev/xen/xenbus activity being pending briefly.


Looking at the current logic, I'm not completely convinced.  Even
finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
ready.


No, but the deadlock is not going to happen anymore (famous last
words).



There are 3 possible cases.

1) PV guest, and details in start_info
2) HVM guest, and details in HVM_PARAMs
3) No details (expected for dom0).  Something in userspace must provide
details at a later point.

So the setup phases go from nothing, to having ring details, to finding
the ring working.

I think it would be prudent to try reading a key between having details
and declaring the xenstored_ready.  Any activity, even XS_ERROR,
indicates that the other end of the ring is listening.


Yes. But I really think the xs_suspend() is the problematic case. And
this will be called _before_ xenstored_ready is being set.






2) If xenstore-watch is running when xenstored dies, it spins at 100%
cpu usage making no system calls at all.  This is caused by bad error
handling from xs_watch(), and attempting to debug found:


Can you expand on "bad error handling from xs_watch()", please?


do_watch() has

     for ( ... ) { // defaults to an infinite loop
         vec = xs_read_watch();
         if (vec == NULL)
             continue;
         ...
     }


My next plan was to experiment with break instead of continue, which
I'll get to at some point.


I'd rather put a sleep() in. Otherwise you might break some use cases.
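A minimal userspace sketch of that suggestion — back off on a NULL read instead of spinning with `continue`. Here `xs_read_watch_stub()` and `fake_sleep()` are purely hypothetical stand-ins for libxenstore's xs_read_watch() and a real sleep(), so the loop can be exercised without a running xenstored:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int attempts;
static int sleeps;

/* Stand-in for xs_read_watch(): fail twice, then succeed. */
static const char *xs_read_watch_stub(void)
{
	return (++attempts < 3) ? NULL : "token";
}

static void fake_sleep(unsigned int secs)
{
	(void)secs;
	sleeps++;		/* count back-offs instead of really sleeping */
}

/* Sketch of the do_watch() loop: back off rather than busy-loop. */
const char *watch_loop(void)
{
	const char *vec;

	for (;;) {
		vec = xs_read_watch_stub();
		if (vec == NULL) {
			fake_sleep(1);	/* avoid 100% cpu when xenstored is gone */
			continue;
		}
		return vec;
	}
}
```

Unlike a bare `continue`, the back-off keeps retrying (so a late-starting xenstored still works) without pinning a CPU; unlike `break`, it doesn't abandon existing use cases.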







3) (this issue).  If anyone starts xenstore-watch with no xenstored
running at all, it blocks in D in the kernel.


Should be handled with solution for 1).



The cause is the special handling for watch/unwatch commands which,
instead of just queuing up the data for xenstore, explicitly waits for
an OK for registering the watch.  This causes a write() system call to
block waiting for a non-existent entity to reply.

So while this patch does resolve the major usability issue I found (I
can't even SIGINT and get my terminal back), I think there are issues.

The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
require special handling.  The main kernel thread for processing
incoming data from xenstored does need to know how to associate each
async XS_WATCH_EVENT to the caller who watched the path.

Therefore, depending on when this cancellation hits, we might be in any
of the following states:

1) the watch is queued in the kernel, but not even sent to xenstored yet
2) the watch is queued in the xenstored ring, but not acted upon
3) the watch is queued in the xenstored ring, and the xenstored has 

Re: [PATCH 4/5] xen/xenbus: Count pending messages for each watch

2020-12-17 Thread Jürgen Groß

On 17.12.20 09:17, SeongJae Park wrote:

From: SeongJae Park 

This commit adds a counter of pending messages for each watch in the
struct.  It is used to skip unnecessary pending messages lookup in
'unregister_xenbus_watch()'.  It could also be used in 'will_handle'
callback.

This is part of XSA-349

This is upstream commit 3dc86ca6b4c8cfcba9da7996189d1b5a358a94fc

Cc: sta...@vger.kernel.org
Signed-off-by: SeongJae Park 
Reported-by: Michael Kurth 
Reported-by: Pawel Wieczorkiewicz 
Signed-off-by: Author Redacted 
Reviewed-by: Juergen Gross 
Signed-off-by: Juergen Gross 
---
  drivers/xen/xenbus/xenbus_xs.c | 30 ++++++++++++++++++------------
  include/xen/xenbus.h   |  2 ++
  2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 0ea1c259f2f1..420d478e1708 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -701,6 +701,8 @@ int register_xenbus_watch(struct xenbus_watch *watch)
  
  	sprintf(token, "%lX", (long)watch);
  
+	watch->nr_pending = 0;

+


I'm missing the incrementing of nr_pending, which was present in the
upstream patch.


Juergen


	down_read(&xs_state.watch_mutex);

	spin_lock(&watches_lock);

@@ -750,12 +752,15 @@ void unregister_xenbus_watch(struct xenbus_watch *watch)
  
  	/* Cancel pending watch events. */

	spin_lock(&watch_events_lock);
-	list_for_each_entry_safe(msg, tmp, &watch_events, list) {
-		if (msg->u.watch.handle != watch)
-			continue;
-		list_del(&msg->list);
-		kfree(msg->u.watch.vec);
-		kfree(msg);
+	if (watch->nr_pending) {
+		list_for_each_entry_safe(msg, tmp, &watch_events, list) {
+			if (msg->u.watch.handle != watch)
+				continue;
+			list_del(&msg->list);
+			kfree(msg->u.watch.vec);
+			kfree(msg);
+		}
+		watch->nr_pending = 0;
	}
	spin_unlock(&watch_events_lock);
  
@@ -802,7 +807,6 @@ void xs_suspend_cancel(void)
  
  static int xenwatch_thread(void *unused)

  {
-   struct list_head *ent;
struct xs_stored_msg *msg;
  
  	for (;;) {

@@ -815,13 +819,15 @@ static int xenwatch_thread(void *unused)
		mutex_lock(&xenwatch_mutex);

		spin_lock(&watch_events_lock);
-		ent = watch_events.next;
-		if (ent != &watch_events)
-			list_del(ent);
+		msg = list_first_entry_or_null(&watch_events,
+				struct xs_stored_msg, list);
+		if (msg) {
+			list_del(&msg->list);
+			msg->u.watch.handle->nr_pending--;
+		}
		spin_unlock(&watch_events_lock);

-		if (ent != &watch_events) {
-			msg = list_entry(ent, struct xs_stored_msg, list);
+		if (msg) {
msg->u.watch.handle->callback(
msg->u.watch.handle,
(const char **)msg->u.watch.vec,
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 1772507dc2c9..ed9e7e3307b7 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -58,6 +58,8 @@ struct xenbus_watch
/* Path being watched. */
const char *node;
  
+	unsigned int nr_pending;

+
/*
 * Called just before enqueing new event while a spinlock is held.
 * The event will be discarded if this callback returns false.







Re: [PATCH] xen: Kconfig: remove X86_64 depends from XEN_512GB

2020-12-16 Thread Jürgen Groß

On 16.12.20 15:08, Jason Andryuk wrote:

commit bfda93aee0ec ("xen: Kconfig: nest Xen guest options")
accidentally re-added X86_64 as a dependency to XEN_512GB.  It was
originally removed in commit a13f2ef168cb ("x86/xen: remove 32-bit Xen
PV guest support").  Remove it again.

Fixes: bfda93aee0ec ("xen: Kconfig: nest Xen guest options")
Reported-by: Boris Ostrovsky 
Signed-off-by: Jason Andryuk 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-16 Thread Jürgen Groß

On 15.12.20 21:59, Andrew Cooper wrote:

On 15/12/2020 11:10, Juergen Gross wrote:

In case a process waits for any Xenstore action in the xenbus driver
it should be interruptible by signals.

Signed-off-by: Juergen Gross 
---
V2:
- don't special case SIGKILL as libxenstore is handling -EINTR fine
---
  drivers/xen/xenbus/xenbus_xs.c | 9 ++++++++-
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 3a06eb699f33..17c8f8a155fd 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
  
  static void *read_reply(struct xb_req_data *req)

  {
+   int ret;
+
do {
-   wait_event(req->wq, test_reply(req));
+   ret = wait_event_interruptible(req->wq, test_reply(req));
+
+   if (ret == -ERESTARTSYS && signal_pending(current)) {
+   req->msg.type = XS_ERROR;
+   return ERR_PTR(-EINTR);
+   }


So now I can talk fully about the situations which lead to this, I think
there is a bit more complexity.

It turns out there are a number of issues related to running a Xen
system with no xenstored.

1) If a xenstore-write occurs during startup before init-xenstore-domain
runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
tell the dom0 kernel that xenstored is in dom1.  This effectively
deadlocks the system.


This should be easy to solve: any request to /dev/xen/xenbus should
block upfront in case xenstored isn't up yet (could e.g. wait
interruptible until xenstored_ready is non-zero).


2) If xenstore-watch is running when xenstored dies, it spins at 100%
cpu usage making no system calls at all.  This is caused by bad error
handling from xs_watch(), and attempting to debug found:


Can you expand on "bad error handling from xs_watch()", please?



3) (this issue).  If anyone starts xenstore-watch with no xenstored
running at all, it blocks in D in the kernel.


Should be handled with solution for 1).



The cause is the special handling for watch/unwatch commands which,
instead of just queuing up the data for xenstore, explicitly waits for
an OK for registering the watch.  This causes a write() system call to
block waiting for a non-existent entity to reply.

So while this patch does resolve the major usability issue I found (I
can't even SIGINT and get my terminal back), I think there are issues.

The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
require special handling.  The main kernel thread for processing
incoming data from xenstored does need to know how to associate each
async XS_WATCH_EVENT to the caller who watched the path.

Therefore, depending on when this cancellation hits, we might be in any
of the following states:

1) the watch is queued in the kernel, but not even sent to xenstored yet
2) the watch is queued in the xenstored ring, but not acted upon
3) the watch is queued in the xenstored ring, and the xenstored has seen
it but not replied yet
4) the watch has been processed, but the XS_WATCH reply hasn't been
received yet
5) the watch has been processed, and the XS_WATCH reply received

State 5 (and a little bit) is the normal success path when xenstored has
acted upon the request, and the internal kernel infrastructure is set up
appropriately to handle XS_WATCH_EVENTs.

States 1 and 2 can be very common if there is no xenstored (or at least,
it hasn't started up yet).  In reality, there is either no xenstored, or
it is up and running (and for a period of time during system startup,
these cases occur in sequence).


Yes, this is the reason we can't just reject a user request if xenstored
hasn't been detected yet: it could be just starting.



As soon as the XS_WATCH event has been written into the xenstored ring,
it is not safe to cancel.  You've committed to xenstored processing the
request (if it is up).


I'm not sure this is true. Cancelling it might result in a stale watch
in xenstored, but there shouldn't be a problem related to that. In case
that watch fires the event will normally be discarded by the kernel as
no matching watch is found in the kernel's data. In case a new watch
has been setup with the same struct xenbus_watch address (which is used
as the token), then this new watch might fire without the node of the
new watch having changed, but spurious watch events are defined to be
okay (OTOH the path in the event might look strange to the handler).



If xenstored is actually up and running, its fine and necessary to
block.  The request will be processed in due course (timing subject to
the client and server load).  If xenstored isn't up, blocking isn't ok.

Therefore, I think we need to distinguish "not yet on the ring" from "on
the ring", as our distinction as to whether cancelling is safe, and
ensure we don't 

Re: [PATCH] xen: remove trailing semicolon in macro definition

2020-12-15 Thread Jürgen Groß

On 27.11.20 17:07, t...@redhat.com wrote:

From: Tom Rix 

The macro use will already have a semicolon.

Signed-off-by: Tom Rix 


Applied to: xen/tip.git for-linus-5.11


Juergen




Re: [PATCH 058/141] xen-blkfront: Fix fall-through warnings for Clang

2020-12-15 Thread Jürgen Groß

On 20.11.20 19:32, Gustavo A. R. Silva wrote:

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva 


Applied to: xen/tip.git for-linus-5.11


Juergen




Re: [PATCH 138/141] xen/manage: Fix fall-through warnings for Clang

2020-12-15 Thread Jürgen Groß

On 20.11.20 19:40, Gustavo A. R. Silva wrote:

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva 


Applied to: xen/tip.git for-linus-5.11


Juergen




Re: [PATCH 0/2] Remove Xen PVH dependency on PCI

2020-12-15 Thread Jürgen Groß

On 14.10.20 19:53, Jason Andryuk wrote:

A Xen PVH domain doesn't have a PCI bus or devices, so it doesn't need
PCI support built in.  Currently, XEN_PVH depends on XEN_PVHVM which
depends on PCI.

The first patch introduces XEN_PVHVM_GUEST as a toplevel item and
changes XEN_PVHVM to a hidden variable.  This allows XEN_PVH to depend
on XEN_PVHVM without PCI while XEN_PVHVM_GUEST depends on PCI.

The second patch moves XEN_512GB to clean up the option nesting.

Jason Andryuk (2):
   xen: Remove Xen PVH/PVHVM dependency on PCI
   xen: Kconfig: nest Xen guest options

  arch/x86/xen/Kconfig | 38 ++
  drivers/xen/Makefile |  2 +-
  2 files changed, 23 insertions(+), 17 deletions(-)



Series applied to: xen/tip.git for-linus-5.11


Juergen




Re: [PATCH -next v2] x86/xen: Convert to DEFINE_SHOW_ATTRIBUTE

2020-12-15 Thread Jürgen Groß

On 17.09.20 14:55, Qinglang Miao wrote:

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Qinglang Miao 


Applied to: xen/tip.git for-linus-5.11


Juergen




Re: [PATCH v2 00/12] x86: major paravirt cleanup

2020-12-15 Thread Jürgen Groß

On 15.12.20 15:54, Peter Zijlstra wrote:

On Tue, Dec 15, 2020 at 03:18:34PM +0100, Peter Zijlstra wrote:

Ah, I was waiting for Josh to have an opinion (and then sorta forgot
about the whole thing again). Let me refresh and provide at least a
Changelog.


How's this then?


Thanks, will add it into my series.


Juergen



---
Subject: objtool: Alternatives vs ORC, the hard way
From: Peter Zijlstra 
Date: Mon, 23 Nov 2020 14:43:17 +0100

Alternatives pose an interesting problem for unwinders because, from
the unwinder's PoV, we're just executing instructions; it has no idea
the text is modified, nor any way of retrieving what with.

Therefore the stance has been that alternatives must not change stack
state, as encoded by commit: 7117f16bf460 ("objtool: Fix ORC vs
alternatives"). This obviously guarantees that whatever actual
instructions end up in the text, the unwind information is correct.

However, there is one additional source of text patching that isn't
currently visible to objtool: paravirt immediate patching. And it
turns out one of these violates the rule.

As part of cleaning that up, the unfortunate reality is that objtool
now has to deal with alternatives modifying unwind state and validate
the combination is valid and generate ORC data to match.

The problem is that a single instance of unwind information (ORC) must
capture and correctly unwind all alternatives. Since the trivially
correct mandate is out, implement the straightforward brute-force
approach:

  1) generate CFI information for each alternative

  2) unwind every alternative with the merge-sort of the previously
 generated CFI information -- O(n^2)

  3) for any possible conflict: yell.

  4) Generate ORC with merge-sort

Specifically for 3 there are two possible classes of conflicts:

  - the merge-sort itself could find conflicting CFI for the same
offset.

  - the unwind can fail with the merged CFI.

In specific, this allows us to deal with:

	Alt1			Alt2			Alt3

  0x00	CALL *pv_ops.save_fl	CALL xen_save_fl	PUSHF
  0x01  POP %RAX
  0x02  NOP
  ...
  0x05  NOP
  ...
  0x07   

The unwind information for offset-0x00 is identical for all 3
alternatives. Similarly offset-0x05 and higher also are identical (and
the same as 0x00). However offset-0x01 has deviating CFI, but that is
only relevant for Alt3, neither of the other alternative instruction
streams will ever hit that offset.
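Steps 2 and 3 above can be modeled in miniature: merge per-offset unwind info from two alternative streams, yelling on a genuine conflict. This is a toy model for the changelog's argument, not objtool's actual data structures; 0 stands for "no CFI recorded at this offset":

```c
#include <assert.h>

/*
 * Toy model of the merge: at each offset, take whichever stream has
 * CFI recorded.  Two different non-zero values at the same offset are
 * a real conflict and make the merge fail (step 3: yell).
 */
static int merge_cfi(const int *a, const int *b, int *out, int n)
{
	for (int i = 0; i < n; i++) {
		if (a[i] && b[i] && a[i] != b[i])
			return -1;	/* conflicting CFI for the same offset */
		out[i] = a[i] ? a[i] : b[i];
	}
	return 0;
}
```

This is why the Alt1/Alt2/Alt3 example above is acceptable: the deviating CFI at offset 0x01 exists only in one stream, so no offset ever carries two different non-zero entries.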

Signed-off-by: Peter Zijlstra (Intel) 
---
  tools/objtool/check.c   |  180 

  tools/objtool/check.h   |5 +
  tools/objtool/orc_gen.c |  180 
+++-
  3 files changed, 290 insertions(+), 75 deletions(-)

--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1096,6 +1096,32 @@ static int handle_group_alt(struct objto
return -1;
}
  
+	/*
+	 * Add the filler NOP, required for alternative CFI.
+	 */
+	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
+		struct instruction *nop = malloc(sizeof(*nop));
+		if (!nop) {
+			WARN("malloc failed");
+			return -1;
+		}
+		memset(nop, 0, sizeof(*nop));
+		INIT_LIST_HEAD(&nop->alts);
+		INIT_LIST_HEAD(&nop->stack_ops);
+		init_cfi_state(&nop->cfi);
+
+		nop->sec = last_new_insn->sec;
+		nop->ignore = last_new_insn->ignore;
+		nop->func = last_new_insn->func;
+		nop->alt_group = alt_group;
+		nop->offset = last_new_insn->offset + last_new_insn->len;
+		nop->type = INSN_NOP;
+		nop->len = special_alt->orig_len - special_alt->new_len;
+
+		list_add(&nop->list, &last_new_insn->list);
+		last_new_insn = nop;
+	}
+
	if (fake_jump)
		list_add(&fake_jump->list, &last_new_insn->list);
  
@@ -2237,18 +2263,12 @@ static int handle_insn_ops(struct instru

struct stack_op *op;
  
	list_for_each_entry(op, &insn->stack_ops, list) {
-		struct cfi_state old_cfi = state->cfi;
		int res;

		res = update_cfi_state(insn, &state->cfi, op);
		if (res)
			return res;

-		if (insn->alt_group && memcmp(&insn->cfi, &old_cfi, sizeof(struct cfi_state))) {
-			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
-			return -1;
-		}
-
if (op->dest.type == OP_DEST_PUSHF) {
if (!state->uaccess_stack) {
state->uaccess_stack = 1;
@@ -2460,19 +2480,137 @@ static int validate_return(struct symbol
   * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
   * 

Re: [PATCH v2 00/12] x86: major paravirt cleanup

2020-12-15 Thread Jürgen Groß

Peter,

On 23.11.20 14:43, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 01:53:42PM +0100, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:18PM +0100, Juergen Gross wrote:

  30 files changed, 325 insertions(+), 598 deletions(-)


Much awesome ! I'll try and get that objtool thing sorted.


This seems to work for me. It isn't 100% accurate, because it doesn't
know about the direct call instruction, but I can either fudge that or
switch to static_call(), which will cure that.

It's not exactly pretty, but it should be straight forward.


Are you planning to send this out as an "official" patch, or should I
include it in my series (in this case I'd need a variant with a proper
commit message)?

I'd like to have this settled soon, as I'm going to send V2 of my
series hopefully this week.


Juergen



Index: linux-2.6/tools/objtool/check.c
===
--- linux-2.6.orig/tools/objtool/check.c
+++ linux-2.6/tools/objtool/check.c
@@ -1090,6 +1090,32 @@ static int handle_group_alt(struct objto
return -1;
}
  
+	/*
+	 * Add the filler NOP, required for alternative CFI.
+	 */
+	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
+		struct instruction *nop = malloc(sizeof(*nop));
+		if (!nop) {
+			WARN("malloc failed");
+			return -1;
+		}
+		memset(nop, 0, sizeof(*nop));
+		INIT_LIST_HEAD(&nop->alts);
+		INIT_LIST_HEAD(&nop->stack_ops);
+		init_cfi_state(&nop->cfi);
+
+		nop->sec = last_new_insn->sec;
+		nop->ignore = last_new_insn->ignore;
+		nop->func = last_new_insn->func;
+		nop->alt_group = alt_group;
+		nop->offset = last_new_insn->offset + last_new_insn->len;
+		nop->type = INSN_NOP;
+		nop->len = special_alt->orig_len - special_alt->new_len;
+
+		list_add(&nop->list, &last_new_insn->list);
+		last_new_insn = nop;
+	}
+
	if (fake_jump)
		list_add(&fake_jump->list, &last_new_insn->list);
  
@@ -2190,18 +2216,12 @@ static int handle_insn_ops(struct instru

struct stack_op *op;
  
	list_for_each_entry(op, &insn->stack_ops, list) {
-		struct cfi_state old_cfi = state->cfi;
		int res;

		res = update_cfi_state(insn, &state->cfi, op);
		if (res)
			return res;

-		if (insn->alt_group && memcmp(&insn->cfi, &old_cfi, sizeof(struct cfi_state))) {
-			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
-			return -1;
-		}
-
if (op->dest.type == OP_DEST_PUSHF) {
if (!state->uaccess_stack) {
state->uaccess_stack = 1;
@@ -2399,19 +2419,137 @@ static int validate_return(struct symbol
   * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
   * states which then results in ORC entries, which we just said we didn't 
want.
   *
- * Avoid them by copying the CFI entry of the first instruction into the whole
- * alternative.
+ * Avoid them by copying the CFI entry of the first instruction into the hole.
   */
-static void fill_alternative_cfi(struct objtool_file *file, struct instruction 
*insn)
+static void __fill_alt_cfi(struct objtool_file *file, struct instruction *insn)
  {
struct instruction *first_insn = insn;
int alt_group = insn->alt_group;
  
-	sec_for_each_insn_continue(file, insn) {

+   sec_for_each_insn_from(file, insn) {
if (insn->alt_group != alt_group)
break;
-   insn->cfi = first_insn->cfi;
+
+   if (!insn->visited)
+   insn->cfi = first_insn->cfi;
+   }
+}
+
+static void fill_alt_cfi(struct objtool_file *file, struct instruction 
*alt_insn)
+{
+   struct alternative *alt;
+
+   __fill_alt_cfi(file, alt_insn);
+
+	list_for_each_entry(alt, &alt_insn->alts, list)
+   __fill_alt_cfi(file, alt->insn);
+}
+
+static struct instruction *
+__find_unwind(struct objtool_file *file,
+ struct instruction *insn, unsigned long offset)
+{
+   int alt_group = insn->alt_group;
+   struct instruction *next;
+   unsigned long off = 0;
+
+   while ((off + insn->len) <= offset) {
+   next = next_insn_same_sec(file, insn);
+   if (next && next->alt_group != alt_group)
+   next = NULL;
+
+   if (!next)
+   break;
+
+   off += insn->len;
+   insn = next;
}
+
+   return insn;
+}
+
+struct instruction *
+find_alt_unwind(struct objtool_file *file,
+   struct instruction *alt_insn, unsigned long offset)
+{
+   struct instruction *fit;
+   struct alternative *alt;
+   

Re: xen/evtchn: Interrupt for port 34, but apparently not enabled; per-user 00000000a86a4c1b on 5.10

2020-12-15 Thread Jürgen Groß

On 15.12.20 08:27, Jürgen Groß wrote:

On 14.12.20 22:25, Julien Grall wrote:

Hi Juergen,

When testing Linux 5.10 dom0, I could reliably hit the following
warning when using the 2L event channel ABI:


[  589.591737] Interrupt for port 34, but apparently not enabled;
per-user 00000000a86a4c1b
[  589.593259] WARNING: CPU: 0 PID:  at 
/home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:170 
evtchn_interrupt+0xeb/0x100

[  589.595514] Modules linked in:
[  589.596145] CPU: 0 PID:  Comm: qemu-system-i38 Tainted: G 
W 5.10.0+ #180
[  589.597708] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014

[  589.599782] RIP: e030:evtchn_interrupt+0xeb/0x100
[  589.600698] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 
00 e8 d9 10 ca ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 31 3d 82 e8 65 
29 a0 ff <0f> 0b e9 42 ff ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 
00 0f

[  589.604087] RSP: e02b:c90040003e70 EFLAGS: 00010086
[  589.605102] RAX:  RBX: 888102091800 RCX: 
0027
[  589.606445] RDX:  RSI: 88817fe19150 RDI: 
88817fe19158
[  589.607790] RBP: 88810f5ab980 R08: 0001 R09: 
00328980
[  589.609134] R10:  R11: c90040003c70 R12: 
888107fd3c00
[  589.610484] R13: c90040003ed4 R14:  R15: 
88810f5ffd80
[  589.611828] FS:  7f960c4b8ac0() GS:88817fe0() 
knlGS:

[  589.613348] CS:  1e030 DS:  ES:  CR0: 80050033
[  589.614525] CR2: 7f17ee72e000 CR3: 00010f5b6000 CR4: 
00050660

[  589.615874] Call Trace:
[  589.616402]  
[  589.616855]  __handle_irq_event_percpu+0x4e/0x2c0
[  589.617784]  handle_irq_event_percpu+0x30/0x80
[  589.618660]  handle_irq_event+0x3a/0x60
[  589.619428]  handle_edge_irq+0x9b/0x1f0
[  589.620209]  generic_handle_irq+0x4f/0x60
[  589.621008]  evtchn_2l_handle_events+0x160/0x280
[  589.621913]  __xen_evtchn_do_upcall+0x66/0xb0
[  589.622767]  __xen_pv_evtchn_do_upcall+0x11/0x20
[  589.623665]  asm_call_irq_on_stack+0x12/0x20
[  589.624511]  
[  589.624978]  xen_pv_evtchn_do_upcall+0x77/0xf0
[  589.625848]  exc_xen_hypervisor_callback+0x8/0x10

This can be reproduced when creating/destroying guests in a loop, 
although I have struggled to reproduce it on vanilla Xen.


After several hours of debugging, I think I have found the root cause.

While we only expect the unmask to happen when the event channel is 
EOIed, there is an unmask happening as part of handle_edge_irq() 
because the interrupt was seen as pending by another vCPU 
(IRQS_PENDING is set).


It turns out that the event channel is set for multiple vCPUs in 
cpu_evtchn_mask. This is happening because the affinity is not cleared 
when freeing an event channel.


The implementation of evtchn_2l_handle_events() will look for all the 
active interrupts for the current vCPU and later on clear the pending 
bit (via the ack() callback). IOW, I believe, this is not an atomic 
operation.


Even if Xen will notify the event to a single vCPU, evtchn_pending_sel 
may still be set on the other vCPU (thanks to a different event 
channel). Therefore, there is a chance that two vCPUs will try to 
handle the same interrupt.


The IRQ handler handle_edge_irq() is able to deal with that and will 
mask/unmask the interrupt. This will mess up the lateeoi logic 
(although I managed to reproduce it once without XSA-332).


Thanks for the analysis!

My initial idea to fix the problem was to switch the affinity from CPU 
X to CPU0 when the event channel is freed.


However, I am not sure this is enough because I haven't found anything 
yet preventing a race between evtchn_2l_handle_events() and 
evtchn_2l_bind_vcpu().


So maybe we want to introduce refcounting (if there is nothing 
provided by the IRQ framework) and only unmask when the counter drops 
to 0.


Any opinions?


I think we don't need a refcount, but just the internal states "masked"
and "eoi_pending" and unmask only if both are false. "masked" will be
set when the event is being masked. When delivering a lateeoi irq
"eoi_pending" will be set and "masked" reset. "masked" will be reset
when a normal unmask is happening. And "eoi_pending" will be reset
when a lateeoi is signaled. Any reset of "masked" and "eoi_pending"
will check the other flag and do an unmask if both are false.

I'll write a patch.
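[Editorial note: the two-flag scheme described above can be sketched as a
user-space toy model. This is hypothetical illustrative code, not the actual
drivers/xen/events implementation; all names are made up here.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: an event channel is only really unmasked when neither
 * "masked" nor "eoi_pending" is set.
 */
struct evtchn_model {
	bool masked;       /* event is explicitly masked */
	bool eoi_pending;  /* lateeoi irq delivered, eoi not yet signaled */
	int real_unmasks;  /* counts actual unmask operations */
};

static void maybe_unmask(struct evtchn_model *e)
{
	/* any reset of either flag checks the other before unmasking */
	if (!e->masked && !e->eoi_pending)
		e->real_unmasks++;
}

static void model_mask(struct evtchn_model *e)
{
	e->masked = true;
}

static void model_deliver_lateeoi(struct evtchn_model *e)
{
	e->eoi_pending = true;  /* delivering a lateeoi irq sets eoi_pending */
	e->masked = false;      /* ... and resets masked */
}

static void model_unmask(struct evtchn_model *e)
{
	e->masked = false;      /* a normal unmask resets masked ... */
	maybe_unmask(e);        /* ... but only unmasks if no eoi pending */
}

static void model_eoi(struct evtchn_model *e)
{
	e->eoi_pending = false; /* lateeoi signaled */
	maybe_unmask(e);
}
```

Replaying the race from the thread (a spurious mask/unmask from
handle_edge_irq() while an eoi is pending) shows the channel stays masked
until the eoi actually arrives.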


Julien, could you please test the attached (only build tested) patch?


Juergen
From 2ce5786fd6f29ec09ad653e30e089042ea62b309 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Tue, 15 Dec 2020 10:37:11 +0100
Subject: [PATCH] xen/events: don't unmask an event channel when an eoi is
 pending

An event channel should be kept masked when an eoi is pending for it.
When being migrated to another cpu it might be unma

Re: xen/evtchn: Interrupt for port 34, but apparently not enabled; per-user 00000000a86a4c1b on 5.10

2020-12-14 Thread Jürgen Groß

On 14.12.20 22:25, Julien Grall wrote:

Hi Juergen,

When testing Linux 5.10 dom0, I could reliably hit the following warning 
when using the event 2L ABI:


[  589.591737] Interrupt for port 34, but apparently not enabled; 
per-user 00000000a86a4c1b
[  589.593259] WARNING: CPU: 0 PID:  at 
/home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:170 
evtchn_interrupt+0xeb/0x100

[  589.595514] Modules linked in:
[  589.596145] CPU: 0 PID:  Comm: qemu-system-i38 Tainted: G 
W 5.10.0+ #180
[  589.597708] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014

[  589.599782] RIP: e030:evtchn_interrupt+0xeb/0x100
[  589.600698] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 00 
e8 d9 10 ca ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 31 3d 82 e8 65 29 a0 
ff <0f> 0b e9 42 ff ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f

[  589.604087] RSP: e02b:c90040003e70 EFLAGS: 00010086
[  589.605102] RAX:  RBX: 888102091800 RCX: 
0027
[  589.606445] RDX:  RSI: 88817fe19150 RDI: 
88817fe19158
[  589.607790] RBP: 88810f5ab980 R08: 0001 R09: 
00328980
[  589.609134] R10:  R11: c90040003c70 R12: 
888107fd3c00
[  589.610484] R13: c90040003ed4 R14:  R15: 
88810f5ffd80
[  589.611828] FS:  7f960c4b8ac0() GS:88817fe0() 
knlGS:

[  589.613348] CS:  1e030 DS:  ES:  CR0: 80050033
[  589.614525] CR2: 7f17ee72e000 CR3: 00010f5b6000 CR4: 
00050660

[  589.615874] Call Trace:
[  589.616402]  
[  589.616855]  __handle_irq_event_percpu+0x4e/0x2c0
[  589.617784]  handle_irq_event_percpu+0x30/0x80
[  589.618660]  handle_irq_event+0x3a/0x60
[  589.619428]  handle_edge_irq+0x9b/0x1f0
[  589.620209]  generic_handle_irq+0x4f/0x60
[  589.621008]  evtchn_2l_handle_events+0x160/0x280
[  589.621913]  __xen_evtchn_do_upcall+0x66/0xb0
[  589.622767]  __xen_pv_evtchn_do_upcall+0x11/0x20
[  589.623665]  asm_call_irq_on_stack+0x12/0x20
[  589.624511]  
[  589.624978]  xen_pv_evtchn_do_upcall+0x77/0xf0
[  589.625848]  exc_xen_hypervisor_callback+0x8/0x10

This can be reproduced when creating/destroying guests in a loop, 
although I have struggled to reproduce it on vanilla Xen.


After several hours of debugging, I think I have found the root cause.

While we only expect the unmask to happen when the event channel is 
EOIed, there is an unmask happening as part of handle_edge_irq() because 
the interrupt was seen as pending by another vCPU (IRQS_PENDING is set).


It turns out that the event channel is set for multiple vCPUs in 
cpu_evtchn_mask. This is happening because the affinity is not cleared 
when freeing an event channel.


The implementation of evtchn_2l_handle_events() will look for all the 
active interrupts for the current vCPU and later on clear the pending 
bit (via the ack() callback). IOW, I believe, this is not an atomic 
operation.


Even if Xen will notify the event to a single vCPU, evtchn_pending_sel 
may still be set on the other vCPU (thanks to a different event 
channel). Therefore, there is a chance that two vCPUs will try to handle 
the same interrupt.


The IRQ handler handle_edge_irq() is able to deal with that and will 
mask/unmask the interrupt. This will mess up the lateeoi logic 
(although I managed to reproduce it once without XSA-332).


Thanks for the analysis!

My initial idea to fix the problem was to switch the affinity from CPU X 
to CPU0 when the event channel is freed.


However, I am not sure this is enough because I haven't found anything 
yet preventing a race between evtchn_2l_handle_events() and 
evtchn_2l_bind_vcpu().


So maybe we want to introduce refcounting (if there is nothing 
provided by the IRQ framework) and only unmask when the counter drops to 0.


Any opinions?


I think we don't need a refcount, but just the internal states "masked"
and "eoi_pending" and unmask only if both are false. "masked" will be
set when the event is being masked. When delivering a lateeoi irq
"eoi_pending" will be set and "masked" reset. "masked" will be reset
when a normal unmask is happening. And "eoi_pending" will be reset
when a lateeoi is signaled. Any reset of "masked" and "eoi_pending"
will check the other flag and do an unmask if both are false.

I'll write a patch.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts

2020-12-11 Thread Jürgen Groß

On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:


On 12/10/20 2:26 PM, Thomas Gleixner wrote:

All event channel setups bind the interrupt on CPU0 or the target CPU for
percpu interrupts and overwrite the affinity mask with the corresponding
cpumask. That does not make sense.

The XEN implementation of irqchip::irq_set_affinity() already picks a
single target CPU out of the affinity mask and the actual target is stored
in the effective CPU mask, so destroying the user chosen affinity mask
which might contain more than one CPU is wrong.

Change the implementation so that the channel is bound to CPU0 at the XEN
level and leave the affinity mask alone. At startup of the interrupt
affinity will be assigned out of the affinity mask and the XEN binding will
be updated.



If that's the case then I wonder whether we need this call at all and instead 
bind at startup time.


After some discussion with Thomas on IRC and xen-devel archaeology the
result is: this will be needed especially for systems running on a
single vcpu (e.g. small guests), as the .irq_set_affinity() callback
won't be called in this case when starting the irq.


Juergen




Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts

2020-12-11 Thread Jürgen Groß

On 11.12.20 11:13, Thomas Gleixner wrote:

On Fri, Dec 11 2020 at 07:17, Jürgen Groß wrote:

On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:


On 12/10/20 2:26 PM, Thomas Gleixner wrote:

All event channel setups bind the interrupt on CPU0 or the target CPU for
percpu interrupts and overwrite the affinity mask with the corresponding
cpumask. That does not make sense.

The XEN implementation of irqchip::irq_set_affinity() already picks a
single target CPU out of the affinity mask and the actual target is stored
in the effective CPU mask, so destroying the user chosen affinity mask
which might contain more than one CPU is wrong.

Change the implementation so that the channel is bound to CPU0 at the XEN
level and leave the affinity mask alone. At startup of the interrupt
affinity will be assigned out of the affinity mask and the XEN binding will
be updated.



If that's the case then I wonder whether we need this call at all and instead 
bind at startup time.


This binding to cpu0 was introduced with commit 97253eeeb792d61ed2
and I have no reason to believe the underlying problem has been
eliminated.


 "The kernel-side VCPU binding was not being correctly set for newly
  allocated or bound interdomain events.  In ARM guests where 2-level
  events were used, this would result in no interdomain events being
  handled because the kernel-side VCPU masks would all be clear.

  x86 guests would work because the irq affinity was set during irq
  setup and this would set the correct kernel-side VCPU binding."

I'm not convinced that this is really correctly analyzed because affinity
setting is done at irq startup.

 switch (__irq_startup_managed(desc, aff, force)) {
case IRQ_STARTUP_NORMAL:
ret = __irq_startup(desc);
 irq_setup_affinity(desc);
break;

which is completely architecture agnostic. So why should this magically
work on x86 and not on ARM if both are using the same XEN irqchip with
the same irqchip callbacks.


I think this might be related to _initial_ cpu binding of events and
changing the binding later. This might be handled differently in the
hypervisor.


Juergen




Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts

2020-12-10 Thread Jürgen Groß

On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:


On 12/10/20 2:26 PM, Thomas Gleixner wrote:

All event channel setups bind the interrupt on CPU0 or the target CPU for
percpu interrupts and overwrite the affinity mask with the corresponding
cpumask. That does not make sense.

The XEN implementation of irqchip::irq_set_affinity() already picks a
single target CPU out of the affinity mask and the actual target is stored
in the effective CPU mask, so destroying the user chosen affinity mask
which might contain more than one CPU is wrong.

Change the implementation so that the channel is bound to CPU0 at the XEN
level and leave the affinity mask alone. At startup of the interrupt
affinity will be assigned out of the affinity mask and the XEN binding will
be updated.



If that's the case then I wonder whether we need this call at all and instead 
bind at startup time.


This binding to cpu0 was introduced with commit 97253eeeb792d61ed2
and I have no reason to believe the underlying problem has been
eliminated.


Juergen




Re: x86/ioapic: Cleanup the timer_works() irqflags mess

2020-12-10 Thread Jürgen Groß

On 10.12.20 21:15, Thomas Gleixner wrote:

Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:

 local_irq_save(flags);
local_irq_enable();
 local_irq_restore(flags);

which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.

That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.

Reported-by: Mark Rutland 
Signed-off-by: Thomas Gleixner 
Tested-by: Mark Rutland 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH 2/2] xen: don't use page->lru for ZONE_DEVICE memory

2020-12-10 Thread Jürgen Groß

On 10.12.20 12:14, Roger Pau Monné wrote:

On Tue, Dec 08, 2020 at 07:45:00AM +0100, Jürgen Groß wrote:

On 07.12.20 21:48, Jason Andryuk wrote:

On Mon, Dec 7, 2020 at 8:30 AM Juergen Gross  wrote:


Commit 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated
memory") introduced usage of ZONE_DEVICE memory for foreign memory
mappings.

Unfortunately this collides with using page->lru for Xen backend
private page caches.

Fix that by using page->zone_device_data instead.

Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Juergen Gross 


Would it make sense to add BUG_ON(is_zone_device_page(page)) and the
opposite as appropriate to cache_enq?


No, I don't think so. At least in the CONFIG_ZONE_DEVICE case the
initial list in a PV dom0 is populated from extra memory (basically
the same, but not marked as zone device memory explicitly).


I assume it's fine for us to then use page->zone_device_data even if
the page is not explicitly marked as ZONE_DEVICE memory?


I think so, yes, as we are the owner of that page and we were fine 
using lru, too.
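[Editorial note: the field collision behind this fix can be illustrated with
a toy union. The real struct page layout in include/linux/mm_types.h is far
more involved; the names and layout below are simplified assumptions.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: the lru list head shares a union with the ZONE_DEVICE
 * words, so a backend using page->lru as a private cache link clobbers
 * ZONE_DEVICE state. zone_device_data is reserved for the page owner.
 */
struct toy_page {
	union {
		struct {                /* list-based use (page->lru) */
			struct toy_page *lru_next;
			struct toy_page *lru_prev;
		};
		struct {                /* ZONE_DEVICE use */
			void *pgmap;
			void *zone_device_data;
		};
	};
};
```

Writing the "lru" members overwrites pgmap, while the owner-private
zone_device_data word leaves it intact, which is the point of the patch.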



If that's fine, add my:

Acked-by: Roger Pau Monné 

To both patches, and thank you very much for fixing this!


UR welcome. :-)


Juergen




Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-12-09 Thread Jürgen Groß

On 09.12.20 15:02, Mark Rutland wrote:

On Wed, Dec 09, 2020 at 01:27:10PM +, Mark Rutland wrote:

On Sun, Nov 22, 2020 at 01:44:53PM -0800, Andy Lutomirski wrote:

On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:

On 20.11.20 12:59, Peter Zijlstra wrote:

If someone were to write horrible code like:

   local_irq_disable();
   local_irq_save(flags);
   local_irq_enable();
   local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in 
there somewhere...



I was just talking to Peter on IRC about implementing the same thing for
arm64, so could we put this in the generic irqflags code? IIUC we can
use raw_irqs_disabled() to do the check.

As this isn't really entry specific (and IIUC the cases this should
catch would break lockdep today), maybe we should add a new
DEBUG_IRQFLAGS for this, that DEBUG_LOCKDEP can also select?

Something like:

#define local_irq_restore(flags)   \
do {\
if (!raw_irqs_disabled_flags(flags)) {  \
trace_hardirqs_on();\
} else if (IS_ENABLED(CONFIG_DEBUG_IRQFLAGS)) { \
if (unlikely(raw_irqs_disabled()))  \


Whoops; that should be !raw_irqs_disabled().


warn_bogus_irqrestore();\
}   \
raw_local_irq_restore(flags);   \
 } while (0)

... perhaps? (ignoring however we deal with once-ness).


If no-one shouts in the next day or two I'll spin this as its own patch.


Fine with me. So I'll just ignore a potential error case in my patch.

Thanks,


Juergen





Re: [PATCH v2 07/12] x86: add new features for paravirt patching

2020-12-09 Thread Jürgen Groß

On 09.12.20 13:03, Borislav Petkov wrote:

On Wed, Dec 09, 2020 at 08:30:53AM +0100, Jürgen Groß wrote:

Hey, I already suggested to use ~FEATURE for that purpose (see
https://lore.kernel.org/lkml/f105a63d-6b51-3afb-83e0-e899ea408...@suse.com/


Great minds think alike!

:-P


I'd rather make the syntax:

ALTERNATIVE_TERNARY <initial instruction> <feature> <instruction if feature set> <instruction if feature not set>

as this ...


Sure, that is ok too.


... would match perfectly to this interpretation.


Yap.


Hmm, using flags is an alternative (pun intended :-) ).


LOL!

Btw, pls do check how much the vmlinux size of an allyesconfig grows
with this as we will be adding a byte per patch site. Not that it would
matter too much - the flags are a long way a comin'. :-)


No, this is needed for non-Xen cases, too (especially pv spinlocks).


I see.


Can you give an example here pls why the paravirt patching needs to run
first?


Okay.


I meant an example for me to have a look at. :)

If possible pls.


Ah, okay.

Lets take the spin_unlock() case. With patch 11 of the series this is

PVOP_ALT_VCALLEE1(lock.queued_spin_unlock, lock,
  "movb $0, (%%" _ASM_ARG1 ");",
  X86_FEATURE_NO_PVUNLOCK);

which boils down to ALTERNATIVE "call *lock.queued_spin_unlock"
"movb $0,(%rdi)" X86_FEATURE_NO_PVUNLOCK

The initial (paravirt) code is an indirect call in order to allow
spin_unlock() before paravirt/alternative patching takes place.

Paravirt patching will then replace the indirect call with a direct call
to the correct unlock function. Then alternative patching might replace
the direct call to the bare metal unlock with a plain "movb $0,(%rdi)"
in case pvlocks are not enabled.

In case alternative patching would occur first, the indirect call might
be replaced with the "movb ...", and then paravirt patching would
clobber that with the direct call, resulting in the bare metal
optimization being removed again.
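[Editorial note: the ordering argument above can be sketched as a toy model.
This is hypothetical illustrative code; the real patching passes rewrite
machine code at the call site, not enum values.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the three states a pv spin_unlock call site can be in. */
enum site { INDIRECT_CALL, DIRECT_CALL, MOVB_ZERO };

/* paravirt patching always overwrites the site with a direct call */
static enum site apply_paravirt(enum site s)
{
	(void)s;
	return DIRECT_CALL;
}

/* alternative patching replaces the site only if pvlocks are disabled */
static enum site apply_alternative(enum site s, bool no_pvunlock)
{
	return no_pvunlock ? MOVB_ZERO : s;
}
```

Running paravirt patching first and alternatives second keeps the bare-metal
"movb $0,(%rdi)" optimization; the reverse order lets paravirt patching
clobber it again with the direct call.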


Juergen




Re: [PATCH v2 07/12] x86: add new features for paravirt patching

2020-12-08 Thread Jürgen Groß

On 08.12.20 19:43, Borislav Petkov wrote:

On Fri, Nov 20, 2020 at 12:46:25PM +0100, Juergen Gross wrote:

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..ffa23c655412 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -237,6 +237,9 @@
  #define X86_FEATURE_VMCALL( 8*32+18) /* "" Hypervisor supports 
the VMCALL instruction */
  #define X86_FEATURE_VMW_VMMCALL   ( 8*32+19) /* "" VMware prefers 
VMMCALL hypercall instruction */
  #define X86_FEATURE_SEV_ES( 8*32+20) /* AMD Secure Encrypted 
Virtualization - Encrypted State */
+#define X86_FEATURE_NOT_XENPV  ( 8*32+21) /* "" Inverse of 
X86_FEATURE_XENPV */
+#define X86_FEATURE_NO_PVUNLOCK( 8*32+22) /* "" No PV unlock 
function */
+#define X86_FEATURE_NO_VCPUPREEMPT ( 8*32+23) /* "" No PV 
vcpu_is_preempted function */


Ew, negative features. ;-\


Hey, I already suggested to use ~FEATURE for that purpose (see
https://lore.kernel.org/lkml/f105a63d-6b51-3afb-83e0-e899ea408...@suse.com/ 
).




/me goes forward and looks at usage sites:

+   ALTERNATIVE_2 "jmp *paravirt_iret(%rip);",\
+ "jmp native_iret;", X86_FEATURE_NOT_XENPV,  \
+ "jmp xen_iret;", X86_FEATURE_XENPV

Can we make that:

ALTERNATIVE_TERNARY "jmp *paravirt_iret(%rip);",
  "jmp xen_iret;", X86_FEATURE_XENPV,
  "jmp native_iret;", X86_FEATURE_XENPV,


Would we really want to specify the feature twice?

I'd rather make the syntax:

ALTERNATIVE_TERNARY <initial instruction> <feature> <instruction if feature set> <instruction if feature not set>

as this ...



where the last two lines are supposed to mean

X86_FEATURE_XENPV ? "jmp xen_iret;" : "jmp native_iret;"


... would match perfectly to this interpretation.



Now, in order to convey that logic to apply_alternatives(), you can do:

struct alt_instr {
 s32 instr_offset;   /* original instruction */
 s32 repl_offset;/* offset to replacement instruction */
 u16 cpuid;  /* cpuid bit set for replacement */
 u8  instrlen;   /* length of original instruction */
 u8  replacementlen; /* length of new instruction */
 u8  padlen; /* length of build-time padding */
u8  flags;  /* patching flags */<--- 
THIS
} __packed;


Hmm, using flags is an alternative (pun intended :-) ).



and yes, we have had the flags thing in a lot of WIP diffs over the
years but we've never come to actually needing it.

Anyway, then, apply_alternatives() will do:

if (flags & ALT_NOT_FEATURE)

or something like that - I'm bad at naming stuff - then it should patch
only when the feature is NOT set and vice versa.

There in that

if (!boot_cpu_has(a->cpuid)) {

branch.

Hmm?


Fine with me (I'd prefer my ALTERNATIVE_TERNARY syntax, though).




  /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
  #define X86_FEATURE_FSGSBASE  ( 9*32+ 0) /* RDFSBASE, WRFSBASE, 
RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2400ad62f330..f8f9700719cf 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -593,6 +593,18 @@ int alternatives_text_reserved(void *start, void *end)
  #endif /* CONFIG_SMP */
  
  #ifdef CONFIG_PARAVIRT

+static void __init paravirt_set_cap(void)
+{
+   if (!boot_cpu_has(X86_FEATURE_XENPV))
+   setup_force_cpu_cap(X86_FEATURE_NOT_XENPV);
+
+   if (pv_is_native_spin_unlock())
+   setup_force_cpu_cap(X86_FEATURE_NO_PVUNLOCK);
+
+   if (pv_is_native_vcpu_is_preempted())
+   setup_force_cpu_cap(X86_FEATURE_NO_VCPUPREEMPT);
+}
+
  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 struct paravirt_patch_site *end)
  {
@@ -616,6 +628,8 @@ void __init_or_module apply_paravirt(struct 
paravirt_patch_site *start,
  }
  extern struct paravirt_patch_site __start_parainstructions[],
__stop_parainstructions[];
+#else
+static void __init paravirt_set_cap(void) { }
  #endif/* CONFIG_PARAVIRT */
  
  /*

@@ -723,6 +737,18 @@ void __init alternative_instructions(void)
 * patching.
 */
  
+	paravirt_set_cap();


Can that be called from somewhere in the Xen init path and not from
here? Somewhere before check_bugs() gets called.


No, this is needed for non-Xen cases, too (especially pv spinlocks).




+   /*
+* First patch paravirt functions, such that we overwrite the indirect
+* call with the direct call.
+*/
+   apply_paravirt(__parainstructions, __parainstructions_end);
+
+   /*
+* Then patch alternatives, such that those paravirt calls that are in
+* alternatives can 

Re: [PATCH 2/2] xen: don't use page->lru for ZONE_DEVICE memory

2020-12-07 Thread Jürgen Groß

On 07.12.20 21:48, Jason Andryuk wrote:

On Mon, Dec 7, 2020 at 8:30 AM Juergen Gross  wrote:


Commit 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated
memory") introduced usage of ZONE_DEVICE memory for foreign memory
mappings.

Unfortunately this collides with using page->lru for Xen backend
private page caches.

Fix that by using page->zone_device_data instead.

Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Juergen Gross 


Would it make sense to add BUG_ON(is_zone_device_page(page)) and the
opposite as appropriate to cache_enq?


No, I don't think so. At least in the CONFIG_ZONE_DEVICE case the
initial list in a PV dom0 is populated from extra memory (basically
the same, but not marked as zone device memory explicitly).

Juergen




Re: [PATCH] Revert "xen: add helpers to allocate unpopulated memory"

2020-12-06 Thread Jürgen Groß

On 06.12.20 18:22, Marek Marczykowski-Górecki wrote:

This reverts commit 9e2369c06c8a181478039258a4598c1ddd2cadfa.

On a Xen PV dom0, with NVME disk, this makes the dom0 crash when starting
a domain. This looks like some bad interaction between xen-blkback and


xen-scsiback has the same use pattern.


NVME driver, both using ZONE_DEVICE. Since the author is on leave now,
revert the change until proper solution is developed.

The specific crash message is:

 general protection fault, probably for non-canonical address 
0xdead0100:  [#1] SMP NOPTI
 CPU: 1 PID: 134 Comm: kworker/u12:2 Not tainted 5.9.9-1.qubes.x86_64 #1
 Hardware name: LENOVO 20M9CTO1WW/20M9CTO1WW, BIOS N2CET50W (1.33 ) 
01/15/2020
 Workqueue: dm-thin do_worker [dm_thin_pool]
 RIP: e030:nvme_map_data+0x300/0x3a0 [nvme]
 Code: b8 fe ff ff e9 a8 fe ff ff 4c 8b 56 68 8b 5e 70 8b 76 74 49 8b 02 48 c1 e8 
33 83 e0 07 83 f8 04 0f 85 f2 fe ff ff 49 8b 42 08 <83> b8 d0 00 00 00 04 0f 85 
e1 fe ff ff e9 38 fd ff ff 8b 55 70 be
 RSP: e02b:c900010e7ad8 EFLAGS: 00010246
 RAX: dead0100 RBX: 1000 RCX: 8881a58f5000
 RDX: 1000 RSI:  RDI: 8881a679e000
 RBP: 8881a5ef4c80 R08: 8881a5ef4c80 R09: 0002
 R10: ea0003dfff40 R11: 0008 R12: 8881a679e000
 R13: c900010e7b20 R14: 8881a70b5980 R15: 8881a679e000
 FS:  () GS:8881b544() 
knlGS:
 CS:  e030 DS:  ES:  CR0: 80050033
 CR2: 01d64408 CR3: 0001aa2c CR4: 00050660
 Call Trace:
  nvme_queue_rq+0xa7/0x1a0 [nvme]
  __blk_mq_try_issue_directly+0x11d/0x1e0
  ? add_wait_queue_exclusive+0x70/0x70
  blk_mq_try_issue_directly+0x35/0xc0
  blk_mq_submit_bio+0x58f/0x660
  __submit_bio_noacct+0x300/0x330
  process_shared_bio+0x126/0x1b0 [dm_thin_pool]
  process_cell+0x226/0x280 [dm_thin_pool]
  process_thin_deferred_cells+0x185/0x320 [dm_thin_pool]
  process_deferred_bios+0xa4/0x2a0 [dm_thin_pool]
  do_worker+0xcc/0x130 [dm_thin_pool]
  process_one_work+0x1b4/0x370
  worker_thread+0x4c/0x310
  ? process_one_work+0x370/0x370
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 Modules linked in: loop snd_seq_dummy snd_hrtimer nf_tables nfnetlink vfat 
fat snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_intel_hda_common 
snd_soc_hdac_hda snd_sof_xtensa_dsp snd_sof_intel_hda snd_sof snd_soc_skl 
snd_soc_sst_
 ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi 
snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine elan_i2c 
snd_hda_codec_hdmi mei_hdcp iTCO_wdt intel_powerclamp intel_pmc_bxt ee1004 
intel_rapl_msr iTCO_vendor
 _support joydev pcspkr intel_wmi_thunderbolt wmi_bmof thunderbolt 
ucsi_acpi idma64 typec_ucsi snd_hda_codec_realtek typec snd_hda_codec_generic 
snd_hda_intel snd_intel_dspcfg snd_hda_codec thinkpad_acpi snd_hda_core 
ledtrig_audio int3403_
 thermal snd_hwdep snd_seq snd_seq_device snd_pcm iwlwifi snd_timer 
processor_thermal_device mei_me cfg80211 intel_rapl_common snd e1000e mei 
int3400_thermal int340x_thermal_zone i2c_i801 acpi_thermal_rel soundcore 
intel_soc_dts_iosf i2c_s
 mbus rfkill intel_pch_thermal xenfs
  ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt nouveau 
rtsx_pci_sdmmc mmc_core mxm_wmi crct10dif_pclmul ttm crc32_pclmul crc32c_intel 
i915 ghash_clmulni_intel i2c_algo_bit serio_raw nvme drm_kms_helper cec 
xhci_pci nvme
 _core rtsx_pci xhci_pci_renesas drm xhci_hcd wmi video pinctrl_cannonlake 
pinctrl_intel xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev 
xen_evtchn uinput
 ---[ end trace f8d47e4aa6724df4 ]---
 RIP: e030:nvme_map_data+0x300/0x3a0 [nvme]
 Code: b8 fe ff ff e9 a8 fe ff ff 4c 8b 56 68 8b 5e 70 8b 76 74 49 8b 02 48 c1 e8 
33 83 e0 07 83 f8 04 0f 85 f2 fe ff ff 49 8b 42 08 <83> b8 d0 00 00 00 04 0f 85 
e1 fe ff ff e9 38 fd ff ff 8b 55 70 be
 RSP: e02b:c900010e7ad8 EFLAGS: 00010246
 RAX: dead0100 RBX: 1000 RCX: 8881a58f5000
 RDX: 1000 RSI:  RDI: 8881a679e000
 RBP: 8881a5ef4c80 R08: 8881a5ef4c80 R09: 0002
 R10: ea0003dfff40 R11: 0008 R12: 8881a679e000
 R13: c900010e7b20 R14: 8881a70b5980 R15: 8881a679e000
 FS:  () GS:8881b544() 
knlGS:
 CS:  e030 DS:  ES:  CR0: 80050033
 CR2: 01d64408 CR3: 0001aa2c CR4: 00050660
 Kernel panic - not syncing: Fatal exception
 Kernel Offset: disabled

Discussion at 
https://lore.kernel.org/xen-devel/20201205082839.ts3ju6yta46cgwjn@Air-de-Roger/T

Cc: sta...@vger.kernel.org #v5.9+
(for 5.9 it's easier to revert the original commit directly)

Re: [PATCH v2 04/12] x86/xen: drop USERGS_SYSRET64 paravirt call

2020-12-02 Thread Jürgen Groß

On 02.12.20 13:32, Borislav Petkov wrote:

On Fri, Nov 20, 2020 at 12:46:22PM +0100, Juergen Gross wrote:

@@ -123,12 +115,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, 
SYM_L_GLOBAL)
 * Try to use SYSRET instead of IRET if we're returning to
 * a completely clean 64-bit userspace context.  If we're not,
 * go to the slow exit path.
+* In the Xen PV case we must use iret anyway.
 */
-   movq    RCX(%rsp), %rcx
-   movq    RIP(%rsp), %r11
  
-	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */

-   jne swapgs_restore_regs_and_return_to_usermode
+   ALTERNATIVE __stringify( \
+   movq    RCX(%rsp), %rcx; \
+   movq    RIP(%rsp), %r11; \
+   cmpq    %rcx, %r11; /* SYSRET requires RCX == RIP */ \
+   jne swapgs_restore_regs_and_return_to_usermode), \
+   "jmp   swapgs_restore_regs_and_return_to_usermode", 
X86_FEATURE_XENPV


Why such a big ALTERNATIVE when you can simply do:

 /*
  * Try to use SYSRET instead of IRET if we're returning to
  * a completely clean 64-bit userspace context.  If we're not,
  * go to the slow exit path.
  * In the Xen PV case we must use iret anyway.
  */
 ALTERNATIVE "", "jmp swapgs_restore_regs_and_return_to_usermode", 
X86_FEATURE_XENPV

 movq    RCX(%rsp), %rcx;
 movq    RIP(%rsp), %r11;
 cmpq    %rcx, %r11; /* SYSRET requires RCX == RIP */
 jne swapgs_restore_regs_and_return_to_usermode

?



I wanted to avoid the additional NOPs for the bare metal case.

If you don't mind them I can do as you are suggesting.


Juergen




Re: [PATCH] xen: remove trailing semicolon in macro definition

2020-12-02 Thread Jürgen Groß

On 27.11.20 17:07, t...@redhat.com wrote:

From: Tom Rix 

The macro use will already have a semicolon.

Signed-off-by: Tom Rix 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-22 Thread Jürgen Groß

On 22.11.20 22:44, Andy Lutomirski wrote:

On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:


On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+	if (!arch_irqs_disabled_flags(flags))
+		arch_local_irq_enable();
+}


If someone were to write horrible code like:

   local_irq_disable();
   local_irq_save(flags);
   local_irq_enable();
   local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
   WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?


I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
like a perfect recipe for include dependency hell.

We could use a plain asm("ud2") instead.


How about out-of-lining it:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore(void);
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags)) {
		arch_local_irq_enable();
	} else {
#ifdef CONFIG_DEBUG_ENTRY
		if (unlikely(arch_local_save_flags() & X86_EFLAGS_IF))
			warn_bogus_irqrestore();
#endif
	}
}



This couldn't be a WARN_ON_ONCE() then (or it would be a catch all).
Another approach might be to open-code the WARN_ON_ONCE(), like:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore(bool *once);
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags))
		arch_local_irq_enable();
#ifdef CONFIG_DEBUG_ENTRY
	else {
		static bool once;

		if (unlikely(arch_local_save_flags() & X86_EFLAGS_IF))
			warn_bogus_irqrestore(&once);
	}
#endif
}


Juergen




Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-21 Thread Jürgen Groß

On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+   if (!arch_irqs_disabled_flags(flags))
+   arch_local_irq_enable();
+}


If someone were to write horrible code like:

local_irq_disable();
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?


I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
like a perfect recipe for include dependency hell.

We could use a plain asm("ud2") instead.


Juergen




Re: [PATCH v2 08/12] x86/paravirt: remove no longer needed 32-bit pvops cruft

2020-11-20 Thread Jürgen Groß

On 20.11.20 13:08, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:26PM +0100, Juergen Gross wrote:

+#define PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr, ...)	\
 ({									\
 	PVOP_CALL_ARGS;							\
 	PVOP_TEST_NULL(op);						\
+	BUILD_BUG_ON(sizeof(rettype) > sizeof(unsigned long));		\
+	asm volatile(paravirt_alt(PARAVIRT_CALL)			\
+		     : call_clbr, ASM_CALL_CONSTRAINT			\
+		     : paravirt_type(op),				\
+		       paravirt_clobber(clbr),				\
+		       ##__VA_ARGS__					\
+		     : "memory", "cc" extra_clbr);			\
+	(rettype)(__eax & PVOP_RETMASK(rettype));			\
 })


This is now very similar to PVOP_VCALL() (note how PVOP_CALL_ARGS is
PVOP_VCALL_ARGS).

Could we get away with doing something horrible like:

#define PVOP_VCALL(X...) (void)PVOP_CALL(long, X)

?


Oh, indeed. And in patch 9 the same could be done for the ALT variants.


Juergen




Re: [PATCH v2 06/12] x86/paravirt: switch time pvops functions to use static_call()

2020-11-20 Thread Jürgen Groß

On 20.11.20 13:01, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:24PM +0100, Juergen Gross wrote:

The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires to split out the time interfaces
into a new header file.


There's also this patch floating around; just in case that would come in
handy:

   https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org



Ah, yes. This would make life much easier.


Juergen




Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-20 Thread Jürgen Groß

On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+   if (!arch_irqs_disabled_flags(flags))
+   arch_local_irq_enable();
+}


If someone were to write horrible code like:

local_irq_disable();
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?



I'd be fine with that. I didn't add something like that because I
couldn't find a suitable CONFIG_ :-)


Juergen




  1   2   3   >