Hello community, here is the log from the commit of package xen for openSUSE:Leap:15.2 checked in at 2020-04-21 19:06:36 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Comparing /work/SRC/openSUSE:Leap:15.2/xen (Old) and /work/SRC/openSUSE:Leap:15.2/.xen.new.2738 (New) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "xen" Tue Apr 21 19:06:36 2020 rev:85 rq:795132 version:4.13.0_12 Changes: -------- --- /work/SRC/openSUSE:Leap:15.2/xen/xen.changes 2020-03-31 07:21:35.758377870 +0200 +++ /work/SRC/openSUSE:Leap:15.2/.xen.new.2738/xen.changes 2020-04-21 19:08:03.992055245 +0200 @@ -1,0 +2,44 @@ +Tue Apr 14 11:06:08 MDT 2020 - carn...@suse.com + +- bsc#1169392 - VUL-0: CVE-2020-11742: xen: Bad continuation + handling in GNTTABOP_copy (XSA-318) + 5e95afb8-gnttab-fix-GNTTABOP_copy-continuation-handling.patch + +------------------------------------------------------------------- +Mon Apr 6 12:01:45 MDT 2020 - carn...@suse.com + +- bsc#1168140 - VUL-0: CVE-2020-11740, CVE-2020-11741: xen: XSA-313 + multiple xenoprof issues + 5e95ad61-xenoprof-clear-buffer-intended-to-be-shared-with-guests.patch + 5e95ad8f-xenoprof-limit-consumption-of-shared-buffer-data.patch +- bsc#1168142 - VUL-0: CVE-2020-11739: xen: XSA-314 - Missing + memory barriers in read-write unlock paths + 5e95ae77-Add-missing-memory-barrier-in-the-unlock-path-of-rwlock.patch +- bsc#1168143 - VUL-0: CVE-2020-11743: xen: XSA-316 - Bad error + path in GNTTABOP_map_grant + 5e95af5e-xen-gnttab-Fix-error-path-in-map_grant_ref.patch +- bsc#1167152 - L3: Xenstored Crashed during VM install Need Core + analyzed + 5e876b0f-tools-xenstore-fix-use-after-free-in-xenstored.patch +- bsc#1165206 - Xen 4.12 DomU hang / freeze / stall / NMI watchdog + bug soft lockup CPU #0 stuck under high load / upstream with + workaround. 
See also bsc#1134506 + 5e86f7b7-credit2-avoid-vCPUs-with-lower-creds-than-idle.patch + 5e86f7fd-credit2-fix-credit-too-few-resets.patch +- Drop for upstream solution (bsc#1165206) + 01-xen-credit2-avoid-vcpus-to.patch + default-to-credit1-scheduler.patch +- Upstream bug fixes (bsc#1027519) + 5e4ec20e-x86-virtualise-MSR_PLATFORM_ID-properly.patch + 5e5e7188-fix-error-path-in-cpupool_unassign_cpu_start.patch + 5e6f53dd-AMD-IOMMU-fix-off-by-one-get_paging_mode.patch + 5e7a371c-sched-fix-cpu-onlining-with-core-sched.patch + 5e7c90cf-sched-fix-cpu-offlining-with-core-sched.patch + 5e7cfb29-x86-ucode-AMD-fix-assert-in-compare_patch.patch + 5e7cfb29-x86-ucode-fix-error-paths-in-apply_microcode.patch + 5e7dd83b-libx86-CPUID-fix-not-just-leaf-7.patch + 5e7dfbf6-x86-ucode-AMD-potential-buffer-overrun-equiv-tab.patch + 5e846cce-x86-HVM-fix-AMD-ECS-handling-for-Fam10.patch + 5e84905c-x86-ucode-AMD-fix-more-potential-buffer-overruns.patch + +------------------------------------------------------------------- Old: ---- 01-xen-credit2-avoid-vcpus-to.patch default-to-credit1-scheduler.patch New: ---- 5e4ec20e-x86-virtualise-MSR_PLATFORM_ID-properly.patch 5e5e7188-fix-error-path-in-cpupool_unassign_cpu_start.patch 5e6f53dd-AMD-IOMMU-fix-off-by-one-get_paging_mode.patch 5e7a371c-sched-fix-cpu-onlining-with-core-sched.patch 5e7c90cf-sched-fix-cpu-offlining-with-core-sched.patch 5e7cfb29-x86-ucode-AMD-fix-assert-in-compare_patch.patch 5e7cfb29-x86-ucode-fix-error-paths-in-apply_microcode.patch 5e7dd83b-libx86-CPUID-fix-not-just-leaf-7.patch 5e7dfbf6-x86-ucode-AMD-potential-buffer-overrun-equiv-tab.patch 5e846cce-x86-HVM-fix-AMD-ECS-handling-for-Fam10.patch 5e84905c-x86-ucode-AMD-fix-more-potential-buffer-overruns.patch 5e86f7b7-credit2-avoid-vCPUs-with-lower-creds-than-idle.patch 5e86f7fd-credit2-fix-credit-too-few-resets.patch 5e876b0f-tools-xenstore-fix-use-after-free-in-xenstored.patch 5e95ad61-xenoprof-clear-buffer-intended-to-be-shared-with-guests.patch 
5e95ad8f-xenoprof-limit-consumption-of-shared-buffer-data.patch 5e95ae77-Add-missing-memory-barrier-in-the-unlock-path-of-rwlock.patch 5e95af5e-xen-gnttab-Fix-error-path-in-map_grant_ref.patch 5e95afb8-gnttab-fix-GNTTABOP_copy-continuation-handling.patch ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ xen.spec ++++++ --- /var/tmp/diff_new_pack.fJyATU/_old 2020-04-21 19:08:06.032059398 +0200 +++ /var/tmp/diff_new_pack.fJyATU/_new 2020-04-21 19:08:06.036059407 +0200 @@ -1,7 +1,7 @@ # # spec file for package xen # -# Copyright (c) 2020 SUSE LLC +# Copyright (c) 2020 SUSE LINUX GmbH, Nuernberg, Germany. # # All modifications and additions to the file contributed by third parties # remain the property of their copyright owners, unless otherwise agreed @@ -127,7 +127,7 @@ BuildRequires: pesign-obs-integration %endif -Version: 4.13.0_11 +Version: 4.13.0_12 Release: 0 Summary: Xen Virtualization: Hypervisor (aka VMM aka Microkernel) License: GPL-2.0-only @@ -183,6 +183,25 @@ Patch15: 5e4c00ef-VT-d-check-full-RMRR-for-E820-reserved.patch Patch16: 5e4d4f5b-sched-fix-get_cpu_idle_time-with-core-sched.patch Patch17: 5e4e614d-x86-spec-ctrl-no-xen-also-disables-branch-hardening.patch +Patch18: 5e4ec20e-x86-virtualise-MSR_PLATFORM_ID-properly.patch +Patch19: 5e5e7188-fix-error-path-in-cpupool_unassign_cpu_start.patch +Patch20: 5e6f53dd-AMD-IOMMU-fix-off-by-one-get_paging_mode.patch +Patch21: 5e7a371c-sched-fix-cpu-onlining-with-core-sched.patch +Patch22: 5e7c90cf-sched-fix-cpu-offlining-with-core-sched.patch +Patch23: 5e7cfb29-x86-ucode-AMD-fix-assert-in-compare_patch.patch +Patch24: 5e7cfb29-x86-ucode-fix-error-paths-in-apply_microcode.patch +Patch25: 5e7dd83b-libx86-CPUID-fix-not-just-leaf-7.patch +Patch26: 5e7dfbf6-x86-ucode-AMD-potential-buffer-overrun-equiv-tab.patch +Patch27: 5e846cce-x86-HVM-fix-AMD-ECS-handling-for-Fam10.patch +Patch28: 5e84905c-x86-ucode-AMD-fix-more-potential-buffer-overruns.patch 
+Patch29: 5e86f7b7-credit2-avoid-vCPUs-with-lower-creds-than-idle.patch +Patch30: 5e86f7fd-credit2-fix-credit-too-few-resets.patch +Patch31: 5e876b0f-tools-xenstore-fix-use-after-free-in-xenstored.patch +Patch32: 5e95ad61-xenoprof-clear-buffer-intended-to-be-shared-with-guests.patch +Patch33: 5e95ad8f-xenoprof-limit-consumption-of-shared-buffer-data.patch +Patch34: 5e95ae77-Add-missing-memory-barrier-in-the-unlock-path-of-rwlock.patch +Patch35: 5e95af5e-xen-gnttab-Fix-error-path-in-map_grant_ref.patch +Patch36: 5e95afb8-gnttab-fix-GNTTABOP_copy-continuation-handling.patch # Our platform specific patches Patch400: xen-destdir.patch Patch401: vif-bridge-no-iptables.patch @@ -194,13 +213,11 @@ Patch407: replace-obsolete-network-configuration-commands-in-s.patch Patch408: disable-building-pv-shim.patch Patch409: xenstore-launch.patch -Patch410: default-to-credit1-scheduler.patch # Needs to go upstream Patch420: suspend_evtchn_lock.patch Patch422: stubdom-have-iovec.patch Patch423: vif-route.patch Patch424: gcc10-fixes.patch -Patch425: 01-xen-credit2-avoid-vcpus-to.patch # Other bug fixes or features Patch451: xenconsole-no-multiple-connections.patch Patch452: hibernate.patch @@ -428,6 +445,25 @@ %patch15 -p1 %patch16 -p1 %patch17 -p1 +%patch18 -p1 +%patch19 -p1 +%patch20 -p1 +%patch21 -p1 +%patch22 -p1 +%patch23 -p1 +%patch24 -p1 +%patch25 -p1 +%patch26 -p1 +%patch27 -p1 +%patch28 -p1 +%patch29 -p1 +%patch30 -p1 +%patch31 -p1 +%patch32 -p1 +%patch33 -p1 +%patch34 -p1 +%patch35 -p1 +%patch36 -p1 # Our platform specific patches %patch400 -p1 %patch401 -p1 @@ -439,13 +475,11 @@ %patch407 -p1 %patch408 -p1 %patch409 -p1 -%patch410 -p1 # Needs to go upstream %patch420 -p1 %patch422 -p1 %patch423 -p1 %patch424 -p1 -%patch425 -p1 # Other bug fixes or features %patch451 -p1 %patch452 -p1 ++++++ 5e4ec20e-x86-virtualise-MSR_PLATFORM_ID-properly.patch ++++++ # Commit 691265f96097d4fe3e46ff4267451d49b30143e6 # Date 2020-02-20 17:29:50 +0000 # Author Andrew Cooper 
<andrew.coop...@citrix.com> # Committer Andrew Cooper <andrew.coop...@citrix.com> x86/msr: Virtualise MSR_PLATFORM_ID properly This is an Intel-only, read-only MSR related to microcode loading. Expose it in similar circumstances as the PATCHLEVEL MSR. This should have been alongside c/s 013896cb8b2 "x86/msr: Fix handling of MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV" Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Jan Beulich <jbeul...@suse.com> --- a/xen/arch/x86/msr.c +++ b/xen/arch/x86/msr.c @@ -139,6 +139,13 @@ int guest_rdmsr(struct vcpu *v, uint32_t /* Not offered to guests. */ goto gp_fault; + case MSR_IA32_PLATFORM_ID: + if ( !(cp->x86_vendor & X86_VENDOR_INTEL) || + !(boot_cpu_data.x86_vendor & X86_VENDOR_INTEL) ) + goto gp_fault; + rdmsrl(MSR_IA32_PLATFORM_ID, *val); + break; + case MSR_AMD_PATCHLEVEL: BUILD_BUG_ON(MSR_IA32_UCODE_REV != MSR_AMD_PATCHLEVEL); /* @@ -271,6 +278,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t { uint64_t rsvd; + case MSR_IA32_PLATFORM_ID: case MSR_INTEL_CORE_THREAD_COUNT: case MSR_INTEL_PLATFORM_INFO: case MSR_ARCH_CAPABILITIES: ++++++ 5e5e7188-fix-error-path-in-cpupool_unassign_cpu_start.patch ++++++ # Commit 98ed1f43cc2c89efd38deed1035dba5b1ced5d45 # Date 2020-03-03 16:02:32 +0100 # Author Juergen Gross <jgr...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> sched: fix error path in cpupool_unassign_cpu_start() In case moving away all domains from the cpu to be removed is failing in cpupool_unassign_cpu_start() the error path is missing to release sched_res_rculock. The normal exit path is releasing domlist_read_lock instead (this is currently no problem as the reference to the specific rcu lock is not used by rcu_read_unlock()). While at it indent the present error label by one space. 
Reported-by: Igor Druzhinin <igor.druzhi...@citrix.com> Signed-off-by: Juergen Gross <jgr...@suse.com> Reviewed-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Dario Faggioli <dfaggi...@suse.com> --- a/xen/common/cpupool.c +++ b/xen/common/cpupool.c @@ -465,7 +465,7 @@ static int cpupool_unassign_cpu_start(st } rcu_read_unlock(&domlist_read_lock); if ( ret ) - goto out; + goto out_rcu; } cpupool_moving_cpu = cpu; atomic_inc(&c->refcnt); @@ -473,8 +473,9 @@ static int cpupool_unassign_cpu_start(st cpumask_andnot(c->cpu_valid, c->cpu_valid, cpus); cpumask_and(c->res_valid, c->cpu_valid, &sched_res_mask); - rcu_read_unlock(&domlist_read_lock); -out: + out_rcu: + rcu_read_unlock(&sched_res_rculock); + out: spin_unlock(&cpupool_lock); return ret; ++++++ 5e6f53dd-AMD-IOMMU-fix-off-by-one-get_paging_mode.patch ++++++ # Commit b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d # Date 2020-03-16 11:24:29 +0100 # Author Jan Beulich <jbeul...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers amd_iommu_get_paging_mode() expects a count, not a "maximum possible" value. Prior to b4f042236ae0 dropping the reference, the use of our mis-named "max_page" in amd_iommu_domain_init() may have led to such a misunderstanding. In an attempt to avoid such confusion in the future, rename the function's parameter and - while at it - convert it to an inline function. Also replace a literal 4 by an expression tying it to a wider use constant, just like amd_iommu_quarantine_init() does.
Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain") Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") Signed-off-by: Jan Beulich <jbeul...@suse.com> Acked-by: Andrew Cooper <andrew.coop...@citrix.com> --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -455,9 +455,9 @@ int amd_iommu_reserve_domain_unity_map(s int __init amd_iommu_quarantine_init(struct domain *d) { struct domain_iommu *hd = dom_iommu(d); - unsigned long max_gfn = - PFN_DOWN((1ul << DEFAULT_DOMAIN_ADDRESS_WIDTH) - 1); - unsigned int level = amd_iommu_get_paging_mode(max_gfn); + unsigned long end_gfn = + 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT); + unsigned int level = amd_iommu_get_paging_mode(end_gfn); struct amd_iommu_pte *table; if ( hd->arch.root_table ) --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -231,22 +231,6 @@ static int __must_check allocate_domain_ return rc; } -int amd_iommu_get_paging_mode(unsigned long entries) -{ - int level = 1; - - BUG_ON( !entries ); - - while ( entries > PTE_PER_TABLE_SIZE ) - { - entries = PTE_PER_TABLE_ALIGN(entries) >> PTE_PER_TABLE_SHIFT; - if ( ++level > 6 ) - return -ENOMEM; - } - - return level; -} - static int amd_iommu_domain_init(struct domain *d) { struct domain_iommu *hd = dom_iommu(d); @@ -259,8 +243,10 @@ static int amd_iommu_domain_init(struct * physical address space we give it, but this isn't known yet so use 4 * unilaterally. */ - hd->arch.paging_mode = is_hvm_domain(d) - ? 4 : amd_iommu_get_paging_mode(get_upper_mfn_bound()); + hd->arch.paging_mode = amd_iommu_get_paging_mode( + is_hvm_domain(d) + ? 
1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT) + : get_upper_mfn_bound() + 1); return 0; } --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h @@ -54,7 +54,6 @@ int amd_iommu_init_late(void); int amd_iommu_update_ivrs_mapping_acpi(void); int iov_adjust_irq_affinities(void); -int amd_iommu_get_paging_mode(unsigned long entries); int amd_iommu_quarantine_init(struct domain *d); /* mapping functions */ @@ -177,6 +176,22 @@ static inline unsigned long region_to_pa return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT; } +static inline int amd_iommu_get_paging_mode(unsigned long max_frames) +{ + int level = 1; + + BUG_ON(!max_frames); + + while ( max_frames > PTE_PER_TABLE_SIZE ) + { + max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT; + if ( ++level > 6 ) + return -ENOMEM; + } + + return level; +} + static inline struct page_info* alloc_amd_iommu_pgtable(void) { struct page_info *pg; ++++++ 5e7a371c-sched-fix-cpu-onlining-with-core-sched.patch ++++++ # Commit 4c7d340f75abc64f131b0f9bffd6d66d72e43528 # Date 2020-03-24 17:36:44 +0100 # Author Juergen Gross <jgr...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> sched: fix onlining cpu with core scheduling active When onlining a cpu cpupool_cpu_add() checks whether all siblings of the new cpu are free in order to decide whether to add it to cpupool0. In case the added cpu is not the last sibling to be onlined this test is wrong as it only checks for all online siblings to be free. The test should include the check for the number of siblings having reached the scheduling granularity of cpupool0, too. 
Signed-off-by: Juergen Gross <jgr...@suse.com> Reviewed-by: Dario Faggioli <dfaggi...@suse.com> --- a/xen/common/cpupool.c +++ b/xen/common/cpupool.c @@ -615,7 +615,8 @@ static int cpupool_cpu_add(unsigned int get_sched_res(cpu)->cpupool = NULL; cpus = sched_get_opt_cpumask(cpupool0->gran, cpu); - if ( cpumask_subset(cpus, &cpupool_free_cpus) ) + if ( cpumask_subset(cpus, &cpupool_free_cpus) && + cpumask_weight(cpus) == cpupool_get_granularity(cpupool0) ) ret = cpupool_assign_cpu_locked(cpupool0, cpu); rcu_read_unlock(&sched_res_rculock); ++++++ 5e7c90cf-sched-fix-cpu-offlining-with-core-sched.patch ++++++ # Commit b6f5334aeaca133ad47ade06f4d22baf5984b55d # Date 2020-03-26 12:23:59 +0100 # Author Juergen Gross <jgr...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> sched: fix cpu offlining with core scheduling Offlining a cpu with core scheduling active can result in a hanging system. The reason is that the scheduling resource and unit of the cpu to be removed need to be split in order to remove the cpu from its cpupool and move it to the idle scheduler. In case one of the involved cpus happens to have received a sched slave event due to a vcpu formerly running on that cpu being woken up again, it can happen that this cpu will enter sched_wait_rendezvous_in() while its scheduling resource is just about to be split. It might wait forever for the other sibling to join, which will never happen due to the resources already being modified.
This can easily be avoided by: - resetting the rendezvous counters of the idle unit which is kept - checking for a new scheduling resource in sched_wait_rendezvous_in() after reacquiring the scheduling lock and resetting the counters in that case without scheduling another vcpu - moving schedule resource modifications (in schedule_cpu_rm()) and retrieving (schedule(), sched_slave() is fine already, others are not critical) into locked regions Reported-by: Igor Druzhinin <igor.druzhi...@citrix.com> Signed-off-by: Juergen Gross <jgr...@suse.com> Reviewed-by: Dario Faggioli <dfaggi...@suse.com> --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -2112,6 +2112,10 @@ void sched_context_switched(struct vcpu rcu_read_unlock(&sched_res_rculock); } +/* + * Switch to a new context or keep the current one running. + * On x86 it won't return, so it needs to drop the still held sched_res_rculock. + */ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext, bool reset_idle_unit, s_time_t now) { @@ -2221,6 +2225,9 @@ static struct vcpu *sched_force_context_ * zero do_schedule() is called and the rendezvous counter for leaving * context_switch() is set. All other members will wait until the counter is * becoming zero, dropping the schedule lock in between. + * Either returns the new unit to run, or NULL if no context switch is + * required or (on Arm) has already been performed. If NULL is returned + * sched_res_rculock has been dropped. 
*/ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev, spinlock_t **lock, int cpu, @@ -2228,7 +2235,8 @@ static struct sched_unit *sched_wait_ren { struct sched_unit *next; struct vcpu *v; - unsigned int gran = get_sched_res(cpu)->granularity; + struct sched_resource *sr = get_sched_res(cpu); + unsigned int gran = sr->granularity; if ( !--prev->rendezvous_in_cnt ) { @@ -2295,6 +2303,21 @@ static struct sched_unit *sched_wait_ren atomic_set(&prev->next_task->rendezvous_out_cnt, 0); prev->rendezvous_in_cnt = 0; } + + /* + * Check for scheduling resource switched. This happens when we are + * moved away from our cpupool and cpus are subject of the idle + * scheduler now. + */ + if ( unlikely(sr != get_sched_res(cpu)) ) + { + ASSERT(is_idle_unit(prev)); + atomic_set(&prev->next_task->rendezvous_out_cnt, 0); + prev->rendezvous_in_cnt = 0; + pcpu_schedule_unlock_irq(*lock, cpu); + rcu_read_unlock(&sched_res_rculock); + return NULL; + } } return prev->next_task; @@ -2380,11 +2403,11 @@ static void schedule(void) rcu_read_lock(&sched_res_rculock); + lock = pcpu_schedule_lock_irq(cpu); + sr = get_sched_res(cpu); gran = sr->granularity; - lock = pcpu_schedule_lock_irq(cpu); - if ( prev->rendezvous_in_cnt ) { /* @@ -2965,7 +2988,10 @@ int schedule_cpu_rm(unsigned int cpu) per_cpu(sched_res_idx, cpu_iter) = 0; if ( cpu_iter == cpu ) { - idle_vcpu[cpu_iter]->sched_unit->priv = NULL; + unit = idle_vcpu[cpu_iter]->sched_unit; + unit->priv = NULL; + atomic_set(&unit->next_task->rendezvous_out_cnt, 0); + unit->rendezvous_in_cnt = 0; } else { @@ -2996,6 +3022,8 @@ int schedule_cpu_rm(unsigned int cpu) } sr->scheduler = &sched_idle_ops; sr->sched_priv = NULL; + sr->granularity = 1; + sr->cpupool = NULL; smp_mb(); sr->schedule_lock = &sched_free_cpu_lock; @@ -3008,9 +3036,6 @@ int schedule_cpu_rm(unsigned int cpu) sched_free_udata(old_ops, vpriv_old); sched_free_pdata(old_ops, ppriv_old, cpu); - sr->granularity = 1; - sr->cpupool = NULL; - out: 
rcu_read_unlock(&sched_res_rculock); xfree(sr_new); ++++++ 5e7cfb29-x86-ucode-AMD-fix-assert-in-compare_patch.patch ++++++ # Commit 13ed5d49a4214dc3521d4af7bfcf13fbcf5bfd63 # Date 2020-03-26 18:57:45 +0000 # Author Andrew Cooper <andrew.coop...@citrix.com> # Committer Andrew Cooper <andrew.coop...@citrix.com> x86/ucode/amd: Fix assertion in compare_patch() This is clearly a typo. Fixes: 9da23943ccd "microcode: introduce a global cache of ucode patch" Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Wei Liu <w...@xen.org> --- a/xen/arch/x86/microcode_amd.c +++ b/xen/arch/x86/microcode_amd.c @@ -222,7 +222,7 @@ static enum microcode_match_result compa /* Both patches to compare are supposed to be applicable to local CPU. */ ASSERT(microcode_fits(new->mc_amd) != MIS_UCODE); - ASSERT(microcode_fits(new->mc_amd) != MIS_UCODE); + ASSERT(microcode_fits(old->mc_amd) != MIS_UCODE); return compare_header(new_header, old_header); } ++++++ 5e7cfb29-x86-ucode-fix-error-paths-in-apply_microcode.patch ++++++ # Commit d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4 # Date 2020-03-26 18:57:45 +0000 # Author Andrew Cooper <andrew.coop...@citrix.com> # Committer Andrew Cooper <andrew.coop...@citrix.com> x86/ucode: Fix error paths in apply_microcode() In the unlikely case that patch application completes, but the resulting revision isn't expected, sig->rev doesn't get updated to match reality. It will get adjusted the next time collect_cpu_info() gets called, but in the meantime Xen might operate on a stale value. Nothing good will come of this. Rewrite the logic to always update the stashed revision, before worrying about whether the attempt was a success or failure. Take the opportunity to make the printk() messages as consistent as possible.
Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Wei Liu <w...@xen.org> --- a/xen/arch/x86/microcode_amd.c +++ b/xen/arch/x86/microcode_amd.c @@ -229,11 +229,11 @@ static enum microcode_match_result compa static int apply_microcode(const struct microcode_patch *patch) { - uint32_t rev; int hw_err; unsigned int cpu = smp_processor_id(); struct cpu_signature *sig = &per_cpu(cpu_sig, cpu); const struct microcode_header_amd *hdr; + uint32_t rev, old_rev = sig->rev; if ( !patch ) return -ENOENT; @@ -249,6 +249,7 @@ static int apply_microcode(const struct /* get patch id after patching */ rdmsrl(MSR_AMD_PATCHLEVEL, rev); + sig->rev = rev; /* * Some processors leave the ucode blob mapping as UC after the update. @@ -259,15 +260,14 @@ static int apply_microcode(const struct /* check current patch id and patch's id for match */ if ( hw_err || (rev != hdr->patch_id) ) { - printk(KERN_ERR "microcode: CPU%d update from revision " - "%#x to %#x failed\n", cpu, rev, hdr->patch_id); + printk(XENLOG_ERR + "microcode: CPU%u update rev %#x to %#x failed, result %#x\n", + cpu, old_rev, hdr->patch_id, rev); return -EIO; } - printk(KERN_WARNING "microcode: CPU%d updated from revision %#x to %#x\n", - cpu, sig->rev, hdr->patch_id); - - sig->rev = rev; + printk(XENLOG_WARNING "microcode: CPU%u updated from revision %#x to %#x\n", + cpu, old_rev, rev); return 0; } --- a/xen/arch/x86/microcode_intel.c +++ b/xen/arch/x86/microcode_intel.c @@ -285,10 +285,10 @@ static enum microcode_match_result compa static int apply_microcode(const struct microcode_patch *patch) { uint64_t msr_content; - unsigned int val[2]; - unsigned int cpu_num = raw_smp_processor_id(); + unsigned int cpu = smp_processor_id(); struct cpu_signature *sig = &this_cpu(cpu_sig); const struct microcode_intel *mc_intel; + uint32_t rev, old_rev = sig->rev; if ( !patch ) return -ENOENT; @@ -309,20 +309,20 @@ static int apply_microcode(const struct /* get the current revision from MSR 0x8B */ 
rdmsrl(MSR_IA32_UCODE_REV, msr_content); - val[1] = (uint32_t)(msr_content >> 32); + sig->rev = rev = msr_content >> 32; - if ( val[1] != mc_intel->hdr.rev ) + if ( rev != mc_intel->hdr.rev ) { - printk(KERN_ERR "microcode: CPU%d update from revision " - "%#x to %#x failed. Resulting revision is %#x.\n", cpu_num, - sig->rev, mc_intel->hdr.rev, val[1]); + printk(XENLOG_ERR + "microcode: CPU%u update rev %#x to %#x failed, result %#x\n", + cpu, old_rev, mc_intel->hdr.rev, rev); return -EIO; } - printk(KERN_INFO "microcode: CPU%d updated from revision " - "%#x to %#x, date = %04x-%02x-%02x\n", - cpu_num, sig->rev, val[1], mc_intel->hdr.year, + + printk(XENLOG_WARNING + "microcode: CPU%u updated from revision %#x to %#x, date = %04x-%02x-%02x\n", + cpu, old_rev, rev, mc_intel->hdr.year, mc_intel->hdr.month, mc_intel->hdr.day); - sig->rev = val[1]; return 0; } ++++++ 5e7dd83b-libx86-CPUID-fix-not-just-leaf-7.patch ++++++ # Commit eb0bad81fceb3e81df5f73441771b49b732edf56 # Date 2020-03-27 11:40:59 +0100 # Author Jan Beulich <jbeul...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> libx86/CPUID: fix (not just) leaf 7 processing For one, subleaves within the respective union shouldn't live in separate sub-structures. And then x86_cpuid_policy_fill_native() should, as it did originally, iterate over all subleaves here as well as over all main leaves. Switch to using a "<= MIN()"-based approach similar to that used in x86_cpuid_copy_to_buffer(). Also follow this for the extended main leaves then. Fixes: 1bd2b750537b ("libx86: Fix 32bit stubdom build of x86_cpuid_policy_fill_native()") Fixes: 97e4ebdcd765 ("x86/CPUID: support leaf 7 subleaf 1 / AVX512_BF16") Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com> --- a/xen/include/xen/lib/x86/cpuid.h +++ b/xen/include/xen/lib/x86/cpuid.h @@ -181,8 +181,7 @@ struct cpuid_policy uint32_t _7d0; struct { DECL_BITFIELD(7d0); }; }; - }; - struct { + /* Subleaf 1. 
*/ union { uint32_t _7a1; --- a/xen/lib/x86/cpuid.c +++ b/xen/lib/x86/cpuid.c @@ -71,8 +71,8 @@ void x86_cpuid_policy_fill_native(struct unsigned int i; cpuid_leaf(0, &p->basic.raw[0]); - for ( i = 1; i < min_t(unsigned int, ARRAY_SIZE(p->basic.raw), - p->basic.max_leaf); ++i ) + for ( i = 1; i <= MIN(p->basic.max_leaf, + ARRAY_SIZE(p->basic.raw) - 1); ++i ) { switch ( i ) { @@ -116,8 +116,8 @@ void x86_cpuid_policy_fill_native(struct { cpuid_count_leaf(7, 0, &p->feat.raw[0]); - for ( i = 1; i < min_t(unsigned int, ARRAY_SIZE(p->feat.raw), - p->feat.max_subleaf); ++i ) + for ( i = 1; i <= MIN(p->feat.max_subleaf, + ARRAY_SIZE(p->feat.raw) - 1); ++i ) cpuid_count_leaf(7, i, &p->feat.raw[i]); } @@ -172,8 +172,8 @@ void x86_cpuid_policy_fill_native(struct /* Extended leaves. */ cpuid_leaf(0x80000000, &p->extd.raw[0]); - for ( i = 1; i < min_t(unsigned int, ARRAY_SIZE(p->extd.raw), - p->extd.max_leaf + 1 - 0x80000000); ++i ) + for ( i = 1; i <= MIN(p->extd.max_leaf & 0xffffU, + ARRAY_SIZE(p->extd.raw) - 1); ++i ) cpuid_leaf(0x80000000 + i, &p->extd.raw[i]); x86_cpuid_policy_recalc_synth(p); ++++++ 5e7dfbf6-x86-ucode-AMD-potential-buffer-overrun-equiv-tab.patch ++++++ # Commit 1f97b6b9f1b5978659c5735954c37c130e7bb151 # Date 2020-03-27 13:13:26 +0000 # Author Andrew Cooper <andrew.coop...@citrix.com> # Committer Andrew Cooper <andrew.coop...@citrix.com> x86/ucode/amd: Fix potential buffer overrun with equiv table handling find_equiv_cpu_id() loops until it finds a 0 installed_cpu entry. Well-formed AMD microcode containers have this property. Extend the checking in install_equiv_cpu_table() to reject tables which don't have a sentinel at the end.
Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Jan Beulich <jbeul...@suse.com> --- a/xen/arch/x86/microcode_amd.c +++ b/xen/arch/x86/microcode_amd.c @@ -321,6 +321,7 @@ static int install_equiv_cpu_table( size_t *offset) { const struct mpbhdr *mpbuf = data + *offset + 4; + const struct equiv_cpu_entry *eq; *offset += mpbuf->len + CONT_HDR_SIZE; /* add header length */ @@ -330,7 +331,9 @@ static int install_equiv_cpu_table( return -EINVAL; } - if ( mpbuf->len == 0 ) + if ( mpbuf->len == 0 || mpbuf->len % sizeof(*eq) || + (eq = (const void *)mpbuf->data, + eq[(mpbuf->len / sizeof(*eq)) - 1].installed_cpu) ) { printk(KERN_ERR "microcode: Wrong microcode equivalent cpu table length\n"); return -EINVAL; ++++++ 5e846cce-x86-HVM-fix-AMD-ECS-handling-for-Fam10.patch ++++++ # Commit 5d515b1c296ebad6889748ea1e49e063453216a3 # Date 2020-04-01 12:28:30 +0200 # Author Jan Beulich <jbeul...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> x86/HVM: fix AMD ECS handling for Fam10 The involved comparison was, very likely inadvertently, converted from >= to > when making changes unrelated to the actual family range. 
Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information") Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Paul Durrant <p...@xen.org> --- a/xen/arch/x86/hvm/ioreq.c +++ b/xen/arch/x86/hvm/ioreq.c @@ -1283,7 +1283,7 @@ struct hvm_ioreq_server *hvm_select_iore if ( CF8_ADDR_HI(cf8) && d->arch.cpuid->x86_vendor == X86_VENDOR_AMD && (x86_fam = get_cpu_family( - d->arch.cpuid->basic.raw_fms, NULL, NULL)) > 0x10 && + d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 && x86_fam < 0x17 ) { uint64_t msr_val; ++++++ 5e84905c-x86-ucode-AMD-fix-more-potential-buffer-overruns.patch ++++++ # Commit 718d1432000079ea7120f6cb770372afe707ce27 # Date 2020-04-01 14:00:12 +0100 # Author Andrew Cooper <andrew.coop...@citrix.com> # Committer Andrew Cooper <andrew.coop...@citrix.com> x86/ucode/amd: Fix more potential buffer overruns with microcode parsing cpu_request_microcode() doesn't know the buffer is at least 4 bytes long before inspecting UCODE_MAGIC. install_equiv_cpu_table() doesn't know the boundary of the buffer it is interpreting as an equivalency table. This case was clearly observed at one point in the past, given the subsequent overrun detection, but without comprehending that the damage was already done. Make the logic consistent with container_fast_forward() and pass size_left in to install_equiv_cpu_table(). 
Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Jan Beulich <jbeul...@suse.com> --- a/xen/arch/x86/microcode_amd.c +++ b/xen/arch/x86/microcode_amd.c @@ -318,11 +318,20 @@ static int get_ucode_from_buffer_amd( static int install_equiv_cpu_table( struct microcode_amd *mc_amd, const void *data, + size_t size_left, size_t *offset) { - const struct mpbhdr *mpbuf = data + *offset + 4; + const struct mpbhdr *mpbuf; const struct equiv_cpu_entry *eq; + if ( size_left < (sizeof(*mpbuf) + 4) || + (mpbuf = data + *offset + 4, + size_left - sizeof(*mpbuf) - 4 < mpbuf->len) ) + { + printk(XENLOG_WARNING "microcode: No space for equivalent cpu table\n"); + return -EINVAL; + } + *offset += mpbuf->len + CONT_HDR_SIZE; /* add header length */ if ( mpbuf->type != UCODE_EQUIV_CPU_TABLE_TYPE ) @@ -436,7 +445,7 @@ static struct microcode_patch *cpu_reque current_cpu_id = cpuid_eax(0x00000001); - if ( *(const uint32_t *)buf != UCODE_MAGIC ) + if ( bufsize < 4 || *(const uint32_t *)buf != UCODE_MAGIC ) { printk(KERN_ERR "microcode: Wrong microcode patch file magic\n"); error = -EINVAL; @@ -466,24 +475,13 @@ static struct microcode_patch *cpu_reque */ while ( offset < bufsize ) { - error = install_equiv_cpu_table(mc_amd, buf, &offset); + error = install_equiv_cpu_table(mc_amd, buf, bufsize - offset, &offset); if ( error ) { printk(KERN_ERR "microcode: installing equivalent cpu table failed\n"); break; } - /* - * Could happen as we advance 'offset' early - * in install_equiv_cpu_table - */ - if ( offset > bufsize ) - { - printk(KERN_ERR "microcode: Microcode buffer overrun\n"); - error = -EINVAL; - break; - } - if ( find_equiv_cpu_id(mc_amd->equiv_cpu_table, current_cpu_id, &equiv_cpu_id) ) break; ++++++ 5e86f7b7-credit2-avoid-vCPUs-with-lower-creds-than-idle.patch ++++++ # Commit 36f3662f27dec32d76c0edb4c6b62b9628d6869d # Date 2020-04-03 10:45:43 +0200 # Author Dario Faggioli <dfaggi...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> credit2: avoid vCPUs to 
ever reach lower credits than idle There have been report of stalls of guest vCPUs, when Credit2 was used. It seemed like these vCPUs were not getting scheduled for very long time, even under light load conditions (e.g., during dom0 boot). Investigations led to the discovery that --although rarely-- it can happen that a vCPU manages to run for very long timeslices. In Credit2, this means that, when runtime accounting happens, the vCPU will lose a large quantity of credits. This in turn may lead to the vCPU having less credits than the idle vCPUs (-2^30). At this point, the scheduler will pick the idle vCPU, instead of the ready to run vCPU, for a few "epochs", which often times is enough for the guest kernel to think the vCPU is not responding and crashing. An example of this situation is shown here. In fact, we can see d0v1 sitting in the runqueue while all the CPUs are idle, as it has -1254238270 credits, which is smaller than -2^30 = −1073741824: (XEN) Runqueue 0: (XEN) ncpus = 28 (XEN) cpus = 0-27 (XEN) max_weight = 256 (XEN) pick_bias = 22 (XEN) instload = 1 (XEN) aveload = 293391 (~111%) (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff [...] (XEN) Runqueue 0: (XEN) CPU[00] runq=0, sibling=00,..., core=00,... (XEN) CPU[01] runq=0, sibling=00,..., core=00,... [...] (XEN) CPU[26] runq=0, sibling=00,..., core=00,... (XEN) CPU[27] runq=0, sibling=00,..., core=00,... (XEN) RUNQ: (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) We certainly don't want, under any circumstance, this to happen. Let's, therefore, define a minimum amount of credits a vCPU can have. During accounting, we make sure that, for however long the vCPU has run, it will never get to have less than such minimum amount of credits. Then, we set the credits of the idle vCPU to an even smaller value. 
NOTE: investigations have been done about _how_ it is possible for a vCPU to execute for so long that its credits become so low. While still not completely clear, there is evidence that: - it only happens very rarely, - it appears to be both machine and workload specific, - it does not look to be a Credit2-specific issue (e.g., it happens when running with Credit1 as well), or a scheduler issue. This patch makes Credit2 more robust to events like this, whatever the cause is, and should hence be backported (as far as possible). Reported-by: Glen <glenbar...@gmail.com> Reported-by: Tomas Mozes <hydrapo...@gmail.com> Signed-off-by: Dario Faggioli <dfaggi...@suse.com> Reviewed-by: George Dunlap <george.dun...@citrix.com> --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -55,7 +55,7 @@ 0x00022204 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_add 0x00022205 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle_check [ dom:vcpu = 0x%(1)08x, credit = %(2)d, score = %(3)d ] 0x00022206 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle [ cpu = %(1)d ] -0x00022207 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_reset [ dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d, mult = %(4)d ] +0x00022207 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_reset [ dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d ] 0x00022208 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:sched_tasklet 0x00022209 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:update_load 0x0002220a CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_assign [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ] --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -7718,13 +7718,12 @@ void sched_process(struct pcpu_info *p) struct { unsigned int vcpuid:16, domid:16; int credit_start, credit_end; - unsigned int multiplier; } *r = (typeof(r))ri->d; printf(" %s csched2:reset_credits d%uv%u, " - "credit_start = %d, credit_end = %d, mult = %u\n", + "credit_start = %d, credit_end = %d\n", ri->dump_header, r->domid, r->vcpuid,
- r->credit_start, r->credit_end, r->multiplier); + r->credit_start, r->credit_end); } break; case TRC_SCHED_CLASS_EVT(CSCHED2, 8): /* SCHED_TASKLET */ --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -228,11 +228,21 @@ */ #define CSCHED2_CREDIT_INIT MILLISECS(10) /* + * Minimum amount of credits VMs can have. Ideally, no VM would get + * close to this (unless a vCPU manages to execute for really long + * time uninterrupted). In case it happens, it makes no sense to + * track even deeper undershoots. + * + * NOTE: If making this smaller than -CSCHED2_CREDIT_INIT, adjust + * reset_credit() accordingly. + */ +#define CSCHED2_CREDIT_MIN (-CSCHED2_CREDIT_INIT) +/* * Amount of credit the idle units have. It never changes, as idle * units does not consume credits, and it must be lower than whatever * amount of credit 'regular' unit would end up with. */ -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) +#define CSCHED2_IDLE_CREDIT (CSCHED2_CREDIT_MIN-1) /* * Carryover: How much "extra" credit may be carried over after * a reset. @@ -776,10 +786,15 @@ static int get_fallback_cpu(struct csche static void t2c_update(struct csched2_runqueue_data *rqd, s_time_t time, struct csched2_unit *svc) { - uint64_t val = time * rqd->max_weight + svc->residual; + int64_t val = time * rqd->max_weight + svc->residual; svc->residual = do_div(val, svc->weight); - svc->credit -= val; + /* Getting to lower credit than CSCHED2_CREDIT_MIN makes no sense. */ + val = svc->credit - val; + if ( unlikely(val < CSCHED2_CREDIT_MIN) ) + svc->credit = CSCHED2_CREDIT_MIN; + else + svc->credit = val; } static s_time_t c2t(struct csched2_runqueue_data *rqd, s_time_t credit, struct csched2_unit *svc) @@ -1620,28 +1635,25 @@ static void reset_credit(const struct sc { struct csched2_runqueue_data *rqd = c2rqd(ops, cpu); struct list_head *iter; - int m; + int reset = CSCHED2_CREDIT_INIT; /* * Under normal circumstances, snext->credit should never be less * than -CSCHED2_MIN_TIMER. 
However, under some circumstances, an * unit with low credits may be allowed to run long enough that - * its credits are actually less than -CSCHED2_CREDIT_INIT. + * its credits are actually much lower than that. * (Instances have been observed, for example, where an unit with * 200us of credit was allowed to run for 11ms, giving it -10.8ms * of credit. Thus it was still negative even after the reset.) * * If this is the case for snext, we simply want to keep moving - * everyone up until it is in the black again. This fair because - * none of the other units want to run at the moment. - * - * Rather than looping, however, we just calculate a multiplier, - * avoiding an integer division and multiplication in the common - * case. + * everyone up until it is in the black again. This means that, + * since CSCHED2_CREDIT_MIN is -CSCHED2_CREDIT_INIT, we need to + * actually add 2*CSCHED2_CREDIT_INIT. */ - m = 1; - if ( snext->credit < -CSCHED2_CREDIT_INIT ) - m += (-snext->credit) / CSCHED2_CREDIT_INIT; + ASSERT(snext->credit >= CSCHED2_CREDIT_MIN); + if ( unlikely(snext->credit == CSCHED2_CREDIT_MIN) ) + reset += CSCHED2_CREDIT_INIT; list_for_each( iter, &rqd->svc ) { @@ -1672,15 +1684,7 @@ static void reset_credit(const struct sc } start_credit = svc->credit; - - /* - * Add INIT * m, avoiding integer multiplication in the common case. 
- */ - if ( likely(m==1) ) - svc->credit += CSCHED2_CREDIT_INIT; - else - svc->credit += m * CSCHED2_CREDIT_INIT; - + svc->credit += reset; /* "Clip" credits to max carryover */ if ( svc->credit > CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX ) svc->credit = CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX; @@ -1692,19 +1696,18 @@ static void reset_credit(const struct sc struct { unsigned unit:16, dom:16; int credit_start, credit_end; - unsigned multiplier; } d; d.dom = svc->unit->domain->domain_id; d.unit = svc->unit->unit_id; d.credit_start = start_credit; d.credit_end = svc->credit; - d.multiplier = m; __trace_var(TRC_CSCHED2_CREDIT_RESET, 1, sizeof(d), (unsigned char *)&d); } } + ASSERT(snext->credit > 0); SCHED_STAT_CRANK(credit_reset); /* No need to resort runqueue, as everyone's order should be the same. */ ++++++ 5e86f7fd-credit2-fix-credit-too-few-resets.patch ++++++ # Commit dae7b62e976b28af9c8efa150618c25501bf1650 # Date 2020-04-03 10:46:53 +0200 # Author Dario Faggioli <dfaggi...@suse.com> # Committer Jan Beulich <jbeul...@suse.com> credit2: fix credit reset happening too few times There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset credit on reset condition"). In fact, the aim of that commit was to make sure that we do not perform too many credit reset operations (which are not super cheap, and in an hot-path). But the check used to determine whether a reset is necessary was the wrong one. In fact, knowing just that some vCPUs have been skipped, while traversing the runqueue (in runq_candidate()), is not enough. We need to check explicitly whether the first vCPU in the runqueue has a negative amount of credit. Since a trace record is changed, this patch updates xentrace format file and xenalyze as well This should be backported. 
Signed-off-by: Dario Faggioli <dfaggi...@suse.com> Acked-by: George Dunlap <george.dun...@citrix.com> --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -67,7 +67,7 @@ 0x00022210 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:load_check [ lrq_id[16]:orq_id[16] = 0x%(1)08x, delta = %(2)d ] 0x00022211 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:load_balance [ l_bavgload = 0x%(2)08x%(1)08x, o_bavgload = 0x%(4)08x%(3)08x, lrq_id[16]:orq_id[16] = 0x%(5)08x ] 0x00022212 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:pick_cpu [ b_avgload = 0x%(2)08x%(1)08x, dom:vcpu = 0x%(3)08x, rq_id[16]:new_cpu[16] = %(4)d ] -0x00022213 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(4)d, skipped_vcpus = %(3)d, tickled_cpu = %(2)d ] +0x00022213 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(3)d, tickled_cpu = %(2)d ] 0x00022214 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:schedule [ rq:cpu = 0x%(1)08x, tasklet[8]:idle[8]:smt_idle[8]:tickled[8] = %(2)08x ] 0x00022215 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:ratelimit [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ] 0x00022216 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_cand_chk [ dom:vcpu = 0x%(1)08x ] --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -7855,14 +7855,12 @@ void sched_process(struct pcpu_info *p) if (opt.dump_all) { struct { unsigned vcpuid:16, domid:16; - unsigned tickled_cpu, skipped; + unsigned tickled_cpu; int credit; } *r = (typeof(r))ri->d; - printf(" %s csched2:runq_candidate d%uv%u, credit = %d, " - "%u vcpus skipped, ", - ri->dump_header, r->domid, r->vcpuid, - r->credit, r->skipped); + printf(" %s csched2:runq_candidate d%uv%u, credit = %d, ", + ri->dump_header, r->domid, r->vcpuid, r->credit); if (r->tickled_cpu == (unsigned)-1) printf("no cpu was tickled\n"); else --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -3267,8 +3267,7 @@ csched2_runtime(const struct scheduler * static struct 
csched2_unit * runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_unit *scurr, - int cpu, s_time_t now, - unsigned int *skipped) + int cpu, s_time_t now) { struct list_head *iter, *temp; struct sched_resource *sr = get_sched_res(cpu); @@ -3276,8 +3275,6 @@ runq_candidate(struct csched2_runqueue_d struct csched2_private *prv = csched2_priv(sr->scheduler); bool yield = false, soft_aff_preempt = false; - *skipped = 0; - if ( unlikely(is_idle_unit(scurr->unit)) ) { snext = scurr; @@ -3371,12 +3368,9 @@ runq_candidate(struct csched2_runqueue_d (unsigned char *)&d); } - /* Only consider units that are allowed to run on this processor. */ + /* Only consider vcpus that are allowed to run on this processor. */ if ( !cpumask_test_cpu(cpu, svc->unit->cpu_hard_affinity) ) - { - (*skipped)++; continue; - } /* * If an unit is meant to be picked up by another processor, and such @@ -3385,7 +3379,6 @@ runq_candidate(struct csched2_runqueue_d if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu && cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) ) { - (*skipped)++; SCHED_STAT_CRANK(deferred_to_tickled_cpu); continue; } @@ -3397,7 +3390,6 @@ runq_candidate(struct csched2_runqueue_d if ( sched_unit_master(svc->unit) != cpu && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit ) { - (*skipped)++; SCHED_STAT_CRANK(migrate_resisted); continue; } @@ -3421,14 +3413,13 @@ runq_candidate(struct csched2_runqueue_d { struct { unsigned unit:16, dom:16; - unsigned tickled_cpu, skipped; + unsigned tickled_cpu; int credit; } d; d.dom = snext->unit->domain->domain_id; d.unit = snext->unit->unit_id; d.credit = snext->credit; d.tickled_cpu = snext->tickled_cpu; - d.skipped = *skipped; __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1, sizeof(d), (unsigned char *)&d); @@ -3460,7 +3451,6 @@ static void csched2_schedule( struct csched2_runqueue_data *rqd; struct csched2_unit * const scurr = csched2_unit(currunit); struct csched2_unit *snext = NULL; - unsigned int skipped_units = 0; bool 
tickled; bool migrated = false; @@ -3538,7 +3528,7 @@ static void csched2_schedule( snext = csched2_unit(sched_idle_unit(sched_cpu)); } else - snext = runq_candidate(rqd, scurr, sched_cpu, now, &skipped_units); + snext = runq_candidate(rqd, scurr, sched_cpu, now); /* If switching from a non-idle runnable unit, put it * back on the runqueue. */ @@ -3550,6 +3540,8 @@ static void csched2_schedule( /* Accounting for non-idle tasks */ if ( !is_idle_unit(snext->unit) ) { + int top_credit; + /* If switching, remove this from the runqueue and mark it scheduled */ if ( snext != scurr ) { @@ -3577,11 +3569,15 @@ static void csched2_schedule( * 2) no other unit with higher credits wants to run. * * Here, where we want to check for reset, we need to make sure the - * proper unit is being used. In fact, runqueue_candidate() may have - * not returned the first unit in the runqueue, for various reasons + * proper unit is being used. In fact, runq_candidate() may have not + * returned the first unit in the runqueue, for various reasons * (e.g., affinity). Only trigger a reset when it does. */ - if ( skipped_units == 0 && snext->credit <= CSCHED2_CREDIT_RESET ) + if ( list_empty(&rqd->runq) ) + top_credit = snext->credit; + else + top_credit = max(snext->credit, runq_elem(rqd->runq.next)->credit); + if ( top_credit <= CSCHED2_CREDIT_RESET ) { reset_credit(ops, sched_cpu, now, snext); balance_load(ops, sched_cpu, now); ++++++ 5e876b0f-tools-xenstore-fix-use-after-free-in-xenstored.patch ++++++ Subject: tools/xenstore: fix a use after free problem in xenstored From: Juergen Gross jgr...@suse.com Fri Apr 3 13:03:40 2020 +0100 Date: Fri Apr 3 17:57:51 2020 +0100: Git: bb2a34fd740e9a26be9e2244f1a5b4cef439e5a8 Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object twice") introduced a potential use after free problem in domain_cleanup(): after calling talloc_unlink() for domain->conn domain->conn is set to NULL. 
The problem is that domain is registered as talloc child of domain->conn, so it might be freed by the talloc_unlink() call. With Xenstore being single threaded there are normally no concurrent memory allocations running and freeing a virtual memory area normally doesn't result in that area no longer being accessible. A problem could occur only in case either a signal received results in some memory allocation done in the signal handler (SIGHUP is a primary candidate leading to reopening the log file), or in case the talloc framework would do some internal memory allocation during freeing of the memory (which would lead to clobbering of the freed domain structure). Fixes: 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object twice") Signed-off-by: Juergen Gross <jgr...@suse.com> Reviewed-by: Julien Grall <jgr...@amazon.com> diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c index baddaba5df..5858185211 100644 --- a/tools/xenstore/xenstored_domain.c +++ b/tools/xenstore/xenstored_domain.c @@ -214,6 +214,7 @@ static void domain_cleanup(void) { xc_dominfo_t dominfo; struct domain *domain; + struct connection *conn; int notify = 0; again: @@ -230,8 +231,10 @@ static void domain_cleanup(void) continue; } if (domain->conn) { - talloc_unlink(talloc_autofree_context(), domain->conn); + /* domain is a talloc child of domain->conn. 
*/ + conn = domain->conn; domain->conn = NULL; + talloc_unlink(talloc_autofree_context(), conn); notify = 0; /* destroy_domain() fires the watch */ goto again; } } ++++++ 5e95ad61-xenoprof-clear-buffer-intended-to-be-shared-with-guests.patch ++++++ Subject: xenoprof: clear buffer intended to be shared with guests From: Jan Beulich jbeul...@suse.com Tue Apr 14 14:32:33 2020 +0200 Date: Tue Apr 14 14:32:33 2020 +0200: Git: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen internally used allocations, but buffers allocated to be shared with (unprivileged) guests need to be zapped of their prior content. This is part of XSA-313. Reported-by: Ilja Van Sprundel <ivansprun...@ioactive.com> Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Wei Liu <w...@xen.org> diff --git a/xen/common/xenoprof.c b/xen/common/xenoprof.c index 4f3e799ebb..4d909fd5d6 100644 --- a/xen/common/xenoprof.c +++ b/xen/common/xenoprof.c @@ -253,6 +253,9 @@ static int alloc_xenoprof_struct( return -ENOMEM; } + for ( i = 0; i < npages; ++i ) + clear_page(d->xenoprof->rawbuf + i * PAGE_SIZE); + d->xenoprof->npages = npages; d->xenoprof->nbuf = nvcpu; d->xenoprof->bufsize = bufsize; ++++++ 5e95ad8f-xenoprof-limit-consumption-of-shared-buffer-data.patch ++++++ Subject: xenoprof: limit consumption of shared buffer data From: Jan Beulich jbeul...@suse.com Tue Apr 14 14:33:19 2020 +0200 Date: Tue Apr 14 14:33:19 2020 +0200: Git: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f Since a shared buffer can be written to by the guest, we may only read the head and tail pointers from there (all other fields should only ever be written to). Furthermore, for any particular operation the two values must be read exactly once, with both checks and consumption happening with the thus read values.
(The backtrace related xenoprof_buf_space() use in xenoprof_log_event() is an exception: The values used there get re-checked by every subsequent xenoprof_add_sample().) Since that code needed touching, also fix the double increment of the lost samples count in case the backtrace related xenoprof_add_sample() invocation in xenoprof_log_event() fails. Where code is being touched anyway, add const as appropriate, but take the opportunity to entirely drop the now unused domain parameter of xenoprof_buf_space(). This is part of XSA-313. Reported-by: Ilja Van Sprundel <ivansprun...@ioactive.com> Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: George Dunlap <george.dun...@citrix.com> Reviewed-by: Wei Liu <w...@xen.org> diff --git a/xen/common/xenoprof.c b/xen/common/xenoprof.c index 4d909fd5d6..b04726cb49 100644 --- a/xen/common/xenoprof.c +++ b/xen/common/xenoprof.c @@ -479,25 +479,22 @@ static int add_passive_list(XEN_GUEST_HANDLE_PARAM(void) arg) /* Get space in the buffer */ -static int xenoprof_buf_space(struct domain *d, xenoprof_buf_t * buf, int size) +static int xenoprof_buf_space(int head, int tail, int size) { - int head, tail; - - head = xenoprof_buf(d, buf, event_head); - tail = xenoprof_buf(d, buf, event_tail); - return ((tail > head) ? 0 : size) + tail - head - 1; } /* Check for space and add a sample. Return 1 if successful, 0 otherwise. 
*/ -static int xenoprof_add_sample(struct domain *d, xenoprof_buf_t *buf, +static int xenoprof_add_sample(const struct domain *d, + const struct xenoprof_vcpu *v, uint64_t eip, int mode, int event) { + xenoprof_buf_t *buf = v->buffer; int head, tail, size; head = xenoprof_buf(d, buf, event_head); tail = xenoprof_buf(d, buf, event_tail); - size = xenoprof_buf(d, buf, event_size); + size = v->event_size; /* make sure indexes in shared buffer are sane */ if ( (head < 0) || (head >= size) || (tail < 0) || (tail >= size) ) @@ -506,7 +503,7 @@ static int xenoprof_add_sample(struct domain *d, xenoprof_buf_t *buf, return 0; } - if ( xenoprof_buf_space(d, buf, size) > 0 ) + if ( xenoprof_buf_space(head, tail, size) > 0 ) { xenoprof_buf(d, buf, event_log[head].eip) = eip; xenoprof_buf(d, buf, event_log[head].mode) = mode; @@ -530,7 +527,6 @@ static int xenoprof_add_sample(struct domain *d, xenoprof_buf_t *buf, int xenoprof_add_trace(struct vcpu *vcpu, uint64_t pc, int mode) { struct domain *d = vcpu->domain; - xenoprof_buf_t *buf = d->xenoprof->vcpu[vcpu->vcpu_id].buffer; /* Do not accidentally write an escape code due to a broken frame. */ if ( pc == XENOPROF_ESCAPE_CODE ) @@ -539,7 +535,8 @@ int xenoprof_add_trace(struct vcpu *vcpu, uint64_t pc, int mode) return 0; } - return xenoprof_add_sample(d, buf, pc, mode, 0); + return xenoprof_add_sample(d, &d->xenoprof->vcpu[vcpu->vcpu_id], + pc, mode, 0); } void xenoprof_log_event(struct vcpu *vcpu, const struct cpu_user_regs *regs, @@ -570,17 +567,22 @@ void xenoprof_log_event(struct vcpu *vcpu, const struct cpu_user_regs *regs, /* Provide backtrace if requested. 
*/ if ( backtrace_depth > 0 ) { - if ( (xenoprof_buf_space(d, buf, v->event_size) < 2) || - !xenoprof_add_sample(d, buf, XENOPROF_ESCAPE_CODE, mode, - XENOPROF_TRACE_BEGIN) ) + if ( xenoprof_buf_space(xenoprof_buf(d, buf, event_head), + xenoprof_buf(d, buf, event_tail), + v->event_size) < 2 ) { xenoprof_buf(d, buf, lost_samples)++; lost_samples++; return; } + + /* xenoprof_add_sample() will increment lost_samples on failure */ + if ( !xenoprof_add_sample(d, v, XENOPROF_ESCAPE_CODE, mode, + XENOPROF_TRACE_BEGIN) ) + return; } - if ( xenoprof_add_sample(d, buf, pc, mode, event) ) + if ( xenoprof_add_sample(d, v, pc, mode, event) ) { if ( is_active(vcpu->domain) ) active_samples++; diff --git a/xen/include/xen/xenoprof.h b/xen/include/xen/xenoprof.h index 8f045abce6..f1f9446bd5 100644 --- a/xen/include/xen/xenoprof.h +++ b/xen/include/xen/xenoprof.h @@ -61,12 +61,12 @@ struct xenoprof { #ifndef CONFIG_COMPAT #define XENOPROF_COMPAT(x) 0 -#define xenoprof_buf(d, b, field) ((b)->field) +#define xenoprof_buf(d, b, field) ACCESS_ONCE((b)->field) #else #define XENOPROF_COMPAT(x) ((x)->is_compat) -#define xenoprof_buf(d, b, field) (*(!(d)->xenoprof->is_compat ? \ - &(b)->native.field : \ - &(b)->compat.field)) +#define xenoprof_buf(d, b, field) ACCESS_ONCE(*(!(d)->xenoprof->is_compat \ + ? &(b)->native.field \ + : &(b)->compat.field)) #endif struct domain; ++++++ 5e95ae77-Add-missing-memory-barrier-in-the-unlock-path-of-rwlock.patch ++++++ Subject: xen/rwlock: Add missing memory barrier in the unlock path of rwlock From: Julien Grall jgr...@amazon.com Thu Feb 20 20:54:40 2020 +0000 Date: Tue Apr 14 14:37:11 2020 +0200: Git: 6890a04072e664c25447a297fe663b45ecfd6398 The rwlock unlock paths are using atomic_sub() to release the lock. However the implementation of atomic_sub() rightfully doesn't contain a memory barrier. On Arm, this means a processor is allowed to re-order the memory access with the preceeding access. 
In other words, the unlock may be seen by another processor before all the memory accesses within the "critical" section. The rwlock paths already contains barrier indirectly, but they are not very useful without the counterpart in the unlock paths. The memory barriers are not necessary on x86 because loads/stores are not re-ordered with lock instructions. So add arch_lock_release_barrier() in the unlock paths that will only add memory barrier on Arm. Take the opportunity to document each lock paths explaining why a barrier is not necessary. This is XSA-314. Signed-off-by: Julien Grall <jgr...@amazon.com> Acked-by: Jan Beulich <jbeul...@suse.com> --- a/xen/include/xen/rwlock.h +++ b/xen/include/xen/rwlock.h @@ -48,6 +48,10 @@ static inline int _read_trylock(rwlock_t if ( likely(!(cnts & _QW_WMASK)) ) { cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts); + /* + * atomic_add_return() is a full barrier so no need for an + * arch_lock_acquire_barrier(). + */ if ( likely(!(cnts & _QW_WMASK)) ) return 1; atomic_sub(_QR_BIAS, &lock->cnts); @@ -64,11 +68,19 @@ static inline void _read_lock(rwlock_t * u32 cnts; cnts = atomic_add_return(_QR_BIAS, &lock->cnts); + /* + * atomic_add_return() is a full barrier so no need for an + * arch_lock_acquire_barrier(). + */ if ( likely(!(cnts & _QW_WMASK)) ) return; /* The slowpath will decrement the reader count, if necessary. */ queue_read_lock_slowpath(lock); + /* + * queue_read_lock_slowpath() is using spinlock and therefore is a + * full barrier. So no need for an arch_lock_acquire_barrier(). + */ } static inline void _read_lock_irq(rwlock_t *lock) @@ -92,6 +104,7 @@ static inline unsigned long _read_lock_i */ static inline void _read_unlock(rwlock_t *lock) { + arch_lock_release_barrier(); /* * Atomically decrement the reader count */ @@ -121,11 +134,20 @@ static inline int _rw_is_locked(rwlock_t */ static inline void _write_lock(rwlock_t *lock) { - /* Optimize for the unfair lock case where the fair flag is 0. 
*/ + /* + * Optimize for the unfair lock case where the fair flag is 0. + * + * atomic_cmpxchg() is a full barrier so no need for an + * arch_lock_acquire_barrier(). + */ if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0 ) return; queue_write_lock_slowpath(lock); + /* + * queue_write_lock_slowpath() is using spinlock and therefore is a + * full barrier. So no need for an arch_lock_acquire_barrier(). + */ } static inline void _write_lock_irq(rwlock_t *lock) @@ -157,11 +179,16 @@ static inline int _write_trylock(rwlock_ if ( unlikely(cnts) ) return 0; + /* + * atomic_cmpxchg() is a full barrier so no need for an + * arch_lock_acquire_barrier(). + */ return likely(atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0); } static inline void _write_unlock(rwlock_t *lock) { + arch_lock_release_barrier(); /* * If the writer field is atomic, it can be cleared directly. * Otherwise, an atomic subtraction will be used to clear it. ++++++ 5e95af5e-xen-gnttab-Fix-error-path-in-map_grant_ref.patch ++++++ Subject: xen/gnttab: Fix error path in map_grant_ref() From: Ross Lagerwall ross.lagerw...@citrix.com Tue Apr 14 14:41:02 2020 +0200 Date: Tue Apr 14 14:41:02 2020 +0200: Git: da0c66c8f48042a0186799014af69db0303b1da5 Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets, changing the logic. If the _set_status() call fails, the grant_map hypercall would fail with a status of 1 (rc != GNTST_okay) instead of the expected negative GNTST_* error. This error path can be taken due to bad guest state, and causes net/blk-back in Linux to crash. This is XSA-316. 
Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com> Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com> Reviewed-by: Julien Grall <jgr...@amazon.com> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c index 9fd6e60416..4b5344dc21 100644 --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1032,7 +1032,7 @@ map_grant_ref( { if ( (rc = _set_status(shah, status, rd, rgt->gt_version, act, op->flags & GNTMAP_readonly, 1, - ld->domain_id) != GNTST_okay) ) + ld->domain_id)) != GNTST_okay ) goto act_release_out; if ( !act->pin ) ++++++ 5e95afb8-gnttab-fix-GNTTABOP_copy-continuation-handling.patch ++++++ Subject: gnttab: fix GNTTABOP_copy continuation handling From: Jan Beulich jbeul...@suse.com Tue Apr 14 14:42:32 2020 +0200 Date: Tue Apr 14 14:42:32 2020 +0200: Git: d6f22d5d9e8d6848ec229083ac9fb044f0adea93 The XSA-226 fix was flawed - the backwards transformation on rc was done too early, causing a continuation to not get invoked when the need for preemption was determined at the very first iteration of the request. This in particular means that all of the status fields of the individual operations would be left untouched, i.e. set to whatever the caller may or may not have initialized them to. This is part of XSA-318. 
Reported-by: Pawel Wieczorkiewicz <wipa...@amazon.de> Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Juergen Gross <jgr...@suse.com> Tested-by: Pawel Wieczorkiewicz <wipa...@amazon.de> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c index 4b5344dc21..96080b3dec 100644 --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -3583,8 +3583,7 @@ do_grant_table_op( rc = gnttab_copy(copy, count); if ( rc > 0 ) { - rc = count - rc; - guest_handle_add_offset(copy, rc); + guest_handle_add_offset(copy, count - rc); uop = guest_handle_cast(copy, void); } break; @@ -3651,6 +3650,9 @@ do_grant_table_op( out: if ( rc > 0 || opaque_out != 0 ) { + /* Adjust rc, see gnttab_copy() for why this is needed. */ + if ( cmd == GNTTABOP_copy ) + rc = count - rc; ASSERT(rc < count); ASSERT((opaque_out & GNTTABOP_CMD_MASK) == 0); rc = hypercall_create_continuation(__HYPERVISOR_grant_table_op, "ihi",