Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread

2020-11-10 Thread Thomas Gleixner
On Mon, Nov 09 2020 at 20:59, Ira Weiny wrote:
> On Tue, Nov 10, 2020 at 02:13:56AM +0100, Thomas Gleixner wrote:
> Also, we can convert the new memcpy_*_page() calls to kmap_local() as well.
> [For now my patch just uses kmap_atomic().]
>
> I've not looked at all of the patches in your latest version.  Have you
> included converting any of the kmap() call sites?  I thought you were more
> focused on converting the kmap_atomic() to kmap_local()?

I did not touch any of those yet, but it's a logical consequence to
convert all kmap() instances which are _not_ creating a global mapping
over to it.

Thanks,

tglx

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread

2020-11-09 Thread Thomas Gleixner
Ira,

On Fri, Oct 09 2020 at 12:49, ira weiny wrote:
> From: Ira Weiny 
>
> To correctly support the semantics of kmap() with Kernel protection keys
> (PKS), kmap() may be required to set the protections on multiple
> processors (globally).  Enabling PKS globally can be very expensive
> depending on the requested operation.  Furthermore, enabling a domain
> globally reduces the protection afforded by PKS.
>
> Most kmap() (Approx. 209 of 229) callers use the map within a single thread and
> have no need for the protection domain to be enabled globally.  However, the
> remaining callers do not follow this pattern and, as best I can tell, expect
> the mapping to be 'global' and available to any thread who may access the
> mapping.[1]
>
> We don't anticipate global mappings to pmem; however, in general there is a
> danger in changing the semantics of kmap().  Effectively, this would cause an
> unresolved page fault with little to no information about why the failure
> occurred.
>
> To resolve this a number of options were considered.
>
> 1) Attempt to change all the thread local kmap() calls to kmap_atomic()[2]
> 2) Introduce a flags parameter to kmap() to indicate if the mapping should be
>    global or not
> 3) Change ~20 call sites to 'kmap_global()' to indicate that they require a
>    global enablement of the pages.
> 4) Change ~209 call sites to 'kmap_thread()' to indicate that the mapping is
>    to be used within that thread of execution only
>
> Option 1 is simply not feasible.  Option 2 would require all of the call sites
> of kmap() to change.  Option 3 seems like a good minimal change but there is a
> danger that new code may miss the semantic change of kmap() and not get the
> behavior the developer intended.  Therefore, #4 was chosen.

There is Option #5:

Convert the thread local kmap() invocations to the proposed kmap_local()
interface which is coming along [1].

That solves a couple of issues:

 1) It relieves the current kmap_atomic() usage sites from the implicit
pagefault/preempt disable semantics which apply even when
CONFIG_HIGHMEM is disabled. kmap_local() still can be invoked from
atomic context.

 2) Due to #1 it allows replacing the conditional usage of kmap() and
kmap_atomic() for purely thread local mappings.

 3) It puts the burden on the HIGHMEM inflicted systems.

 4) It is actually more efficient for most of the pure thread local use
cases on HIGHMEM inflicted systems because it avoids the overhead of
the global lock and the potential kmap slot exhaustion. A potential
preemption will be more expensive, but that's not really the case we
want to optimize for.

 5) It solves the RT issue vs. kmap_atomic()

So instead of creating yet another variety of kmap() which is just
scratching the particular PKRS itch, can we please consolidate all of
that on the wider reaching kmap_local() approach?

Thanks,

tglx
 
[1] https://lore.kernel.org/lkml/20201103092712.714480...@linutronix.de/



Re: [PATCH 0/3] Modernize tasklet callback API

2020-07-30 Thread Thomas Gleixner
Kees,

Kees Cook  writes:
> This is the infrastructure changes to prepare the tasklet API for
> conversion to passing the tasklet struct as the callback argument instead
> of an arbitrary unsigned long. The first patch details why this is useful
> (it's the same rationale as the timer_struct changes from a bit ago:
> less abuse during memory corruption attacks, more in line with existing
> ways of doing things in the kernel, save a little space in struct,
> etc). Notably, the existing tasklet API use is much less messy, so there
> is less to clean up.
>
> It's not clear to me which tree this should go through... Greg since it
> starts with a USB clean-up, -tip for timer or interrupt, or if I should
> just carry it. I'm open to suggestions, but if I don't hear otherwise,
> I'll just carry it.
>
> My goal is to have this merged for v5.9-rc1 so that during the v5.10
> development cycle the new API will be available. The entire tree of
> changes is here[1] currently, but to split it up by maintainer the
> infrastructure changes need to be landed first.
>
> Review and Acks appreciated! :)

I'd rather see tasklets vanish from the planet completely, but that's
going to be a daring feat. So, grudgingly:

Acked-by: Thomas Gleixner 



[PATCH v2] staging: most: Use DEFINE_SPINLOCK() instead of struct spinlock

2019-07-26 Thread Thomas Gleixner
From: Sebastian Andrzej Siewior 

For spinlocks the type spinlock_t should be used instead of "struct
spinlock".

Use DEFINE_SPINLOCK() and spare the run-time initialization.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190704153803.12739-5-bige...@linutronix.de
---
V2: Rebase on 5.3-rc1. Massaged change log
---
 drivers/staging/most/net/net.c |3 +--
 drivers/staging/most/video/video.c |3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

--- a/drivers/staging/most/net/net.c
+++ b/drivers/staging/most/net/net.c
@@ -69,7 +69,7 @@ struct net_dev_context {
 
 static struct list_head net_devices = LIST_HEAD_INIT(net_devices);
 static struct mutex probe_disc_mt; /* ch->linked = true, most_nd_open */
-static struct spinlock list_lock; /* list_head, ch->linked = false, dev_hold */
+static DEFINE_SPINLOCK(list_lock); /* list_head, ch->linked = false, dev_hold */
 static struct core_component comp;
 
 static int skb_to_mamac(const struct sk_buff *skb, struct mbo *mbo)
@@ -509,7 +509,6 @@ static int __init most_net_init(void)
 {
int err;
 
-   spin_lock_init(&list_lock);
	mutex_init(&probe_disc_mt);
	err = most_register_component(&comp);
if (err)
--- a/drivers/staging/most/video/video.c
+++ b/drivers/staging/most/video/video.c
@@ -54,7 +54,7 @@ struct comp_fh {
 };
 
 static struct list_head video_devices = LIST_HEAD_INIT(video_devices);
-static struct spinlock list_lock;
+static DEFINE_SPINLOCK(list_lock);
 
 static inline bool data_ready(struct most_video_dev *mdev)
 {
@@ -538,7 +538,6 @@ static int __init comp_init(void)
 {
int err;
 
-   spin_lock_init(&list_lock);
	err = most_register_component(&comp);
if (err)
return err;


RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

2019-07-18 Thread Thomas Gleixner
On Thu, 18 Jul 2019, Dexuan Cui wrote:
> 
> The concept of the "overlay page" seems weird, and frankly speaking, 
> I don't really understand why the Hyper-V guys invented it, but as far
> as this patch here is concerned, I think the patch is safe and it can
> indeed fix the CPU offlining issue I described.

Then this needs a really good explanation in the change log, because
that's really obscure behaviour.

Thanks,

tglx


RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

2019-07-18 Thread Thomas Gleixner
On Thu, 18 Jul 2019, Dexuan Cui wrote:
> > On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > This is the allocation when the CPU is brought online for the first
> > time. So what effect has zeroing at allocation time vs. offlining and
> > potentially receiving IPIs? That allocation is never freed.
> > 
> > Neither the comment nor the changelog make any sense to me.
> > tglx
> 
> That allocation was introduced by the commit
> a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").
> 
> I think it's ok to not free the page when a CPU is offlined: every
> CPU uses only 1 page and CPU offlining is not really a very usual
> operation except for the scenario of hibernation (and suspend-to-memory), 
> where the CPUs are quickly onlined again, when we resume from hibernation.
> IMO Vitaly intentionally decided to not free the page for simplicity of the
> code.
> 
> When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
> VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
> writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
> *always* uses hvp->apic_assist (which is updated by the hypervisor) to
> decide if it needs to write the EOI MSR:
> 
> static void hv_apic_eoi_write(u32 reg, u32 val)
> {
> struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
> 
> if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
> return;
> 
> wrmsr(HV_X64_MSR_EOI, val, 0);
> }
> 
> When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
> 1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
> 2. we finish the remaining work of stopping this CPU;
> 3. this CPU is completely stopped.
> 
> Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
> and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
> every interrupt received, otherwise the hypervisor may not deliver further
> interrupts, which may be needed to stop this CPU completely.
> 
> So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
> "wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
> way is what I do in this patch. Alternatively, we can use the below patch:
> 
> @@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
> local_irq_restore(flags);
> free_page((unsigned long)input_pg);
> 
> -   if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> +   if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> +   local_irq_save(flags);
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
> +   hvp->apic_assist &= ~1;
> +   local_irq_restore(flags);
> +   }
> 
> if (hv_reenlightenment_cb == NULL)
> return 0;
> 
> This second version needs 3+ lines, so I prefer the one-line version. :-)

Those are two different things. The GFP_ZERO allocation makes sense on its
own but it _cannot_ prevent the following scenario:

cpu_init()
  if (!hvp)
 hvp = vmalloc(, GFP_ZERO);
...

hvp->apic_assist |= 1;

#1   cpu_die()
       if ()
          wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

          ---> IPI
                if (!(hvp->apic_assist & 1))
                   wrmsr(APIC_EOI);     <- PATH not taken

#3   cpu is dead

cpu_init()
     if (!hvp)
        hvp = vmalloc(m, GFP_ZERO);  <- NOT TAKEN because hvp != NULL

So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.

Allocating hvp with GFP_ZERO makes sense on its own so the allocated
memory has a defined state, but that's a different story.

The 3 liner patch above makes way more sense and you can spare the
local_irq_save/restore by moving the whole condition into the
irq_save/restore region above.

Thanks,

tglx



Re: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

2019-07-17 Thread Thomas Gleixner
Dexuan,

On Thu, 4 Jul 2019, Dexuan Cui wrote:

> When a CPU is being offlined, the CPU usually still receives a few
> interrupts (e.g. reschedule IPIs), after hv_cpu_die() disables the
> HV_X64_MSR_VP_ASSIST_PAGE, so hv_apic_eoi_write() may not write the EOI
> MSR, if the apic_assist field's bit0 happens to be 1; as a result, Hyper-V
> may not be able to deliver all the interrupts to the CPU, and the CPU may
> not be stopped, and the kernel will hang soon.
> 
> The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> 5.2.1 "GPA Overlay Pages"), so with this fix we're sure the apic_assist
> field is still zero, after the VP ASSIST PAGE is disabled.
> 
> Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
> Signed-off-by: Dexuan Cui 
> ---
>  arch/x86/hyperv/hv_init.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 0e033ef11a9f..db51a301f759 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -60,8 +60,14 @@ static int hv_cpu_init(unsigned int cpu)
>   if (!hv_vp_assist_page)
>   return 0;
>  
> + /*
> +  * The ZERO flag is necessary, because in the case of CPU offlining
> +  * the page can still be used by hv_apic_eoi_write() for a while,
> +  * after the VP ASSIST PAGE is disabled in hv_cpu_die().
> +  */
>   if (!*hvp)
> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
> +  PAGE_KERNEL);

This is the allocation when the CPU is brought online for the first
time. So what effect has zeroing at allocation time vs. offlining and
potentially receiving IPIs? That allocation is never freed.

Neither the comment nor the changelog make any sense to me.

Thanks,

tglx



Re: [PATCHv2] x86/hyperv: Hold cpus_read_lock() on assigning reenlightenment vector

2019-06-27 Thread Thomas Gleixner
On Mon, 17 Jun 2019, Dmitry Safonov wrote:
> @@ -196,7 +196,16 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   /* Make sure callback is registered before we write to MSRs */
>   wmb();
>  
> + /*
> +  * As reenlightenment vector is global, there is no difference which
> +  * CPU will register MSR, though it should be an online CPU.
> +  * hv_cpu_die() callback guarantees that on CPU teardown
> +  * another CPU will re-register MSR back.
> +  */
> + cpus_read_lock();
> + re_ctrl.target_vp = hv_vp_index[raw_smp_processor_id()];
> + wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> + cpus_read_unlock();

Should work

>   wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
>  }
>  EXPORT_SYMBOL_GPL(set_hv_tscchange_cb);
> @@ -239,6 +248,7 @@ static int hv_cpu_die(unsigned int cpu)
>  
>   rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
>   if (re_ctrl.target_vp == hv_vp_index[cpu]) {
> + lockdep_assert_cpus_held();

So you're not trusting the hotplug core code to hold the lock when it
brings a CPU down and invokes that callback? Come on

>   /* Reassign to some other online CPU */
>   new_cpu = cpumask_any_but(cpu_online_mask, cpu);

Thanks,

tglx


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-13 Thread Thomas Gleixner
On Wed, 12 Jun 2019, Vitaly Kuznetsov wrote:
> Dmitry Safonov  writes:
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 1608050e9df9..0bdd79ecbff8 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
> >  static int hv_cpu_init(unsigned int cpu)
> >  {
> > u64 msr_vp_index;
> > -   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> > +   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
> > void **input_arg;
> > struct page *pg;
> >  
> > @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
> >  
> > hv_get_vp_index(msr_vp_index);
> >  
> > -   hv_vp_index[smp_processor_id()] = msr_vp_index;
> > +   hv_vp_index[cpu] = msr_vp_index;
> >  
> > if (msr_vp_index > hv_max_vp_index)
> > hv_max_vp_index = msr_vp_index;
> 
> The above is unrelated cleanup (as cpu == smp_processor_id() for
> CPUHP_AP_ONLINE_DYN callbacks), right? As I'm pretty sure these can't be
> preempted.

They can be preempted, but they are guaranteed to run on the upcoming CPU,
i.e. smp_processor_id() is allowed even in preemptible context as the task
cannot migrate.

Thanks,

tglx


Re: [PATCH] x86/Hyper-V: Fix definition HV_MAX_FLUSH_REP_COUNT

2019-03-20 Thread Thomas Gleixner
On Fri, 15 Mar 2019, Paolo Bonzini wrote:

> On 22/02/19 11:48, lantianyu1...@gmail.com wrote:
> > From: Lan Tianyu 
> > 
> > The max flush rep count of the HvFlushGuestPhysicalAddressList hypercall
> > is equal to how many entries of union hv_gpa_page_range can be populated
> > into the input parameter page. The original code lacks parentheses around
> > PAGE_SIZE - 2 * sizeof(u64). This patch is to fix it.
> > 
> > Cc: 
> > Fixes: cc4edae4b9 ("x86/hyper-v: Add HvFlushGuestAddressList hypercall support")
> > Signed-off-by: Lan Tianyu 
> > ---
> >  arch/x86/include/asm/hyperv-tlfs.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> > b/arch/x86/include/asm/hyperv-tlfs.h
> > index 705dafc2d11a..2bdbbbcfa393 100644
> > --- a/arch/x86/include/asm/hyperv-tlfs.h
> > +++ b/arch/x86/include/asm/hyperv-tlfs.h
> > @@ -841,7 +841,7 @@ union hv_gpa_page_range {
> >   * count is equal with how many entries of union hv_gpa_page_range can
> >   * be populated into the input parameter page.
> >   */
> > -#define HV_MAX_FLUSH_REP_COUNT (PAGE_SIZE - 2 * sizeof(u64) /  \
> > +#define HV_MAX_FLUSH_REP_COUNT ((PAGE_SIZE - 2 * sizeof(u64)) /\
> > sizeof(union hv_gpa_page_range))
> >  
> >  struct hv_guest_mapping_flush_list {
> > 
> 
> Queued for after the merge window, with the fixed "Fixes" tag.

That's upstream already:

  9cd05ad2910b ("x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNT")

Thanks,

tglx
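The precedence bug fixed by that patch is reproducible in plain C: without the parentheses, the division binds tighter than the subtraction, so only `2 * sizeof(u64)` gets divided. The sizes below are stand-ins (4 KiB page, 8-byte u64, 8-byte union), chosen to match the shapes in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096
typedef uint64_t u64;
union gpa_page_range { u64 as_uint64; };	/* 8-byte stand-in */

/* Buggy: parses as PAGE_SIZE - ((2 * sizeof(u64)) / sizeof(union ...)) */
#define REP_COUNT_BUGGY (DEMO_PAGE_SIZE - 2 * sizeof(u64) /	\
			 sizeof(union gpa_page_range))

/* Fixed: the whole header area is subtracted before dividing */
#define REP_COUNT_FIXED ((DEMO_PAGE_SIZE - 2 * sizeof(u64)) /	\
			 sizeof(union gpa_page_range))

int main(void)
{
	assert(REP_COUNT_BUGGY == 4094);	/* 4096 - 16/8: far too large */
	assert(REP_COUNT_FIXED == 510);		/* (4096 - 16) / 8: intended */
	return 0;
}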


Re: [PATCH V3 1/3] x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available

2019-02-10 Thread Thomas Gleixner
On Thu, 7 Feb 2019, lantianyu1...@gmail.com wrote:

> From: Lan Tianyu 
> 
> Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
> set x2apic destination mode to physical mode when x2apic is available
> and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs have
> 8-bit APIC id.

This looks good now. Can that be applied independently of the IOMMU stuff, or
should it go together with it? If the latter:

   Reviewed-by: Thomas Gleixner 

If not, I'll just queue it for 5.1. Let me know.

Thanks,

tglx


Re: [PATCH V4 2/15] KVM/MMU: Add tlb flush with range helper function

2018-10-14 Thread Thomas Gleixner
On Sun, 14 Oct 2018, Liran Alon wrote:
> > On 13 Oct 2018, at 17:53, lantianyu1...@gmail.com wrote:
> > 
> > From: Lan Tianyu 
> > 
> > This patch is to add wrapper functions for tlb_remote_flush_with_range
> > callback.
> > 
> > Signed-off-by: Lan Tianyu 
> > ---
> > Change since V3:
> >   Remove code of updating "tlbs_dirty"
> > Change since V2:
> >   Fix comment in the kvm_flush_remote_tlbs_with_range()
> > ---
> > arch/x86/kvm/mmu.c | 40 
> > 1 file changed, 40 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index c73d9f650de7..ff656d85903a 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -264,6 +264,46 @@ static void mmu_spte_set(u64 *sptep, u64 spte);
> > static union kvm_mmu_page_role
> > kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
> > 
> > +
> > +static inline bool kvm_available_flush_tlb_with_range(void)
> > +{
> > +   return kvm_x86_ops->tlb_remote_flush_with_range;
> > +}
> 
> Seems that kvm_available_flush_tlb_with_range() is not used in this patch…

What's wrong with that? 

It provides the implementation and later patches make use of it. It's a
sensible way to split patches into small, self contained entities.

Thanks,

tglx


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Thomas Gleixner
On Wed, 3 Oct 2018, Andy Lutomirski wrote:
> > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov  wrote:
> > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> > not mistaken, you need to enable nesting for the VM to get the feature -
> > and most VMs don't have this) so I think we'll have to keep Hyper-V
> > vclock for the time being.
> > 
> But this does suggest that the correct way to pass a clock through to an
> L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use
> kvmclock (or something newer and better).  This would require adding
> support for atomic frequency changes all the way through the timekeeping
> and arch code.
>
> John, tglx, would that be okay or crazy?

Not sure what you mean. I think I lost you somewhere on the way.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-27 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> > smaller than cycle_last. The TSC sync stuff does not catch the small delta
> > for unknown raisins. I'll go and find that machine and test that again.
> 
> Of course it does not trigger anymore. We accumulated code between the
> point in timekeeping_advance() where the TSC is read and the update of the
> VDSO data.
> 
> I might have to get a 2.6ish kernel booted on that machine and try with
> that again. /me shudders

Actually it does happen, because the TSC is very slowly drifting apart due
to SMI wreckage trying to hide itself. It just takes a very long time.

Thanks,

tglx




Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-27 Thread Thomas Gleixner
On Wed, 19 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
> > > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > >> Do we do better if we use signed arithmetic for the whole calculation?
> > >> Then a small backwards movement would result in a small backwards result.
> > >> Or we could offset everything so that we’d have to go back several
> > >> hundred ms before we cross zero.
> > > 
> > > That would be probably the better solution as signed math would be
> > > problematic when the resulting ns value becomes negative. As the delta is
> > > really small, otherwise the TSC sync check would have caught it, the 
> > > caller
> > > should never be able to observe time going backwards.
> > > 
> > > I'll have a look into that. It needs some thought vs. the fractional part
> > > of the base time, but it should be not rocket science to get that
> > > correct. Famous last words...
> > > 
> > 
> > It’s also fiddly to tune. If you offset it too much, then the fancy
> > divide-by-repeated-subtraction loop will hurt more than the comparison to
> > last.
> 
> Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
> won't hurt the magic loop, but it will definitely cover that slight offset
> case.

I got it working, but first of all the gain is close to 0.

There is this other subtle issue that we've seen TSCs slowly drifting apart
which is caught by the TSC watchdog eventually, but if it exceeds the offset
_before_ the watchdog triggers, we're back to square one.

So I rather stay on the safe side and just accept that we have to deal with
that. Sigh.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-19 Thread Thomas Gleixner
On Wed, 19 Sep 2018, Rasmus Villemoes wrote:
> On 2018-09-19 00:46, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >>>
> >>
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would be probably the better solution as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small, otherwise the TSC sync check would have caught it, the caller
> > should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> 
> Does the sentinel need to be U64_MAX? What if vgetcyc and its minions
> returned gtod->cycle_last-1 (for some value of 1), and the caller just
> does "if ((s64)cycles - (s64)last < 0) return fallback; ns +=
> (cycles-last)* ...". That should just be a "sub ; js ; ". It's an extra
> load of ->cycle_last, but only on the path where we're heading for the
> fallback anyway. The value of 1 can be adjusted so that in the "js"
> path, we could detect and accept an rdtsc_ordered() call that's just a
> few 10s of cycles behind last and treat that as 0 and continue back on
> the normal path. But maybe it's hard to get gcc to generate the expected
> code.

I played around with a lot of variants and GCC generates all kinds of
interesting ASM. And at some point optimizing that math code is not buying
anything because the LFENCE before RDTSC is dominating all of it.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would be probably the better solution as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small, otherwise the TSC sync check would have caught it, the caller
> > should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> > 
> 
> It’s also fiddly to tune. If you offset it too much, then the fancy
> divide-by-repeated-subtraction loop will hurt more than the comparison to
> last.

Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
won't hurt the magic loop, but it will definitely cover that slight offset
case.

Thanks,

tglx



Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 12:52 AM, Thomas Gleixner  wrote:
> > 
> >> On Mon, 17 Sep 2018, John Stultz wrote:
> >>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> >>> Also, I'm not entirely convinced that this "last" thing is needed at
> >>> all.  John, what's the scenario under which we need it?
> >> 
> >> So my memory is probably a bit foggy, but I recall that as we
> >> accelerated gettimeofday, we found that even on systems that claimed
> >> to have synced TSCs, they were actually just slightly out of sync.
> >> Enough that right after cycles_last had been updated, a read on
> >> another cpu could come in just behind cycles_last, resulting in a
> >> negative interval causing lots of havoc.
> >> 
> >> So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> > 
> > @Andy: Welcome to the wonderful world of TSC.
> > 
> 
> Do we do better if we use signed arithmetic for the whole calculation?
> Then a small backwards movement would result in a small backwards result.
> Or we could offset everything so that we’d have to go back several
> hundred ms before we cross zero.

That would be probably the better solution as signed math would be
problematic when the resulting ns value becomes negative. As the delta is
really small, otherwise the TSC sync check would have caught it, the caller
should never be able to observe time going backwards.

I'll have a look into that. It needs some thought vs. the fractional part
of the base time, but it should be not rocket science to get that
correct. Famous last words...

Thanks,

tglx




Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.

Of course it does not trigger anymore. We accumulated code between the
point in timekeeping_advance() where the TSC is read and the update of the
VDSO data.

I might have to get a 2.6ish kernel booted on that machine and try with
that again. /me shudders

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > I still have one of the machines which is affected by this.
> 
> Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> current code:

The load order of last vs. rdtsc does not matter at all.

CPU0                                    CPU1

now0 = rdtsc_ordered();
...
tk->cycle_last = now0;

gtod->seq++;
gtod->cycle_last = tk->cycle_last;
...
gtod->seq++;
                                        seq_begin(gtod->seq);
                                        now1 = rdtsc_ordered();

So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
smaller than cycle_last. The TSC sync stuff does not catch the small delta
for unknown raisins. I'll go and find that machine and test that again.

Thanks,

tglx





Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > lack TSC_ADJUST.
> > > 
> > > But, if the gtod code can observe this, then why doesn't the code that
> > > checks the sync?
> > 
> > Because it depends where the involved CPUs are in the topology. The sync
> > code might just run on the same package and simply not see it. Yes, w/o
> > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> 
> Even with TSC adjust the TSC can be slightly off by design on multi-socket
> systems.

Here are the gory details:

   
https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89a...@mail.gmail.com/

The changelog has an explanation as well.

d8bb6f4c1670 ("x86: tsc prevent time going backwards")

I still have one of the machines which is affected by this.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > Your memory serves you right. That's indeed observable on CPUs which
> > > lack TSC_ADJUST.
> > 
> > But, if the gtod code can observe this, then why doesn't the code that
> > checks the sync?
> 
> Because it depends where the involved CPUs are in the topology. The sync
> code might just run on the same package an simply not see it. Yes, w/o
> TSC_ADJUST the TSC sync code can just fail to see the havoc.

Even with TSC adjust the TSC can be slightly off by design on multi-socket
systems.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Sep 2018, John Stultz wrote:
> > > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> > > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > > all.  John, what's the scenario under which we need it?
> > > 
> > > So my memory is probably a bit foggy, but I recall that as we
> > > accelerated gettimeofday, we found that even on systems that claimed
> > > to have synced TSCs, they were actually just slightly out of sync.
> > > Enough that right after cycles_last had been updated, a read on
> > > another cpu could come in just behind cycles_last, resulting in a
> > > negative interval causing lots of havoc.
> > > 
> > > So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> 
> But, if the gtod code can observe this, then why doesn't the code that
> checks the sync?

Because it depends where the involved CPUs are in the topology. The sync
code might just run on the same package and simply not see it. Yes, w/o
TSC_ADJUST the TSC sync code can just fail to see the havoc.

Thanks,

tglx






Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Mon, 17 Sep 2018, John Stultz wrote:
> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> > Also, I'm not entirely convinced that this "last" thing is needed at
> > all.  John, what's the scenario under which we need it?
> 
> So my memory is probably a bit foggy, but I recall that as we
> accelerated gettimeofday, we found that even on systems that claimed
> to have synced TSCs, they were actually just slightly out of sync.
> Enough that right after cycles_last had been updated, a read on
> another cpu could come in just behind cycles_last, resulting in a
> negative interval causing lots of havoc.
> 
> So the sanity check is needed to avoid that case.

Your memory serves you right. That's indeed observable on CPUs which
lack TSC_ADJUST.

@Andy: Welcome to the wonderful world of TSC.

Thanks,

tglx



[patch V2 11/11] x86/vdso: Add CLOCK_TAI support

2018-09-17 Thread Thomas Gleixner
With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Extend the base time storage array and add the update code.

Signed-off-by: Thomas Gleixner 
---

V2: Remove the masking trick

 arch/x86/entry/vsyscall/vsyscall_gtod.c |4 
 arch/x86/include/asm/vgtod.h|4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
base->sec = tk->xtime_sec;
base->nsec = tk->tkr_mono.xtime_nsec;
 
+   base = &vdata->basetime[CLOCK_TAI];
+   base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+   base->nsec = tk->tkr_mono.xtime_nsec;
+
base = &vdata->basetime[CLOCK_MONOTONIC];
base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -18,8 +18,8 @@ struct vgtod_ts {
u64 nsec;
 };
 
-#define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_BASES	(CLOCK_TAI + 1)
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
 #define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
 
 /*




[patch V2 03/11] x86/vdso: Enforce 64bit clocksource

2018-09-17 Thread Thomas Gleixner
All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is
no point in masking with all FF. Get rid of it and enforce the mask in the
sanity checker.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |2 +-
 arch/x86/kernel/time.c   |6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m
 #endif
else
return 0;
-   v = (cycles - gtod->cycle_last) & gtod->mask;
+   v = cycles - gtod->cycle_last;
return v * gtod->mult;
 }
 
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks
cs->name, cs->archdata.vclock_mode);
cs->archdata.vclock_mode = VCLOCK_NONE;
}
+
+   if (cs->mask != CLOCKSOURCE_MASK(64)) {
+   pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n",
+   cs->name, cs->mask);
+   cs->archdata.vclock_mode = VCLOCK_NONE;
+   }
 }




[patch V2 06/11] x86/vdso: Collapse high resolution functions

2018-09-17 Thread Thomas Gleixner
do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   35 +++
 1 file changed, 7 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
+   struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   u64 ns;
int mode;
-
-   do {
-   seq = gtod_read_begin(gtod);
-   mode = gtod->vclock_mode;
-   ts->tv_sec = base->sec;
-   ns = base->nsec;
-   ns += vgetsns(&mode);
-   ns >>= gtod->shift;
-   } while (unlikely(gtod_read_retry(gtod, seq)));
-
-   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-   ts->tv_nsec = ns;
-
-   return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
-   unsigned int seq;
u64 ns;
-   int mode;
 
do {
seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
 {
switch (clock) {
case CLOCK_REALTIME:
-   if (do_realtime(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_MONOTONIC:
-   if (do_monotonic(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
if (likely(tv != NULL)) {
-   if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+   struct timespec *ts = (struct timespec *) tv;
+
+   if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
return vdso_fallback_gtod(tv, tz);
tv->tv_usec /= 1000;
}




[patch V2 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-17 Thread Thomas Gleixner
The code flow for the vclocks is convoluted as it requires the vclocks
which can be invalidated separately from the vsyscall_gtod_data sequence to
store the fact in a separate variable. That's inefficient.

Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.

If the clock gets invalidated or vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks and leave the
sequence loop immediately in that case by calling the fallback function
directly.

This allows removing the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as it does when the vclock is functional. It
does not make a difference whether the division by 1000 happens in the
kernel fallback or in userspace.

Generates way better code and gains a few cycles back.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   81 +--
 1 file changed, 21 insertions(+), 60 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm("syscall" : "=a" (ret) :
-   "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
-   return ret;
-}
-
-
 #else
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
 #endif
 
 #ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
 {
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
do {
version = pvclock_read_begin(pvti);
 
-   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-   *mode = VCLOCK_NONE;
-   return 0;
-   }
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+   return U64_MAX;
 
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
 {
const struct ms_hyperv_tsc_page *tsc_pg =
(const struct ms_hyperv_tsc_page *)&hvclock_page;
-   u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-   if (current_tick != U64_MAX)
-   return current_tick;
 
-   *mode = VCLOCK_NONE;
-   return 0;
+   return hv_read_tsc_page(tsc_pg);
 }
 #endif
 
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   if (mode == VCLOCK_TSC)
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   else if (mode == VCLOCK_PVCLOCK)
+   return vread_pvclock();
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   else if (mode == VCLOCK_HVCLOCK)
+   return vread_hvclock();
 #endif
-   else
-   return 0;
-   v = cycles - gtod->cycle_last;
-   return v * gtod->mult;
+   return U64_MAX;
 }
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   int mode;
-   u64 ns;
+   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
-   mode = gtod->vclock_mode;
ts->tv_sec = base->sec;

[patch V2 10/11] x86/vdso: Move cycle_last handling into the caller

2018-09-17 Thread Thomas Gleixner
Dereferencing gtod->cycle_last all over the place and doing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.

Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   39 ++-
 1 file changed, 7 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
 static notrace u64 vread_pvclock(void)
 {
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
-   u64 ret;
-   u64 last;
u32 version;
+   u64 ret;
 
/*
 * Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
 
-   /* refer to vread_tsc() comment for rationale */
-   last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   return last;
+   return ret;
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
 }
 #endif
 
-notrace static u64 vread_tsc(void)
-{
-   u64 ret = (u64)rdtsc_ordered();
-   u64 last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   /*
-* GCC likes to generate cmov here, but this branch is extremely
-* predictable (it's just a function of time and the likely is
-* very likely) and there's a data dependence, so force GCC
-* to generate a branch instead.  I don't barrier() because
-* we don't actually need a barrier, and if this function
-* ever gets inlined it will generate worse code.
-*/
-   asm volatile ("");
-   return last;
-}
-
 notrace static inline u64 vgetcyc(int mode)
 {
if (mode == VCLOCK_TSC)
-   return vread_tsc();
+   return (u64)rdtsc_ordered();
 #ifdef CONFIG_PARAVIRT_CLOCK
else if (mode == VCLOCK_PVCLOCK)
return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
struct vgtod_ts *base = &gtod->basetime[clk];
+   u64 cycles, last, ns;
unsigned int seq;
-   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = base->sec;
ns = base->nsec;
+   last = gtod->cycle_last;
cycles = vgetcyc(gtod->vclock_mode);
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
-   ns += (cycles - gtod->cycle_last) * gtod->mult;
+   if (cycles > last)
+   ns += (cycles - last) * gtod->mult;
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
 




[patch V2 05/11] x86/vdso: Introduce and use vgtod_ts

2018-09-17 Thread Thomas Gleixner
It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines or when the compiler is forced to avoid jump tables it
results in even more conditionals.

To avoid both variants which are bad for performance the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires storing the clock-specific base time in an array.

Introduce struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.

The new storage no longer uses gtod_long_t for seconds depending on 32
or 64 bit compile because this needs to be the full 64bit value even for
32bit when a Y2038 function is added. No point in keeping the distinction
alive in the internal representation.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c|   24 +--
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   51 
 arch/x86/include/asm/vgtod.h|   36 --
 3 files changed, 61 insertions(+), 50 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
unsigned int seq;
u64 ns;
int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->wall_time_sec;
-   ns = gtod->wall_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
unsigned int seq;
u64 ns;
int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->monotonic_time_sec;
-   ns = gtod->monotonic_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->wall_time_coarse_sec;
-   ts->tv_nsec = gtod->wall_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->monotonic_time_coarse_sec;
-   ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
 notrace time_t __vdso_time(time_t *t)
 {
/* This is atomic on x86 so we don't need any locks. */
-   time_t result = READ_ONCE(gtod->wall_time_sec);
+   time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
 
if (t)
*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
 {
int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+   struct vgtod_ts *base;
+   u64 nsec;
 
/* Mark the new vclock used. */
BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
 
-   vdata->wall_time_sec= tk->xtime_sec;
-   vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
-
-   

[patch V2 08/11] x86/vdso: Replace the clockid switch case

2018-09-17 Thread Thomas Gleixner
Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.

- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
  and call the fallback function right away.

- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
  bitmask

- Check for the supported high resolution and coarse functions by anding
  the bitmask of supported clocks and check whether a bit is set.

This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   38 ---
 1 file changed, 18 insertions(+), 20 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-   switch (clock) {
-   case CLOCK_REALTIME:
-   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_MONOTONIC:
-   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_REALTIME_COARSE:
-   do_coarse(CLOCK_REALTIME_COARSE, ts);
-   break;
-   case CLOCK_MONOTONIC_COARSE:
-   do_coarse(CLOCK_MONOTONIC_COARSE, ts);
-   break;
-   default:
-   goto fallback;
-   }
+   unsigned int msk;
+
+   /* Sort out negative (CPU/FD) and invalid clocks */
+   if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+   return vdso_fallback_gettime(clock, ts);
 
-   return 0;
-fallback:
+   /*
+* Convert the clockid to a bitmask and use it to check which
+* clocks are handled in the VDSO directly.
+*/
+   msk = 1U << clock;
+   if (likely(msk & VGTOD_HRES)) {
+   if (do_hres(clock, ts) != VCLOCK_NONE)
+   return 0;
+   } else if (msk & VGTOD_COARSE) {
+   do_coarse(clock, ts);
+   return 0;
+   }
return vdso_fallback_gettime(clock, ts);
 }
+
 int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
 




[patch V2 01/11] clocksource: Provide clocksource_arch_init()

2018-09-17 Thread Thomas Gleixner
Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations for this
available. Add support for that.

Signed-off-by: Thomas Gleixner 
---
 include/linux/clocksource.h |5 +
 kernel/time/Kconfig |4 
 kernel/time/clocksource.c   |2 ++
 3 files changed, 11 insertions(+)

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -241,6 +241,11 @@ static inline void __clocksource_update_
__clocksource_update_freq_scale(cs, 1000, khz);
 }
 
+#ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
+extern void clocksource_arch_init(struct clocksource *cs);
+#else
+static inline void clocksource_arch_init(struct clocksource *cs) { }
+#endif
 
 extern int timekeeping_notify(struct clocksource *clock);
 
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -12,6 +12,10 @@ config CLOCKSOURCE_WATCHDOG
 config ARCH_CLOCKSOURCE_DATA
bool
 
+# Architecture has extra clocksource init called from registration
+config ARCH_CLOCKSOURCE_INIT
+   bool
+
 # Clocksources require validation of the clocksource against the last
 # cycle update - x86/TSC misfeature
 config CLOCKSOURCE_VALIDATE_LAST_CYCLE
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -937,6 +937,8 @@ int __clocksource_register_scale(struct
 {
unsigned long flags;
 
+   clocksource_arch_init(cs);
+
/* Initialize mult/shift and max_idle_ns */
__clocksource_update_freq_scale(cs, scale, freq);
 




[patch V2 07/11] x86/vdso: Collapse coarse functions

2018-09-17 Thread Thomas Gleixner
do_realtime_coarse() and do_monotonic_coarse() are now the same except for
the storage array index. Hand the index in as an argument and collapse the
functions.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   20 
 1 file changed, 4 insertions(+), 16 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk
return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
-   unsigned int seq;
-
-   do {
-   seq = gtod_read_begin(gtod);
-   ts->tv_sec = base->sec;
-   ts->tv_nsec = base->nsec;
-   } while (unlikely(gtod_read_retry(gtod, seq)));
-}
-
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
+   struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
 
do {
@@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
-   do_realtime_coarse(ts);
+   do_coarse(CLOCK_REALTIME_COARSE, ts);
break;
case CLOCK_MONOTONIC_COARSE:
-   do_monotonic_coarse(ts);
+   do_coarse(CLOCK_MONOTONIC_COARSE, ts);
break;
default:
goto fallback;




[patch V2 02/11] x86/time: Implement clocksource_arch_init()

2018-09-17 Thread Thomas Gleixner
Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
VCLOCK if invalid, which disables the VDSO but keeps the system running.

Signed-off-by: Thomas Gleixner 
---

V2: Fix the VCLOCK_MAX check

 arch/x86/Kconfig   |1 +
 arch/x86/kernel/time.c |   16 
 2 files changed, 17 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -48,6 +48,7 @@ config X86
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
select ANON_INODES
select ARCH_CLOCKSOURCE_DATA
+   select ARCH_CLOCKSOURCE_INIT
select ARCH_DISCARD_MEMBLOCK
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_DEBUG_VIRTUAL
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -105,3 +106,18 @@ void __init time_init(void)
 {
late_time_init = x86_late_time_init;
 }
+
+/*
+ * Sanity check the vdso related archdata content.
+ */
+void clocksource_arch_init(struct clocksource *cs)
+{
+   if (cs->archdata.vclock_mode == VCLOCK_NONE)
+   return;
+
+   if (cs->archdata.vclock_mode > VCLOCK_MAX) {
+   pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
+   cs->name, cs->archdata.vclock_mode);
+   cs->archdata.vclock_mode = VCLOCK_NONE;
+   }
+}




[patch V2 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq

2018-09-17 Thread Thomas Gleixner
The sequence count in vgtod_data is unsigned int, but the call sites use
unsigned long, which is a pointless exercise. Fix the call sites and
replace 'unsigned' with 'unsigned int' while at it.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |8 
 arch/x86/include/asm/vgtod.h |   10 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
u64 ns;
int mode;
 
@@ -227,7 +227,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
u64 ns;
int mode;
 
@@ -248,7 +248,7 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->wall_time_coarse_sec;
@@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->monotonic_time_coarse_sec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t;
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-   unsigned seq;
+   unsigned int seq;
 
-   int vclock_mode;
+   int vclock_mode;
u64 cycle_last;
u64 mask;
u32 mult;
@@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v
return READ_ONCE(vclocks_used) & (1 << vclock);
 }
 
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
 {
-   unsigned ret;
+   unsigned int ret;
 
 repeat:
ret = READ_ONCE(s->seq);
@@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c
 }
 
 static inline int gtod_read_retry(const struct vsyscall_gtod_data *s,
-   unsigned start)
+ unsigned int start)
 {
smp_rmb();
return unlikely(s->seq != start);




[patch V2 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-17 Thread Thomas Gleixner
Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
implementation, which extended the clockid switch case and added yet
another slightly different copy of the same code.

Especially the extended switch case is problematic as the compiler tends to
generate a jump table which then requires the use of retpolines. If jump tables
are disabled it adds yet another conditional to the existing maze.

This series takes a different approach by consolidating the almost
identical functions into one implementation for high resolution clocks and
one for the coarse grained clock ids by storing the base data for each
clock id in an array which is indexed by the clock id.

This completely eliminates the switch case and allows further
simplifications of the code base, which in the end all together gain a few
cycles of performance or at least stay on par with today's code. The resulting
performance depends heavily on the micro architecture and the compiler.

Changes vs. V1:

  - Fix the VCLOCK_MAX sanity check
  - Remove the magic clock masking and extend the storage array

Thanks,

tglx

8<---
 arch/x86/Kconfig|1 
 arch/x86/entry/vdso/vclock_gettime.c|  199 
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   55 
 arch/x86/include/asm/vgtod.h|   42 +++---
 arch/x86/kernel/time.c  |   22 +++
 include/linux/clocksource.h |5 
 kernel/time/Kconfig |4 
 kernel/time/clocksource.c   |2 
 8 files changed, 140 insertions(+), 190 deletions(-)





Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-17 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Arnd Bergmann wrote:
> On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner  wrote:
> A couple of architectures (s390, ia64, riscv, powerpc, arm64)
> implement the vdso as assembler code at the moment, so they
> won't be as easy to consolidate (other than outright replacing all
> the code).
> 
> The other five:
> arch/x86/entry/vdso/vclock_gettime.c
> arch/sparc/vdso/vclock_gettime.c
> arch/nds32/kernel/vdso/gettimeofday.c
> arch/mips/vdso/gettimeofday.c
> arch/arm/vdso/vgettimeofday.c
>
> are basically all minor variations of the same code base and could be
> consolidated to some degree.
> Any suggestions here? Should we plan to do that consolidation based on
> your new version, or just add clock_gettime64 in arm32 and x86-32, and then
> be done with it? The other ones will obviously still be fast for 32-bit time_t
> and will have a working non-vdso sys_clock_gettime64().

In principle consolidating all those implementations should be possible to
some extent and probably worthwhile. What's arch specific are the actual
accessors to the hardware clocks.

> I also wonder about clock_getres(): half the architectures seem to implement
> it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
> performance critical given that the result is easily cached in user space.

getres() is not really performance critical, but adding it does not create
a huge problem either.

Thanks,

tglx



Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support

2018-09-16 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> > On Sep 14, 2018, at 7:27 AM, Thomas Gleixner  wrote:
> > On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> >> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
> >> to assert that this actually works?
> > 
> > Sure, but changing any of the clock ids will cause more wreckage than that.
> > 
> I’m more concerned that we add a new one and break the magic
> masking. Maybe two start sharing the same slot.

You are right. The obvious extension is CLOCK_BOOTTIME. That's id 7 which
indeed conflicts. I'll remove the magic.

Thanks,

tglx


Re: [patch 02/11] x86/time: Implement clocksource_arch_init()

2018-09-15 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Vitaly Kuznetsov wrote:
> Thomas Gleixner  writes:
> > +
> > +   if (cs->archdata.vclock_mode >= VCLOCK_MAX) {
> 
> It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3

Indeed.

Thanks,

tglx


Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support

2018-09-14 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> > On Sep 14, 2018, at 5:50 AM, Thomas Gleixner  wrote:
> > 
> > With the storage array in place it's now trivial to support CLOCK_TAI in
> > the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
> > of the fact that:
> > 
> > - CLOCK ids are set in stone
> > - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
> >   the array slot 3 is unused
> > - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> > 
> > Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> > time in update_vsyscall().
> 
> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
> to assert that this actually works?

Sure, but changing any of the clock ids will cause more wreckage than that.

Thanks,

tglx


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-14 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 03:05 PM, Thomas Gleixner wrote:
> > On Fri, 14 Sep 2018, Florian Weimer wrote:
> > 
> > > On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > implementation, which extended the clockid switch case and added yet
> > > > another slightly different copy of the same code.
> > > > 
> > > > Especially the extended switch case is problematic as the compiler tends
> > > > to
> > > > generate a jump table which then requires to use retpolines.
> > > 
> > > Does vDSO code really have to use retpolines?  It's in userspace, after
> > > all.
> > 
> > Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well.
> 
> I don't think this is a consensus position, and it obviously depends on the
> (sub)architecture.

It does, but we are not building kernels for specific microarchitectures
nor do distros AFAIK.

But that aside, even with jump tables the generated code (ratpoutine
disabled) is not any better than with the current approach. It's even worse
and needs the same optimizations at the end and then the main difference is
that you have 3 copies of one function and 2 of the other. Not a win at
all.

Thanks,

tglx




Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-14 Thread Thomas Gleixner
On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > Especially the extended switch case is problematic as the compiler tends to
> > generate a jump table which then requires to use retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well.

Thanks,

tglx


[patch 06/11] x86/vdso: Collapse high resolution functions

2018-09-14 Thread Thomas Gleixner
do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   35 +++
 1 file changed, 7 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
+   struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   u64 ns;
int mode;
-
-   do {
-   seq = gtod_read_begin(gtod);
-   mode = gtod->vclock_mode;
-   ts->tv_sec = base->sec;
-   ns = base->nsec;
-   ns += vgetsns(&mode);
-   ns >>= gtod->shift;
-   } while (unlikely(gtod_read_retry(gtod, seq)));
-
-   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-   ts->tv_nsec = ns;
-
-   return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
-   unsigned int seq;
u64 ns;
-   int mode;
 
do {
seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
 {
switch (clock) {
case CLOCK_REALTIME:
-   if (do_realtime(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_MONOTONIC:
-   if (do_monotonic(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
if (likely(tv != NULL)) {
-   if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+   struct timespec *ts = (struct timespec *) tv;
+
+   if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
return vdso_fallback_gtod(tv, tz);
tv->tv_usec /= 1000;
}




[patch 07/11] x86/vdso: Collapse coarse functions

2018-09-14 Thread Thomas Gleixner
do_realtime_coarse() and do_monotonic_coarse() are now the same except for
the storage array index. Hand the index in as an argument and collapse the
functions.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   20 
 1 file changed, 4 insertions(+), 16 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk
return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
-   unsigned int seq;
-
-   do {
-   seq = gtod_read_begin(gtod);
-   ts->tv_sec = base->sec;
-   ts->tv_nsec = base->nsec;
-   } while (unlikely(gtod_read_retry(gtod, seq)));
-}
-
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
+   struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
 
do {
@@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
-   do_realtime_coarse(ts);
+   do_coarse(CLOCK_REALTIME_COARSE, ts);
break;
case CLOCK_MONOTONIC_COARSE:
-   do_monotonic_coarse(ts);
+   do_coarse(CLOCK_MONOTONIC_COARSE, ts);
break;
default:
goto fallback;




[patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq

2018-09-14 Thread Thomas Gleixner
The sequence count in vgtod_data is unsigned int, but the call sites use
unsigned long, which is a pointless exercise. Fix the call sites and
replace 'unsigned' with 'unsigned int' while at it.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |8 
 arch/x86/include/asm/vgtod.h |   10 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
u64 ns;
int mode;
 
@@ -227,7 +227,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
u64 ns;
int mode;
 
@@ -248,7 +248,7 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->wall_time_coarse_sec;
@@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
-   unsigned long seq;
+   unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->monotonic_time_coarse_sec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t;
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-   unsigned seq;
+   unsigned int seq;
 
-   int vclock_mode;
+   int vclock_mode;
u64 cycle_last;
u64 mask;
u32 mult;
@@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v
return READ_ONCE(vclocks_used) & (1 << vclock);
 }
 
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
 {
-   unsigned ret;
+   unsigned int ret;
 
 repeat:
ret = READ_ONCE(s->seq);
@@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c
 }
 
 static inline int gtod_read_retry(const struct vsyscall_gtod_data *s,
-   unsigned start)
+ unsigned int start)
 {
smp_rmb();
return unlikely(s->seq != start);




[patch 08/11] x86/vdso: Replace the clockid switch case

2018-09-14 Thread Thomas Gleixner
Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.

- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
  and call the fallback function right away.

- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
  bitmask.

- Check for the supported high resolution and coarse functions by ANDing
  the bitmask of supported clocks and checking whether a bit is set.

This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   38 ---
 1 file changed, 18 insertions(+), 20 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-   switch (clock) {
-   case CLOCK_REALTIME:
-   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_MONOTONIC:
-   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_REALTIME_COARSE:
-   do_coarse(CLOCK_REALTIME_COARSE, ts);
-   break;
-   case CLOCK_MONOTONIC_COARSE:
-   do_coarse(CLOCK_MONOTONIC_COARSE, ts);
-   break;
-   default:
-   goto fallback;
-   }
+   unsigned int msk;
+
+   /* Sort out negative (CPU/FD) and invalid clocks */
+   if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+   return vdso_fallback_gettime(clock, ts);
 
-   return 0;
-fallback:
+   /*
+* Convert the clockid to a bitmask and use it to check which
+* clocks are handled in the VDSO directly.
+*/
+   msk = 1U << clock;
+   if (likely(msk & VGTOD_HRES)) {
+   if (do_hres(clock, ts) != VCLOCK_NONE)
+   return 0;
+   } else if (msk & VGTOD_COARSE) {
+   do_coarse(clock, ts);
+   return 0;
+   }
return vdso_fallback_gettime(clock, ts);
 }
+
 int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
 




[patch 11/11] x86/vdso: Add CLOCK_TAI support

2018-09-14 Thread Thomas Gleixner
With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
of the fact that:

 - CLOCK ids are set in stone
 - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
   the array slot 3 is unused
 - CLOCK_TAI is id 11 which results in 3 when masked with 0x3

Add the mask to the basetime array lookup and set up the CLOCK_TAI base
time in update_vsyscall().

The performance impact of the mask operation is within the noise.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c|2 +-
 arch/x86/entry/vsyscall/vsyscall_gtod.c |4 
 arch/x86/include/asm/vgtod.h|6 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[clk];
+   struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
u64 cycles, last, ns;
unsigned int seq;
 
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
base->sec = tk->xtime_sec;
base->nsec = tk->tkr_mono.xtime_nsec;
 
+   base = &vdata->basetime[VGTOD_TAI];
+   base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+   base->nsec = tk->tkr_mono.xtime_nsec;
+
base = &vdata->basetime[CLOCK_MONOTONIC];
base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -19,9 +19,13 @@ struct vgtod_ts {
 };
 
#define VGTOD_BASES    (CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
 #define VGTOD_COARSE   (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
 
+/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
+#define VGTOD_HRES_MASK    0x3
+#define VGTOD_TAI  (CLOCK_TAI & VGTOD_HRES_MASK)
+
 /*
  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
  * so be carefull by modifying this structure.




[patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-14 Thread Thomas Gleixner
Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
implementation, which extended the clockid switch case and added yet
another slightly different copy of the same code.

Especially the extended switch case is problematic as the compiler tends to
generate a jump table which then requires the use of retpolines. If jump
tables are disabled it adds yet another conditional to the existing maze.

This series takes a different approach by consolidating the almost
identical functions into one implementation for high resolution clocks and
one for the coarse grained clock ids by storing the base data for each
clock id in an array which is indexed by the clock id.

This completely eliminates the switch case and allows further
simplifications of the code base, which in the end gains a few cycles of
performance or at least stays on par with today's code. The resulting
performance depends heavily on the microarchitecture and the compiler.

Thanks,

tglx

8<---
 arch/x86/Kconfig                        |    1
 arch/x86/entry/vdso/vclock_gettime.c    |  199
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   55
 arch/x86/include/asm/vgtod.h            |   46 ---
 arch/x86/kernel/time.c                  |   22 +++
 include/linux/clocksource.h             |    5
 kernel/time/Kconfig                     |    4
 kernel/time/clocksource.c               |    2
 8 files changed, 144 insertions(+), 190 deletions(-)



[patch 05/11] x86/vdso: Introduce and use vgtod_ts

2018-09-14 Thread Thomas Gleixner
It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines, or, when the compiler is forced to avoid jump tables,
in even more conditionals.

To avoid both variants, which are bad for performance, the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires storing the clock specific base time in an array.

Introduce struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.

The new storage no longer uses gtod_long_t for seconds, which depended on a
32 or 64 bit compile, because the seconds value needs to be the full 64bit
value even for 32bit once a Y2038 capable function is added. No point in
keeping the distinction alive in the internal representation.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c|   24 +--
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   51 
 arch/x86/include/asm/vgtod.h|   36 --
 3 files changed, 61 insertions(+), 50 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
unsigned int seq;
u64 ns;
int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->wall_time_sec;
-   ns = gtod->wall_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
unsigned int seq;
u64 ns;
int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->monotonic_time_sec;
-   ns = gtod->monotonic_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->wall_time_coarse_sec;
-   ts->tv_nsec = gtod->wall_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->monotonic_time_coarse_sec;
-   ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
 notrace time_t __vdso_time(time_t *t)
 {
/* This is atomic on x86 so we don't need any locks. */
-   time_t result = READ_ONCE(gtod->wall_time_sec);
+   time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
 
if (t)
*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
 {
int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+   struct vgtod_ts *base;
+   u64 nsec;
 
/* Mark the new vclock used. */
BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
vdata->mult = tk->tkr_mono.mult;
vdata->shift    = tk->tkr_mono.shift;
 
-   vdata->wall_time_sec    = tk->xtime_sec;
-   vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
-
-   

[patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-14 Thread Thomas Gleixner
The code flow for the vclocks is convoluted as it requires the vclocks,
which can be invalidated separately from the vsyscall_gtod_data sequence, to
store that fact in a separate variable. That's inefficient.

Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.

If the clock gets invalidated or vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks and leave the
sequence loop immediately in that case by calling the fallback function
directly.

This allows removing the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as it does when the vclock is functional. It
does not make a difference whether the division by 1000 happens in the
kernel fallback or in userspace.

Generates way better code and gains a few cycles back.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   81 +--
 1 file changed, 21 insertions(+), 60 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm("syscall" : "=a" (ret) :
-   "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
-   return ret;
-}
-
-
 #else
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
 #endif
 
 #ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
 {
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
do {
version = pvclock_read_begin(pvti);
 
-   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-   *mode = VCLOCK_NONE;
-   return 0;
-   }
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+   return U64_MAX;
 
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
 {
const struct ms_hyperv_tsc_page *tsc_pg =
(const struct ms_hyperv_tsc_page *)&hvclock_page;
-   u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-   if (current_tick != U64_MAX)
-   return current_tick;
 
-   *mode = VCLOCK_NONE;
-   return 0;
+   return hv_read_tsc_page(tsc_pg);
 }
 #endif
 
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   if (mode == VCLOCK_TSC)
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   else if (mode == VCLOCK_PVCLOCK)
+   return vread_pvclock();
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   else if (mode == VCLOCK_HVCLOCK)
+   return vread_hvclock();
 #endif
-   else
-   return 0;
-   v = cycles - gtod->cycle_last;
-   return v * gtod->mult;
+   return U64_MAX;
 }
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   int mode;
-   u64 ns;
+   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
-   mode = gtod->vclock_mode;
ts->tv_sec = base->sec;

[patch 02/11] x86/time: Implement clocksource_arch_init()

2018-09-14 Thread Thomas Gleixner
Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
VCLOCK if invalid, which disables the VDSO but keeps the system running.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/Kconfig   |1 +
 arch/x86/kernel/time.c |   16 
 2 files changed, 17 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -48,6 +48,7 @@ config X86
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
select ANON_INODES
select ARCH_CLOCKSOURCE_DATA
+   select ARCH_CLOCKSOURCE_INIT
select ARCH_DISCARD_MEMBLOCK
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_DEBUG_VIRTUAL
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -105,3 +106,18 @@ void __init time_init(void)
 {
late_time_init = x86_late_time_init;
 }
+
+/*
+ * Sanity check the vdso related archdata content.
+ */
+void clocksource_arch_init(struct clocksource *cs)
+{
+   if (cs->archdata.vclock_mode == VCLOCK_NONE)
+   return;
+
+   if (cs->archdata.vclock_mode >= VCLOCK_MAX) {
+   pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
+   cs->name, cs->archdata.vclock_mode);
+   cs->archdata.vclock_mode = VCLOCK_NONE;
+   }
+}




[patch 01/11] clocksource: Provide clocksource_arch_init()

2018-09-14 Thread Thomas Gleixner
Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations available
for this. Add support for that.

Signed-off-by: Thomas Gleixner 
---
 include/linux/clocksource.h |5 +
 kernel/time/Kconfig |4 
 kernel/time/clocksource.c   |2 ++
 3 files changed, 11 insertions(+)

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -241,6 +241,11 @@ static inline void __clocksource_update_
__clocksource_update_freq_scale(cs, 1000, khz);
 }
 
+#ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
+extern void clocksource_arch_init(struct clocksource *cs);
+#else
+static inline void clocksource_arch_init(struct clocksource *cs) { }
+#endif
 
 extern int timekeeping_notify(struct clocksource *clock);
 
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -12,6 +12,10 @@ config CLOCKSOURCE_WATCHDOG
 config ARCH_CLOCKSOURCE_DATA
bool
 
+# Architecture has extra clocksource init called from registration
+config ARCH_CLOCKSOURCE_INIT
+   bool
+
 # Clocksources require validation of the clocksource against the last
 # cycle update - x86/TSC misfeature
 config CLOCKSOURCE_VALIDATE_LAST_CYCLE
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -937,6 +937,8 @@ int __clocksource_register_scale(struct
 {
unsigned long flags;
 
+   clocksource_arch_init(cs);
+
/* Initialize mult/shift and max_idle_ns */
__clocksource_update_freq_scale(cs, scale, freq);
 




[patch 03/11] x86/vdso: Enforce 64bit clocksource

2018-09-14 Thread Thomas Gleixner
All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is
no point in masking with all FF. Get rid of it and enforce the mask in the
sanity checker.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |2 +-
 arch/x86/kernel/time.c   |6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m
 #endif
else
return 0;
-   v = (cycles - gtod->cycle_last) & gtod->mask;
+   v = cycles - gtod->cycle_last;
return v * gtod->mult;
 }
 
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks
cs->name, cs->archdata.vclock_mode);
cs->archdata.vclock_mode = VCLOCK_NONE;
}
+
+   if (cs->mask != CLOCKSOURCE_MASK(64)) {
+   pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n",
+   cs->name, cs->mask);
+   cs->archdata.vclock_mode = VCLOCK_NONE;
+   }
 }




[patch 10/11] x86/vdso: Move cycle_last handling into the caller

2018-09-14 Thread Thomas Gleixner
Dereferencing gtod->cycle_last all over the place and doing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.

Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   39 ++-
 1 file changed, 7 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
 static notrace u64 vread_pvclock(void)
 {
	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
-   u64 ret;
-   u64 last;
u32 version;
+   u64 ret;
 
/*
 * Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
 
-   /* refer to vread_tsc() comment for rationale */
-   last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   return last;
+   return ret;
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
 }
 #endif
 
-notrace static u64 vread_tsc(void)
-{
-   u64 ret = (u64)rdtsc_ordered();
-   u64 last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   /*
-* GCC likes to generate cmov here, but this branch is extremely
-* predictable (it's just a function of time and the likely is
-* very likely) and there's a data dependence, so force GCC
-* to generate a branch instead.  I don't barrier() because
-* we don't actually need a barrier, and if this function
-* ever gets inlined it will generate worse code.
-*/
-   asm volatile ("");
-   return last;
-}
-
 notrace static inline u64 vgetcyc(int mode)
 {
if (mode == VCLOCK_TSC)
-   return vread_tsc();
+   return (u64)rdtsc_ordered();
 #ifdef CONFIG_PARAVIRT_CLOCK
else if (mode == VCLOCK_PVCLOCK)
return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
	struct vgtod_ts *base = &gtod->basetime[clk];
+   u64 cycles, last, ns;
unsigned int seq;
-   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = base->sec;
ns = base->nsec;
+   last = gtod->cycle_last;
cycles = vgetcyc(gtod->vclock_mode);
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
-   ns += (cycles - gtod->cycle_last) * gtod->mult;
+   if (cycles > last)
+   ns += (cycles - last) * gtod->mult;
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
 

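The net effect of the hunks above is that do_hres() samples cycle_last before reading the counter and only accumulates a non-negative delta. A minimal, self-contained C model of that pattern follows; ns_accum() and its parameters are hypothetical stand-ins for the do_hres() arithmetic, not kernel API:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical model of the reworked do_hres() accumulation: the caller
 * reads cycle_last before the counter and only adds a positive delta. */
static uint64_t ns_accum(uint64_t cycles, uint64_t last,
                         uint64_t base_ns, uint32_t mult, uint32_t shift)
{
        uint64_t ns = base_ns;

        /* Guard against a counter that appears to run backwards (e.g. an
         * unsynchronized TSC): skip negative deltas instead of letting the
         * unsigned subtraction wrap around to a huge value. */
        if (cycles > last)
                ns += (cycles - last) * mult;
        return ns >> shift;
}
```

This matches the removed vread_tsc() semantics: when the counter reads below cycle_last, the delta is treated as zero rather than wrapping.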



Re: [PATCH V3 1/4] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-07-19 Thread Thomas Gleixner
On Thu, 19 Jul 2018, Tianyu Lan wrote:
> On 7/19/2018 8:05 PM, Thomas Gleixner wrote:
> > You already have the SPDX identifier. So the GPL boilerplate is not really
> > required, unless your legal department insists on it.
> > 
> 
> Hi Thomas:
>   Thanks for your reminder. How about the following?
> 
> 
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V nested virtualization code.
> + *
> + * Copyright (C) 2018, Microsoft, Inc.
> + *
> + * Author : Lan Tianyu 
> + */

Perfect :)


Re: [PATCH V3 1/4] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-07-19 Thread Thomas Gleixner
On Thu, 19 Jul 2018, Tianyu Lan wrote:
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V nested virtualization code.
> + *
> + * Copyright (C) 2018, Microsoft, Inc.
> + *
> + * Author : Lan Tianyu 
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.

You already have the SPDX identifier. So the GPL boilerplate is not really
required, unless your legal department insists on it.

Thanks,

tglx


RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-06 Thread Thomas Gleixner
On Fri, 6 Jul 2018, Michael Kelley (EOSG) wrote:
> > -Original Message-
> > From: Thomas Gleixner 
> > Sent: Friday, July 6, 2018 3:42 AM
> > To: KY Srinivasan 
> > Cc: Ingo Molnar ; x...@kernel.org; 
> > gre...@linuxfoundation.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; 
> > a...@canonical.com;
> > jasow...@redhat.com; h...@zytor.com; Stephen Hemminger 
> > ;
> > Michael Kelley (EOSG) ; vkuzn...@redhat.com
> > Subject: RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI 
> > enlightenment.

Can you please strip that useless duplication of the mail header from your
replies. Manually if you can't teach your MUA to avoid that :)

> > On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> > Applied it to x86/urgent and fixed up the '-1' sloppy hack as pointed out
> > by Vitaly and merged x86/urgent into x86/hyperv.
> > 
> > Please check both branches for correctness.
> 
> Changes look good to me.  Some pre-existing signed/unsigned type sloppiness is
> still there in that hv_cpu_number_to_vp_number() should be 'u32' instead
> of 'int', along with related local variables, but that's probably best to
> fix in linux-next on top of Vitaly's other changes.  This code will work.

Yes, delta patch is fine on top of x86/hyperv.

Thanks,

tglx


RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-06 Thread Thomas Gleixner
On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> On Fri, 6 Jul 2018, KY Srinivasan wrote:
> > > 
> > > The problem is that the wreckage is in Linus tree and needs to be fixed
> > > there, i.e. via x86/urgent.
> > > 
> > > Now we have the new bits queued in x86/hyperv already which collide. So
> > > we
> > > need to merge x86/urgent into x86/hyperv after applying the fix and mop up
> > > the merge wreckage in x86/hyperv.
> > > 
> > > I'll have a look tomorrow morning unless you beat me to it.
> > 
> > I can rebase this patch against the latest tip and resend (tomorrow).
> 
> That doesn't help as we need to push the original fix to Linus ...

Applied it to x86/urgent and fixed up the '-1' sloppy hack as pointed out
by Vitaly and merged x86/urgent into x86/hyperv.

Please check both branches for correctness.

Thanks,

tglx



RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-06 Thread Thomas Gleixner
On Fri, 6 Jul 2018, KY Srinivasan wrote:
> > 
> > The problem is that the wreckage is in Linus tree and needs to be fixed
> > there, i.e. via x86/urgent.
> > 
> > Now we have the new bits queued in x86/hyperv already which collide. So
> > we
> > need to merge x86/urgent into x86/hyperv after applying the fix and mop up
> > the merge wreckage in x86/hyperv.
> > 
> > I'll have a look tomorrow morning unless you beat me to it.
> 
> I can rebase this patch against the latest tip and resend (tomorrow).

That doesn't help as we need to push the original fix to Linus ...


Re: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-05 Thread Thomas Gleixner
On Fri, 6 Jul 2018, Ingo Molnar wrote:
> * KY Srinivasan  wrote:
> > I am confused. The label ipi_mask_done was introduced in this patch
> > (the patch under question fixes a circular dependency in this patch):
> > 
> > commit 68bb7bfb7985df2bd15c2dc975cb68b7a901488a
> > Author: K. Y. Srinivasan 
> > Date:   Wed May 16 14:53:31 2018 -0700
> > 
> > X86/Hyper-V: Enable IPI enlightenments
> > 
> > Hyper-V supports hypercalls to implement IPI; use them.
> > 
> > Signed-off-by: K. Y. Srinivasan 
> > Signed-off-by: Thomas Gleixner 
> > Reviewed-by: Michael Kelley 
> > 
> > This patch was committed by Thomas some weeks ago and is in linux-next.
> > This patch is also in 4.18-rc3.
> 
> And then that name was changed to a different label in:
> 
>   4bd06060762b: x86/hyper-v: Use cheaper HVCALL_SEND_IPI hypercall when possible
> 
> So maybe you were testing on an older kernel. Could you try the latest -tip?

The problem is that the wreckage is in Linus tree and needs to be fixed
there, i.e. via x86/urgent.

Now we have the new bits queued in x86/hyperv already which collide. So we
need to merge x86/urgent into x86/hyperv after applying the fix and mop up
the merge wreckage in x86/hyperv.

I'll have a look tomorrow morning unless you beat me to it.

Thanks,

tglx


Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-19 Thread Thomas Gleixner
On Tue, 19 Jun 2018, Vitaly Kuznetsov wrote:
> Thomas Gleixner  writes:
> 
> > On Fri, 15 Jun 2018, Vitaly Kuznetsov wrote:
> >>   * Fills in gva_list starting from offset. Returns the number of items added.
> >> @@ -93,10 +95,19 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
> >>if (cpumask_equal(cpus, cpu_present_mask)) {
> >>flush->flags |= HV_FLUSH_ALL_PROCESSORS;
> >>} else {
> >> +  /*
> >> +   * It is highly likely that VP ids are in ascending order
> >> +   * matching Linux CPU ids; Check VP index for the highest CPU
> >> +   * in the supplied set to see if EX hypercall is required.
> >> +   * This is just a best guess but should work most of the time.
> >
> > TLB flushing based on 'best guess' and 'should work most of the time' is
> > not a brilliant approach.
> >
> 
> Oh no no no, that's not what I meant :-)
> 
> We have the following problem: from the supplied CPU set we need to
> figure out if we can get away with HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,
> SPACE} hypercalls which are cheaper or if we need to use more expensing
> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE}_EX ones. The dividing line is
> the highest VP_INDEX of the supplied CPU set: in case it is < 64 cheaper
> hypercalls are OK. Now how do we check that? In the patch I have the
> following approach:
> 1) Check VP number for the highest CPU in the supplied set. In case it
> is > 64 we for sure need more expensive hypercalls. This is the "guess"
> which works most of the time because Linux CPU ids usually match
> VP_INDEXes.
> 
> 2) In case the answer to the previous question was negative we start
> preparing input for the cheaper hypercall. However, if while walking the
> CPU set we meet a CPU with VP_INDEX higher than 64 we'll discard the
> prepared input and switch to the more expensive hypercall.
> 
> Said that the 'guess' here is just an optimization to avoid walking the
> whole CPU set when we find the required answer quickly by looking at the
> highest bit. This will help big systems with hundreds of CPUs.

Care to fix the comment to avoid the offending words?

Thanks,

tglx
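The two-step heuristic Vitaly describes above reduces to a single walk that bails out as soon as a VP index of 64 or higher is seen. A hedged sketch, where plain arrays stand in for cpumasks and for hv_cpu_number_to_vp_number(); all names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Try to build the input mask for the cheap (non-EX) hypercall, which can
 * only address VP indices 0..63.  Returns false when a higher VP index is
 * encountered, signalling that the caller must fall back to the more
 * expensive _EX hypercall variant. */
static bool build_cheap_mask(const int *vp_index, const bool *cpu_in_set,
                             size_t nr_cpus, uint64_t *mask_out)
{
        uint64_t mask = 0;

        for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
                if (!cpu_in_set[cpu])
                        continue;
                if (vp_index[cpu] >= 64)
                        return false;   /* needs the EX hypercall */
                mask |= 1ULL << vp_index[cpu];
        }
        *mask_out = mask;
        return true;
}
```

Checking the highest CPU in the set first, as the patch does, is just an optimization that often avoids this walk entirely on large systems.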


Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-19 Thread Thomas Gleixner
On Fri, 15 Jun 2018, Vitaly Kuznetsov wrote:
>   * Fills in gva_list starting from offset. Returns the number of items added.
> @@ -93,10 +95,19 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
>   if (cpumask_equal(cpus, cpu_present_mask)) {
>   flush->flags |= HV_FLUSH_ALL_PROCESSORS;
>   } else {
> + /*
> +  * It is highly likely that VP ids are in ascending order
> +  * matching Linux CPU ids; Check VP index for the highest CPU
> +  * in the supplied set to see if EX hypercall is required.
> +  * This is just a best guess but should work most of the time.

TLB flushing based on 'best guess' and 'should work most of the time' is
not a brilliant approach.

Thanks,

tglx


Re: [PATCH] HV: Fix definition of struct hv_vp_assist_page.

2018-06-06 Thread Thomas Gleixner
On Mon, 21 May 2018, Tianyu Lan wrote:

KY 

> The struct hv_vp_assist_page was defined incorrectly.
> The "vtl_control" should be u64[3], "nested_enlightenments_control"
> should be a u64 and there is 7 reserved bytes following "enlighten_vmentry".
> This patch is to fix it.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index f7be6d03a310..fae0a5431cdd 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -496,10 +496,11 @@ struct hv_timer_message_payload {
>  /* Define virtual processor assist page structure. */
>  struct hv_vp_assist_page {
>   __u32 apic_assist;
> - __u32 reserved;
> - __u64 vtl_control[2];
> - __u64 nested_enlightenments_control[2];
> - __u32 enlighten_vmentry;
> + __u32 reserved1;
> + __u64 vtl_control[3];
> + __u64 nested_enlightenments_control;
> + __u8 enlighten_vmentry;
> + __u8 reserved2[7];
>   __u64 current_nested_vmcs;
>  };
>  
> -- 
> 2.14.3
> 
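The corrected layout in the patch above packs to fixed offsets, which is the whole point of the fix. A stand-alone model using plain C types (the struct name here is hypothetical; field widths follow the changelog's description of the TLFS):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Model of the corrected hv_vp_assist_page layout: vtl_control is u64[3],
 * nested_enlightenments_control is a single u64, and enlighten_vmentry is
 * one byte followed by 7 reserved bytes. */
struct vp_assist_page_model {
        uint32_t apic_assist;
        uint32_t reserved1;
        uint64_t vtl_control[3];
        uint64_t nested_enlightenments_control;
        uint8_t  enlighten_vmentry;
        uint8_t  reserved2[7];
        uint64_t current_nested_vmcs;
};
```

With these widths every field lands on a natural boundary, so the compiler inserts no padding and current_nested_vmcs sits at offset 48.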


RE: [PATCH V2 0/5] X86: Hyper-V: APIC enlightenments

2018-05-16 Thread Thomas Gleixner
On Wed, 16 May 2018, KY Srinivasan wrote:

> 
> 
> > -Original Message-
> > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> > Sent: Thursday, May 3, 2018 11:07 PM
> > To: x...@kernel.org; gre...@linuxfoundation.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; jasow...@redhat.com; t...@linutronix.de;
> > h...@zytor.com; Stephen Hemminger <sthem...@microsoft.com>; Michael
> > Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> > Cc: KY Srinivasan <k...@microsoft.com>
> > Subject: [PATCH V2 0/5] X86: Hyper-V: APIC enlightenments
> > 
> > From: "K. Y. Srinivasan" <k...@microsoft.com>
> > 
> > Implement APIC related enlightenments.
> > 
> > V2: Addressed comments from Thomas Gleixner <t...@linutronix.de>
> > and Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>.
> 
> Thomas,
> 
> I think I have addressed all your comments;
> let me know if you have any additional comments on this patch set.

Looks good now. There is just the build robot fallout which needs to be
addressed AFAICT.

Thanks,

tglx


RE: [PATCH 5/5] X86: Hyper-V: Consolidate the allocation of the hypercall input page

2018-04-27 Thread Thomas Gleixner
On Fri, 27 Apr 2018, KY Srinivasan wrote:

> 
> 
> > -Original Message-
> > From: Thomas Gleixner <t...@linutronix.de>
> > Sent: Thursday, April 26, 2018 3:24 PM
> > To: KY Srinivasan <k...@microsoft.com>
> > Cc: x...@kernel.org; gre...@linuxfoundation.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; jasow...@redhat.com; h...@zytor.com; Stephen
> > Hemminger <sthem...@microsoft.com>; Michael Kelley (EOSG)
> > <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> > Subject: Re: [PATCH 5/5] X86: Hyper-V: Consolidate the allocation of the
> > hypercall input page
> > 
> > On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:
> > 
> > > From: "K. Y. Srinivasan" <k...@microsoft.com>
> > >
> > > Consolidate the allocation of the hypercall input page.
> > 
> > Again. You can provide the new way of allocation first, then you don't have
> > to add code first in order to remove it later again. 
> 
> I have implemented the new way upfront for the new code - the IPI code
> [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments.
> What I am doing here is using that infrastructure for the TLB flush
> enlightenments and getting rid of unnecessary code.

Ok. I maybe misread it, but a bit more elaborate change log might help to
avoid that.

Thanks,

tglx


RE: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access

2018-04-26 Thread Thomas Gleixner
Michael,

On Thu, 26 Apr 2018, Michael Kelley (EOSG) wrote:

> > -Original Message-
> > From: k...@linuxonhyperv.com 
> > Sent: Wednesday, April 25, 2018 11:13 AM
> > To: x...@kernel.org; gre...@linuxfoundation.org; 
> > linux-ker...@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> > jasow...@redhat.com;
> > t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> > ;
> > Michael Kelley (EOSG) ; vkuzn...@redhat.com
> > Cc: KY Srinivasan 
> > Subject: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access

Please fix your MUA not to pointlessly copy the whole mail header.

> > +void __init hv_apic_init(void)
> > +{
> > +   if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> > +   pr_info("Hyper-V: Using MSR ased APIC access\n");
> 
> Typo here.  "ased" should be "based".

And please trim the reply to the relevant point. It's annoying to find that
single line of review comment in the useless pile of quoted patch.

Thanks,

tglx


Re: [PATCH 5/5] X86: Hyper-V: Consolidate the allocation of the hypercall input page

2018-04-26 Thread Thomas Gleixner
On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:

> From: "K. Y. Srinivasan" 
> 
> Consolidate the allocation of the hypercall input page.

Again. You can provide the new way of allocation first, then you don't have
to add code first in order to remove it later again.

Thanks,

tglx


Re: [PATCH 4/5] X86: Hyper-V: Consolidate code for converting cpumask to vpset

2018-04-26 Thread Thomas Gleixner
On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:
>  
>   if (!cpumask_equal(mask, cpu_present_mask)) {
> - ipi_arg->vp_set.format = HV_GENERIC_SET_SPARCE_4K;
> + ipi_arg->vp_set.format = HV_GENERIC_SET_SPARSE_4K;

Please move this patch before the others, so you can use SPARSE in the new
code right away.

Thanks,

tglx


Re: [PATCH 3/5] X86: Hyper-V: Enhanced IPI enlightenment

2018-04-26 Thread Thomas Gleixner
On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:
>  
> +struct ipi_arg_ex {
> + u32 vector;
> + u32  reserved;
> + struct hv_vpset vp_set;

Please align that in tabular fashion for ease of reading

u32 vector;
u32 reserved;
struct hv_vpset vp_set;

> +};
> +
>  static struct apic orig_apic;
>  
>  static u64 hv_apic_icr_read(void)
> @@ -97,6 +103,40 @@ static void hv_apic_eoi_write(u32 reg, u32 val)
>   * IPI implementation on Hyper-V.
>   */
>  
> +static int __send_ipi_mask_ex(const struct cpumask *mask, int vector)
> +{
> + int nr_bank = 0;
> + struct ipi_arg_ex **arg;
> + struct ipi_arg_ex *ipi_arg;
> + int ret = 1;
> + unsigned long flags;

This is really horrible to read.

struct ipi_arg_ex *ipi_arg;
struct ipi_arg_ex **arg;
unsigned long flags;
bool ret = false;
int nr_bank = 0;

is really more convenient for quick reading.

So the other more limited function has a lot more sanity checks vs. vector
number and other things. Why are they not required here? Comment please.

> + local_irq_save(flags);
> + arg = (struct ipi_arg_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + ipi_arg = *arg;
> + if (unlikely(!ipi_arg))
> + goto ipi_mask_ex_done;
> +
> + ipi_arg->vector = vector;
> + ipi_arg->reserved = 0;
> + ipi_arg->vp_set.valid_bank_mask = 0;
> +
> + if (!cpumask_equal(mask, cpu_present_mask)) {
> + ipi_arg->vp_set.format = HV_GENERIC_SET_SPARCE_4K;
> + nr_bank = cpumask_to_vpset(&(ipi_arg->vp_set), mask);

nr_bank really confused me. bank_nr is what you mean, not number of banks,
right?

> + }
> + if (!nr_bank)
> + ipi_arg->vp_set.format = HV_GENERIC_SET_ALL;
> +
> + ret = hv_do_rep_hypercall(HVCALL_SEND_IPI_EX, 0, nr_bank,
> +   ipi_arg, NULL);
> +
> +ipi_mask_ex_done:
> + local_irq_restore(flags);
> + return ret;
> +}

Thanks,

tglx


Re: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments

2018-04-26 Thread Thomas Gleixner
On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:
> +static int __send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + int cur_cpu, vcpu;
> + struct ipi_arg_non_ex **arg;
> + struct ipi_arg_non_ex *ipi_arg;
> + int ret = 1;

So this indicates whether __send_ipi_mask() can send to @mask or not. So
please make it a bool and let it return false when it does not work, true
otherwise. If you had used -E then it would have been more obvious, but
this is really a boolean decision.

> + unsigned long flags;
> +
> + if (cpumask_empty(mask))
> + return 0;
> +
> + if (!hv_hypercall_pg)
> + return ret;
> +
> + if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
> + return ret;
> +
> + local_irq_save(flags);
> + arg = (struct ipi_arg_non_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + ipi_arg = *arg;
> + if (unlikely(!ipi_arg))
> + goto ipi_mask_done;
> +
> +

Stray newline

> + ipi_arg->vector = vector;
> + ipi_arg->reserved = 0;
> + ipi_arg->cpu_mask = 0;
> +
> + for_each_cpu(cur_cpu, mask) {
> + vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> + if (vcpu >= 64)
> + goto ipi_mask_done;

This is completely magic and deserves a comment.

> +
> + __set_bit(vcpu, (unsigned long *)_arg->cpu_mask);
> + }
> +
> + ret = hv_do_hypercall(HVCALL_SEND_IPI, ipi_arg, NULL);
> +
> +ipi_mask_done:
> + local_irq_restore(flags);
> + return ret;
> +}



>  static int hv_cpu_init(unsigned int cpu)
>  {
>   u64 msr_vp_index;
>   struct hv_vp_assist_page **hvp = _vp_assist_page[smp_processor_id()];
> + void **input_arg;
> +
> + input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> + *input_arg = page_address(alloc_page(GFP_ATOMIC));

This is called from the cpu hotplug thread and there is no need for an
atomic allocation. Please use GFP_KERNEL.

>   hv_get_vp_index(msr_vp_index);
>  
> @@ -217,6 +224,10 @@ static int hv_cpu_die(unsigned int cpu)
>  {
>   struct hv_reenlightenment_control re_ctrl;
>   unsigned int new_cpu;
> + void **input_arg;
> +
> + input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> + free_page((unsigned long)*input_arg);

Hrm. Again this is called from the CPU hotplug thread when the CPU is about
to go down. But you can be scheduled out after free() and before disabling
the assist thing below and the pointer persists. There is no guarantee that
nothing sends an IPI anymore after this point.

So you have two options here:

  1) Disable interrupts, get the pointer, set the per cpu pointer to NULL,
     re-enable interrupts and free the page

  2) Keep the page around and check for it in the CPU UP path and avoid the
 allocation when the CPU comes online again.

>   if (hv_vp_assist_page && hv_vp_assist_page[cpu])
>   wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
> @@ -260,6 +271,12 @@ void __init hyperv_init(void)
>   if ((ms_hyperv.features & required_msrs) != required_msrs)
>   return;
>  
> + /* Allocate the per-CPU state for the hypercall input arg */
> + hyperv_pcpu_input_arg = alloc_percpu(void  *);
> +
> + if (hyperv_pcpu_input_arg == NULL)
> + return;

Huch. When that allocation fails, you return and ignore the rest of the
function which has been there before. Weird decision.

Thanks,

tglx
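Option 1 from the review above — detach the per-CPU pointer with interrupts disabled, then free the page — can be sketched in plain C. The local_irq stubs and detach_and_free() are hypothetical stand-ins (real code would use local_irq_save()/local_irq_restore() and free_page()); the point is the ordering:

```c
#include <stdlib.h>
#include <assert.h>

/* No-op stand-ins for local_irq_save()/local_irq_restore(). */
static void local_irq_save_stub(unsigned long *flags) { *flags = 0; }
static void local_irq_restore_stub(unsigned long flags) { (void)flags; }

/* Detach the per-CPU input page pointer before freeing the page, so a
 * late IPI sender observes NULL (and bails out) rather than a dangling
 * pointer to freed memory. */
static void *detach_and_free(void **input_arg)
{
        unsigned long flags;
        void *page;

        local_irq_save_stub(&flags);
        page = *input_arg;
        *input_arg = NULL;      /* senders now see NULL and give up */
        local_irq_restore_stub(flags);

        free(page);             /* models free_page() */
        return *input_arg;      /* NULL after teardown */
}
```

The ordering matters: clearing the pointer happens inside the irq-disabled region, the free only afterwards, so no window exists in which a sender can pick up a pointer to an already-freed page.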


Re: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access

2018-04-26 Thread Thomas Gleixner
On Wed, 25 Apr 2018, k...@linuxonhyperv.com wrote:
> --- /dev/null
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0

Thanks for putting the license identifier in.

> +
> +/*
> + * Hyper-V specific APIC code.
> + *
> + * Copyright (C) 2018, Microsoft, Inc.
> + *
> + * Author : K. Y. Srinivasan 

But can you please check with your lawyers whether you can avoid the
pointless boilerplate? The SPDX identifier should cover it.

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

We usually order the includes

#include <linux/...>
...
#include <linux/...>

#include <asm/...>
#include <asm/...>


> -void hyperv_init(void);
> +void __init hyperv_init(void);
>  void hyperv_setup_mmu_ops(void);
>  void hyper_alloc_mmu(void);
>  void hyperv_report_panic(struct pt_regs *regs, long err);
> @@ -269,14 +269,16 @@ void hyperv_reenlightenment_intr(struct pt_regs *regs);
>  void set_hv_tscchange_cb(void (*cb)(void));
>  void clear_hv_tscchange_cb(void);
>  void hyperv_stop_tsc_emulation(void);
> +void hv_apic_init(void);
>  #else /* CONFIG_HYPERV */
> -static inline void hyperv_init(void) {}
> +static __init  inline void hyperv_init(void) {}

The __init on the empty inline function is pointless.

Other than the few nits. This looks well done!

Thanks,

tglx


Re: [PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-30 Thread Thomas Gleixner
On Tue, 30 Jan 2018, Thomas Gleixner wrote:

> On Tue, 30 Jan 2018, Radim Krčmář wrote:
> > 2018-01-29 22:48+0100, Thomas Gleixner:
> > > On Wed, 24 Jan 2018, Radim Krčmář wrote:
> > > > 2018-01-24 14:23+0100, Vitaly Kuznetsov:
> > > > > Hyper-V reenlightenment interrupts arrive when the VM is migrated, 
> > > > > we're
> > > > > not supposed to see many of them. However, it may be important to know
> > > > > that the event has happened in case we have L2 nested guests.
> > > > > 
> > > > > Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> > > > > Reviewed-by: Thomas Gleixner <t...@linutronix.de>
> > > > > ---
> > > > 
> > > > Thomas,
> > > > 
> > > > I think the expectation is that this series will go through the KVM
> > > > tree.  Would you prefer a topic branch?
> > > 
> > > Is there any dependency outside of plain 4.15? If not, I'll put it into
> > > x86/hyperv and let KVM folks pull it over.
> > 
> > There isn't;  we'll wait for x86/hyperv, thanks.
> 
> Will be there tomorrow. I'll let you know.

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/hyperv

Feel free to pull it into your tree.

Thanks,

tglx


Re: [PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-30 Thread Thomas Gleixner
On Tue, 30 Jan 2018, Radim Krčmář wrote:
> 2018-01-29 22:48+0100, Thomas Gleixner:
> > On Wed, 24 Jan 2018, Radim Krčmář wrote:
> > > 2018-01-24 14:23+0100, Vitaly Kuznetsov:
> > > > Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
> > > > not supposed to see many of them. However, it may be important to know
> > > > that the event has happened in case we have L2 nested guests.
> > > > 
> > > > Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> > > > Reviewed-by: Thomas Gleixner <t...@linutronix.de>
> > > > ---
> > > 
> > > Thomas,
> > > 
> > > I think the expectation is that this series will go through the KVM
> > > tree.  Would you prefer a topic branch?
> > 
> > Is there any dependency outside of plain 4.15? If not, I'll put it into
> > x86/hyperv and let KVM folks pull it over.
> 
> There isn't;  we'll wait for x86/hyperv, thanks.

Will be there tomorrow. I'll let you know.


Re: [PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-29 Thread Thomas Gleixner
On Wed, 24 Jan 2018, Radim Krčmář wrote:

> 2018-01-24 14:23+0100, Vitaly Kuznetsov:
> > Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
> > not supposed to see many of them. However, it may be important to know
> > that the event has happened in case we have L2 nested guests.
> > 
> > Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> > Reviewed-by: Thomas Gleixner <t...@linutronix.de>
> > ---
> 
> Thomas,
> 
> I think the expectation is that this series will go through the KVM
> tree.  Would you prefer a topic branch?

Is there any dependency outside of plain 4.15? If not, I'll put it into
x86/hyperv and let KVM folks pull it over.

Thanks,

tglx


Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-19 Thread Thomas Gleixner
On Fri, 19 Jan 2018, Vitaly Kuznetsov wrote:
> kbuild test robot  writes:
> 
> > Hi Vitaly,
> >
> > Thank you for the patch! Perhaps something to improve:
> >
> > [auto build test WARNING on tip/auto-latest]
> > [also build test WARNING on v4.15-rc8 next-20180118]
> > [cannot apply to tip/x86/core kvm/linux-next]
> > [if your patch is applied to the wrong git tree, please drop us a note to 
> > help improve the system]
> >
> > url:
> > https://github.com/0day-ci/linux/commits/Vitaly-Kuznetsov/x86-kvm-hyperv-stable-clocksorce-for-L2-guests-when-running-nested-KVM-on-Hyper-V/20180119-160814
> > config: x86_64-allmodconfig (attached as .config)
> > compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
> > reproduce:
> > # save the attached .config to linux build tree
> > make ARCH=x86_64 
> >
> > All warnings (new ones prefixed by >>):
> >
> >In file included from include/linux/kasan.h:17:0,
> > from include/linux/slab.h:129,
> > from include/linux/irq.h:26,
> > from arch/x86/include/asm/hardirq.h:6,
> > from include/linux/hardirq.h:9,
> > from include/linux/interrupt.h:13,
> > from arch/x86/include/asm/mshyperv.h:8,
> > from arch/x86//entry/vdso/vdso32/../vclock_gettime.c:20,
> > from arch/x86//entry/vdso/vdso32/vclock_gettime.c:33:
> >arch/x86/include/asm/pgtable.h: In function 'clone_pgd_range':
> >arch/x86/include/asm/pgtable.h:1129:9: error: implicit declaration of 
> > function 'kernel_to_user_pgdp'; did you mean 'u64_to_user_ptr'? 
> > [-Werror=implicit-function-declaration]
> >  memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
> > ^~~
> 
> Sorry but I'm failing to see how this (and all the rest) is related to
> my patch ...

You added '#include <linux/interrupt.h>' to mshyperv.h which is included in
vclock_gettime.c and pulls in other stuff which fails to expand

Thanks,

tglx



Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Vitaly Kuznetsov wrote:
> Hyper-V supports Live Migration notification. This is supposed to be used
> in conjunction with TSC emulation: when we are migrated to a host with
> different TSC frequency for some short period host emulates our accesses
> to TSC and sends us an interrupt to notify about the event. When we're
> done updating everything we can disable TSC emulation and everything will
> start working fast again.
> 
> We didn't need these notifications before as Hyper-V guests are not
> supposed to use TSC as a clocksource: in Linux we even mark it as unstable
> on boot. Guests normally use 'tsc page' clocksource and host updates its
> values on migrations automatically.
> 
> Things change when we want to run nested virtualization: even when we pass
> through PV clocksources (kvm-clock or tsc page) to our guests we need to
> know TSC frequency and when it changes.
> 
> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
> right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
> the fix in on the way.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>
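As an illustration of the feature check described in the commit message above, here is a minimal userspace sketch. The constant name mirrors the kernel's, but this is plain C test code with a fabricated EAX value, not the real CPUID/Hyper-V interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EAX bit 13 of CPUID leaf 0x40000003, per the TLFS erratum noted above */
#define HV_ACCESS_REENLIGHTENMENT (1u << 13)

/* Would the guest be offered reenlightenment notifications? The EAX
 * value is passed in as test data rather than read via CPUID. */
static bool hv_reenlightenment_available(uint32_t cpuid_40000003_eax)
{
	return (cpuid_40000003_eax & HV_ACCESS_REENLIGHTENMENT) != 0;
}
```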


Re: [PATCH v3 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Vitaly Kuznetsov wrote:

> Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
> not supposed to see many of them. However, it may be important to know
> that the event has happened in case we have L2 nested guests.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v3 4/7] x86/hyper-v: redirect reenlightment notifications on CPU offlining

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Vitaly Kuznetsov wrote:

> It is very unlikely for CPUs to get offlined when we run on Hyper-V as
> we have a protection in vmbus module which prevents it when we have any
> VMBus devices assigned. This, however, may change in future if an option
> to reassign an already active channel is added. It is also possible
> to run without any Hyper-V devices or have a CPU with no assigned channels.
> 
> Reassign reenlightenment notifications to some other active CPU when
> the CPU which is assigned to get them goes offline.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v2 4/7] x86/hyper-v: redirect reenlightment notifications on CPU offlining

2018-01-14 Thread Thomas Gleixner
On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:
> +static int hv_cpu_die(unsigned int cpu)
> +{
> + struct hv_reenlightenment_control re_ctrl;
> + int i;
> + static DEFINE_SPINLOCK(lock);
> +
> + if (hv_reenlightenment_cb == NULL)
> + return 0;
> +
> + /* Make sure the CPU we migrate to is not going away too */
> + spin_lock(&lock);

What kind of voodoo is this? CPU hotplug is serialized already...

> + rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> + if (re_ctrl.target_vp == hv_vp_index[cpu]) {
> + /* Find some other online CPU */
> + for_each_online_cpu(i) {


cpu = cpumask_any_but(cpu_online_mask, cpu);

Hmm?

Thanks,

tglx
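The cpumask_any_but() suggestion above can be illustrated with a plain-C sketch over an integer bitmask. This is userspace code only mirroring the semantics; the kernel API works on struct cpumask, not an unsigned integer:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64

/* Return any set bit in @online_mask other than @excluded, or -1 if the
 * excluded CPU was the only one online -- mirroring cpumask_any_but(). */
static int any_online_but(uint64_t online_mask, int excluded)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == excluded)
			continue;
		if (online_mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```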


Re: [PATCH v2 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-14 Thread Thomas Gleixner
On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:
> +void hyperv_reenlightenment_intr(struct pt_regs *regs)

Lacks __visible and __irq_entry annotations.

Thanks,

tglx


Re: [PATCH v2 2/7] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously

2018-01-14 Thread Thomas Gleixner
On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:

> This is going to be used from KVM code where we need to get both
> TSC and TSC page value.
> 
> When Hyper-V code is compiled out just return rdtsc(), this will allow us
> to avoid ugly ifdefs in non-Hyper-V code.

That's not what the patch implements

> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page 
> *tsc_pg,
> +u64 *cur_tsc)
> +{
> + BUG();
> + return U64_MAX;
> +}

Thanks,

tglx


Re: [PATCH v2 1/7] x86/hyper-v: check for required priviliges in hyperv_init()

2018-01-14 Thread Thomas Gleixner


On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:

> In hyperv_init() we presume we always have access to VP index and hypercall
> MSRs while according to the specification we should check if we're allowed
> to access the corresponding MSRs before accessing them.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>

> ---
>  arch/x86/hyperv/hv_init.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 189a398290db..21f9d53d9f00 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -110,12 +110,19 @@ static int hv_cpu_init(unsigned int cpu)
>   */
>  void hyperv_init(void)
>  {
> - u64 guest_id;
> + u64 guest_id, required_msrs;
>   union hv_x64_msr_hypercall_contents hypercall_msr;
>  
>   if (x86_hyper_type != X86_HYPER_MS_HYPERV)
>   return;
>  
> + /* Absolutely required MSRs */
> + required_msrs = HV_X64_MSR_HYPERCALL_AVAILABLE |
> + HV_X64_MSR_VP_INDEX_AVAILABLE;
> +
> + if ((ms_hyperv.features & required_msrs) != required_msrs)
> + return;
> +
>   /* Allocate percpu VP index */
>   hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
>   GFP_KERNEL);
> -- 
> 2.14.3
> 
> 


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Thomas Gleixner
On Fri, 29 Dec 2017, Alexandru Chirvasitu wrote:
> All right, I tried to do some more digging around, in the hope of
> getting as close to the source of the problem as I can.
> 
> I went back to the very first commit that went astray for me, 2db1f95
> (which is the only one actually panicking), and tried to move from its
> parent 90ad9e2 (that boots fine) to it gradually, altering the code in
> small chunks.
> 
> I tried to ignore the stuff that clearly shouldn't make a difference,
> such as definitions. So in the end I get defined-but-unused-function
> errors in my compilations, but I'm ignoring those for now. Some
> results:
> 
> (1) When I move from the good commit 90ad9e2 according to the attached
> bad-diff (which moves partly towards 2db1f95), I get a panic.
> 
> (2) On the other hand, when I further change this last panicking
> commit by simply doing
> 
> 
> 
> removed activate / deactivate from x86_vector_domain_ops
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7317ba5a..063594d 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -514,8 +514,6 @@ void x86_vector_debug_show(struct seq_file *m, struct 
> irq_domain *d,
>  static const struct irq_domain_ops x86_vector_domain_ops = {
> .alloc  = x86_vector_alloc_irqs,
> .free   = x86_vector_free_irqs,
> -   .activate   = x86_vector_activate,
> -   .deactivate = x86_vector_deactivate,
>  #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> .debug_show = x86_vector_debug_show,
>  #endif
> 
> 
> all is well. 

Nice detective work. Unfortunately that's not a real solution ...

Can you try the patch below on top of Linus tree, please?

Thanks,

tglx

8<-
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -339,6 +339,40 @@ int msi_domain_populate_irqs(struct irq_
return ret;
 }
 
+/*
+ * Carefully check whether the device can use reservation mode. If
+ * reservation mode is enabled then the early activation will assign a
+ * dummy vector to the device. If the PCI/MSI device does not support
+ * masking of the entry then this can result in spurious interrupts when
+ * the device driver is not absolutely careful. But even then a malfunction
+ * of the hardware could result in a spurious interrupt on the dummy vector
+ * and render the device unusable. If the entry can be masked then the core
+ * logic will prevent the spurious interrupt and reservation mode can be
+ * used. For now reservation mode is restricted to PCI/MSI.
+ */
+static bool msi_check_reservation_mode(struct irq_domain *domain,
+  struct msi_domain_info *info,
+  struct device *dev)
+{
+   struct msi_desc *desc;
+
+   if (domain->bus_token != DOMAIN_BUS_PCI_MSI)
+   return false;
+
+   if (!(info->flags & MSI_FLAG_MUST_REACTIVATE))
+   return false;
+
+   if (IS_ENABLED(CONFIG_PCI_MSI) && pci_msi_ignore_mask)
+   return false;
+
+   /*
+* Checking the first MSI descriptor is sufficient. MSIX supports
+* masking and MSI does so when the maskbit is set.
+*/
+   desc = first_msi_entry(dev);
+   return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
+}
+
 /**
  * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
  * @domain:The domain to allocate from
@@ -353,9 +387,11 @@ int msi_domain_alloc_irqs(struct irq_dom
 {
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
-   msi_alloc_info_t arg;
+   struct irq_data *irq_data;
struct msi_desc *desc;
+   msi_alloc_info_t arg;
int i, ret, virq;
+   bool can_reserve;
 
	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
if (ret)
@@ -385,6 +421,8 @@ int msi_domain_alloc_irqs(struct irq_dom
if (ops->msi_finish)
		ops->msi_finish(&arg, 0);
 
+   can_reserve = msi_check_reservation_mode(domain, info, dev);
+
for_each_msi_entry(desc, dev) {
virq = desc->irq;
if (desc->nvec_used == 1)
@@ -397,17 +435,28 @@ int msi_domain_alloc_irqs(struct irq_dom
 * the MSI entries before the PCI layer enables MSI in the
 * card. Otherwise the card latches a random msi message.
 */
-   if (info->flags & MSI_FLAG_ACTIVATE_EARLY) {
-   struct irq_data *irq_data;
+   if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY))
+   continue;
 
+   irq_data = irq_domain_get_irq_data(domain, desc->irq);
+   if (!can_reserve)
+   irqd_clr_can_reserve(irq_data);
+   ret = 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Fri, Dec 29, 2017 at 12:36:37AM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > 
> > > Attached, but heads up on this: when redirecting the output of lspci
> > > -vvv to a text file as root I get
> > > 
> > > pcilib: sysfs_read_vpd: read failed: Input/output error
> > > 
> > > I can find bugs filed for various distros to this same effect, but
> > > haven't tracked down any explanations.
> > 
> > Weird, but the info looks complete.
> > 
> > Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
> > whether that works?
> 
> It does (emailing from that successful boot as we speak). I'm on a
> clean 4.15-rc5 (as in no patches, etc.). 
> 
> This was also suggested way at the top of this thread by Dexuan Cui
> for 4.15-rc3 (where this exchange started), and it worked back then
> too.

I missed that part of the conversation. Let me stare into the MSI code
again.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> Attached, but heads up on this: when redirecting the output of lspci
> -vvv to a text file as root I get
> 
> pcilib: sysfs_read_vpd: read failed: Input/output error
> 
> I can find bugs filed for various distros to this same effect, but
> haven't tracked down any explanations.

Weird, but the info looks complete.

Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
whether that works?

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Thomas Gleixner wrote:

> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> 
> > Attached.
> > 
> > I don't have a 4.14 family kernel available at the moment on that
> > machine. What I'm attaching comes from the 4.13 one I was playing with
> > yesterday, what with kexec and all.
> 
> Good enough. Thanks !

Spoke too fast. Could you please run that command as root?

And while at it please provide the output of

cat /proc/interrupts

as well.

Thanks,

tglx



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> Attached.
> 
> I don't have a 4.14 family kernel available at the moment on that
> machine. What I'm attaching comes from the 4.13 one I was playing with
> yesterday, what with kexec and all.

Good enough. Thanks !


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> 
> > No; it seems to be tied to this specific issue, and I was seeing even
> > before getting logs just now, whenever I'd start one of the bad
> > kernels in recovery mode.
> > 
> > But no, I've never seen that in any other logs, or on any other
> > screens outside of those popping up in relation to this problem.
> 
> Ok. I'll dig into it and we have a 100% reproducer reported by someone else
> now, which might give us simpler insight into that issue. I'll let you know
> once I have something to test. Might be a couple of days though.

Can you please provide the output of

lspci -vvv

from a working kernel, preferably 4.14.y

Thanks,

tglx


RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Dexuan Cui wrote:

> > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > Sent: Thursday, December 28, 2017 03:03
> > > > On Wed, Dec 20, 2017 at 02:12:05AM +, Dexuan Cui wrote:
> > 
> > > > For Linux VM running on Hyper-V, we did get "spurious APIC interrupt
> > > > through vector ff" and a patchset, which included the patch you identified
> > > > ("genirq: Add config option for reservation mode"), was made to fix the
> > > > issue. But since you're using a physical machine rather than a VM, I
> > > > suspect it should be a different issue.
> > 
> > Aaargh! Why was this never reported and where is that magic patchset?
> > tglx
> 
> Hi Thomas,
> The Hyper-V specific issue was reported and you made a patchset to fix it:
> https://patchwork.kernel.org/patch/10006171/
> https://lkml.org/lkml/2017/10/17/120

Ah, ok. I did not make the connection. I'll have a scan through the tree to
figure out whether there is some other weird place which is missing that.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> No; it seems to be tied to this specific issue, and I was seeing even
> before getting logs just now, whenever I'd start one of the bad
> kernels in recovery mode.
> 
> But no, I've never seen that in any other logs, or on any other
> screens outside of those popping up in relation to this problem.

Ok. I'll dig into it and we have a 100% reproducer reported by someone else
now, which might give us simpler insight into that issue. I'll let you know
once I have something to test. Might be a couple of days though.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 05:10:28PM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > Actually, it decided to cooperate for just long enough for me to get
> > > the dmesg out. Attached.
> > > 
> > > This is from the kernel you asked about: Dou's patch + yours, i.e. the
> > > latest one in that git log I just sent, booted up with 'apic=debug'.
> > 
> > Ok. As I suspected that warning does not trigger. I would have been
> > massively surprised if that happened. So Dou's patch is just a red herring
> > and just might change the timing enough to make the problem 'hide'.
> > 
> > Can you try something completely different please?
> > 
> > Just use plain Linus tree without any additional patches on top and disable
> > CONFIG_NO_HZ_IDLE, i.e. select CONFIG_HZ_PERIODIC.
> > 
> > If that works, then reenable it and add 'nohz=off' to the kernel command
> > line.
> >
> 
> No go here I'm afraid:
> 
> Linus' clean 4.15-rc5 compiled with CONFIG_HZ_PERIODIC exhibits the
> familiar behaviour: lockups, sometimes instant upon trying to log in,
> sometimes logging me in and freaking out seconds later.

Ok. So it's not the issue I had in mind. 

Back to some of the interesting bits in the logs:

[   36.017942] spurious APIC interrupt through vector ff on CPU#0, should never 
happen.

Does that message ever show up in 4.14 or 4.9?

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> Actually, it decided to cooperate for just long enough for me to get
> the dmesg out. Attached.
> 
> This is from the kernel you asked about: Dou's patch + yours, i.e. the
> latest one in that git log I just sent, booted up with 'apic=debug'.

Ok. As I suspected that warning does not trigger. I would have been
massively surprised if that happened. So Dou's patch is just a red herring
and just might change the timing enough to make the problem 'hide'.

Can you try something completely different please?

Just use plain Linus tree without any additional patches on top and disable
CONFIG_NO_HZ_IDLE, i.e. select CONFIG_HZ_PERIODIC.

If that works, then reenable it and add 'nohz=off' to the kernel command
line.

Thanks,

tglx



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > Ok, lets take a step back. The bisect/kexec attempts led us away from the
> > initial problem which is the machine locking up after login, right?
> >
> 
> Yes; sorry about that..

Nothing to be sorry about.

> x86/vector: Replace the raw_spin_lock() with
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7504491..e5bab02 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
>  const struct cpumask *dest, bool force)
>  {
> struct apic_chip_data *apicd = apic_chip_data(irqd);
> +   unsigned long flags;
> int err;
>  
> /*
> @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
> (apicd->is_managed || apicd->can_reserve))
> return IRQ_SET_MASK_OK;
>  
> -   raw_spin_lock(&vector_lock);
> +   raw_spin_lock_irqsave(&vector_lock, flags);
> cpumask_and(vector_searchmask, dest, cpu_online_mask);
> if (irqd_affinity_is_managed(irqd))
> err = assign_managed_vector(irqd, vector_searchmask);
> else
> err = assign_vector_locked(irqd, vector_searchmask);
> -   raw_spin_unlock(&vector_lock);
> +   raw_spin_unlock_irqrestore(&vector_lock, flags);
> return err ? err : IRQ_SET_MASK_OK;
>  }
> 
> With this, I still get the lockup messages after login, but not the
> freezes!

That's really interesting. There should be no code path which calls into
that with interrupts enabled. I assume you never ran that kernel with
CONFIG_PROVE_LOCKING=y.

Find below a debug patch which should show us the call chain for that
case. Please apply that on top of Dou's patch so the machine stays
accessible. Plain output from dmesg is sufficient.

> The lockups register in the log, which I am attaching (see below for
> attachment naming conventions).

Hmm. That's RCU lockups and that backtrace on the CPU which gets the stall
looks very familiar. I'd like to see the above result first and then I'll
send you another pile of patches which might cure that RCU issue.

Thanks,

tglx

8<---
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
unsigned long flags;
int err;
 
+   WARN_ON_ONCE(!irqs_disabled());
+
/*
 * Core code can call here for inactive interrupts. For inactive
 * interrupts which use managed or reservation mode there is no






Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Wed, 20 Dec 2017, Alexandru Chirvasitu wrote:
> Merging the contents of another exchange spawned from the original

> > On Wed, Dec 20, 2017 at 02:12:05AM +, Dexuan Cui wrote:

> > For Linux VM running on Hyper-V, we did get "spurious APIC interrupt
> > through vector ff" and a patchset, which included the patch you identified
> > ("genirq: Add config option for reservation mode"), was made to fix the
> > issue. But since you're using a physical machine rather than a VM, I
> > suspect it should be a different issue. 

Aaargh! Why was this never reported and where is that magic patchset?

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Wed, 20 Dec 2017, Alexandru Chirvasitu wrote:
> On Wed, Dec 20, 2017 at 11:58:57AM +0800, Dou Liyang wrote:
> > At 12/20/2017 08:31 AM, Thomas Gleixner wrote:
> > > > I had never heard of 'bisect' before this casual mention (you might tell
> > > > I am a bit out of my depth). I've since applied it to Linus' tree 
> > > > between
> > > 
> > > > bebc608 Linux 4.14 (good)
> > > > 
> > > > and
> > > > 
> > > > 4fbd8d1 Linux 4.15-rc1 (bad)
> > > 
> > > Is Linus current head 4.15-rc4 bad as well?
> > > 
> > [...]
> 
> Yes. Exactly the same symptoms on
> 
> 1291a0d5 Linux 4.15-rc4
> 
> compiled just now from Linus' tree. 

Ok, lets take a step back. The bisect/kexec attempts led us away from the
initial problem which is the machine locking up after login, right?

Could you try the patch below on top of Linus tree (rc5+)?

Thanks,

tglx

8<---
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -151,7 +151,7 @@ static struct apic apic_flat __ro_after_
.apic_id_valid  = default_apic_id_valid,
.apic_id_registered = flat_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
.irq_dest_mode  = 1, /* logical */
 
.disable_esr= 0,
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -105,7 +105,7 @@ static struct apic apic_default __ro_aft
.apic_id_valid  = default_apic_id_valid,
.apic_id_registered = default_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
/* logical delivery broadcast to all CPUs: */
.irq_dest_mode  = 1,
 
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -184,7 +184,7 @@ static struct apic apic_x2apic_cluster _
.apic_id_valid  = x2apic_apic_id_valid,
.apic_id_registered = x2apic_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
.irq_dest_mode  = 1, /* logical */
 
.disable_esr= 0,



RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-23 Thread Thomas Gleixner
On Sat, 23 Dec 2017, Dexuan Cui wrote:

> > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > Sent: Friday, December 22, 2017 14:29
> > 
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> > 
> > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
> 
> The BUG_ON panic happens at line 147:
>BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 
> I'm sure Thomas and Dou know it better than me. 

I'll have a look after the holidays.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-19 Thread Thomas Gleixner
On Tue, 19 Dec 2017, Alexandru Chirvasitu wrote:

> I had never heard of 'bisect' before this casual mention (you might tell
> I am a bit out of my depth). I've since applied it to Linus' tree between

> bebc608 Linux 4.14 (good)
> 
> and
> 
> 4fbd8d1 Linux 4.15-rc1 (bad)

Is Linus current head 4.15-rc4 bad as well?

> It took about 13 attempts (I had access to a faster machine to compile
> on, and ccache helped once the cache built up some momentum). The result
> is (as presented by 'git bisect' at the end of the process, between the
> --- dividers added by me for clarity):

> --- start of output ---
> 
> 2b5175c4fa974b6aa05bbd2ee8d443a8036a1714 is the first bad commit
> commit 2b5175c4fa974b6aa05bbd2ee8d443a8036a1714
> Author: Thomas Gleixner <t...@linutronix.de>
> Date:   Tue Oct 17 09:54:57 2017 +0200
> 
> genirq: Add config option for reservation mode
> 
> The interrupt reservation mode requires reactivation of PCI/MSI
> interrupts. Create a config option, so the PCI code can set the
> corresponding flag when required.
> 
> Signed-off-by: Thomas Gleixner <t...@linutronix.de>
> Cc: Josh Poulson <jopou...@microsoft.com>
> Cc: Mihai Costache <v-mi...@microsoft.com>
> Cc: Stephen Hemminger <sthem...@microsoft.com>
> Cc: Marc Zyngier <marc.zyng...@arm.com>
> Cc: linux-...@vger.kernel.org
> Cc: Haiyang Zhang <haiya...@microsoft.com>
> Cc: Dexuan Cui <de...@microsoft.com>
> Cc: Simon Xiao <six...@microsoft.com>
> Cc: Saeed Mahameed <sae...@mellanox.com>
> Cc: Jork Loeser <jork.loe...@microsoft.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: de...@linuxdriverproject.org
> Cc: KY Srinivasan <k...@microsoft.com>
> Link: https://lkml.kernel.org/r/20171017075600.369375...@linutronix.de
> 
> :040000 040000 5e73031cc0c8411a20722cce7876ab7b82ed3858 
> dcf98e7a6b7d5f7c5353b7ccab02125e6d332ec8 M  kernel
> 
> --- end of output ---
> 
> Consequently, I am cc-ing in the listed addresses.

Thanks for doing that bisect, but unfortunately this commit cannot be the
problematic one. It merely adds a config symbol, but it does not change any
code at all. It has no effect whatsoever. So something might have gone
wrong in your bisecting.

I CC'ed Dou Liyang. He has changed the early APIC setup code and there has
been an issue reported already. Though I lost track of that. Dou, any
pointers?

Thanks,

tglx




Re: [PATCH 0/9] posix_clocks: Prepare syscalls for 64 bit time_t conversion

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Arnd Bergmann wrote:
> On Fri, Nov 17, 2017 at 10:54 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Fri, 17 Nov 2017, Arnd Bergmann wrote:
> >> On Fri, Nov 17, 2017 at 9:58 AM, Thomas Gleixner <t...@linutronix.de> 
> >> wrote:
> >>
> >> No, a syscall that existing 32-bit user space enters would be handled by
> >> compat_sys_nanosleep() on both 32-bit and 64-bit kernels at that
> >> point. The idea here is to make the code path more uniform between
> >> 32-bit and 64-bit kernels.
> >
> > So on a 32bit system compat_sys_nanosleep() would be the legacy
> > sys_nanosleep() with the existing syscall number, but you don't want to
> > introduce a new sys_nanosleep64() for 32bit. That makes a lot of sense.
> >
> > So back to your original question whether to use #if (MAGIC logic) or a
> > separate config symbol. Please use the latter, these magic logic constructs
> > are harder to read and prone to get wrong at some point. Having the
> > decision logic in one place is always the right thing to do.
> 
> How about this:
> 
> config LEGACY_TIME_SYSCALLS
>   def_bool 64BIT || !64BIT_TIME
>   help
> This controls the compilation of the following system calls:
> time, stime,
> gettimeofday, settimeofday, adjtimex, nanosleep, alarm, getitimer,
> setitimer, select, utime, utimes, futimesat, and
> {old,new}{l,f,}stat{,64}.
> These all pass 32-bit time_t arguments on 32-bit architectures and
> are replaced by other interfaces (e.g. posix timers and clocks, 
> statx).
> C libraries implementing 64-bit time_t in 32-bit architectures have to
> implement the handles by wrapping around the newer interfaces.

s/handles/handling/ 

> New architectures should not explicitly disable this.

New architectures should never enable this, right?

Thanks,

tglx


Re: [PATCH 0/9] posix_clocks: Prepare syscalls for 64 bit time_t conversion

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Arnd Bergmann wrote:
> On Fri, Nov 17, 2017 at 9:58 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Fri, 17 Nov 2017, Arnd Bergmann wrote:
> >> On Thu, Nov 16, 2017 at 10:04 AM, Thomas Gleixner <t...@linutronix.de> 
> >> wrote:
> >> > On Wed, 15 Nov 2017, Deepa Dinamani wrote:
> >> >> Would this work for everyone?
> >> >
> >> > Having extra config switches which are selectable by architectures and
> >> > removed when everything is converted is definitely the right way to go.
> >> >
> >> > That allows you to gradually convert stuff w/o inflicting wreckage all 
> >> > over
> >> > the place.
> >>
> >> The CONFIG_64BIT_TIME would do that nicely for the new stuff like
> >> the conditional definition of __kernel_timespec, this one would get
> >> removed after we convert all architectures.
> >>
> >> A second issue is how to control the compilation of the compat syscalls.
> >> CONFIG_COMPAT_32BIT_TIME handles that and could be defined
> >> in Kconfig as 'def_bool (!64BIT && CONFIG_64BIT_TIME) || COMPAT',
> >> this is then just a more readable way of expressing exactly when the
> >> functions should be built.
> >>
> >> For completeness, there may be a third category, depending on how
> >> we handle things like sys_nanosleep(): Here, we want the native
> >> sys_nanosleep on 64-bit architectures, and compat_sys_nanosleep()
> >> to handle the 32-bit time_t variant on both 32-bit and 64-bit targets,
> >> but our plan is to not have a native 32-bit sys_nanosleep on 32-bit
> >> architectures any more, as new glibc should call clock_nanosleep()
> >> with a new syscall number instead. Should we then enclose
> >
> > Isn't that going to break existing userspace?
> 
> No, syscall that existing 32-bit user space enters would be handled by
> compat_sys_nanosleep() on both 32-bit and 64-bit kernels at that
> point. The idea here is to make the code path more uniform between
> 32-bit and 64-bit kernels.

So on a 32bit system compat_sys_nanosleep() would be the legacy
sys_nanosleep() with the existing syscall number, but you don't want to
introduce a new sys_nanosleep64() for 32bit. That makes a lot of sense.

So back to your original question whether to use #if (MAGIC logic) or a
separate config symbol. Please use the latter, these magic logic constructs
are harder to read and prone to get wrong at some point. Having the
decision logic in one place is always the right thing to do.
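As a sketch of that, the expression quoted above could be written once as a symbol; C code then only tests the symbol instead of repeating the logic. Note that inside Kconfig expressions, symbols are referenced without the CONFIG_ prefix; the exact expression is Arnd's proposal above, not merged code.

```kconfig
# Sketch: the whole condition lives in one place.
config COMPAT_32BIT_TIME
	def_bool (!64BIT && 64BIT_TIME) || COMPAT
```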

Thanks,

tglx


Re: [PATCH 0/9] posix_clocks: Prepare syscalls for 64 bit time_t conversion

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Arnd Bergmann wrote:
> On Thu, Nov 16, 2017 at 10:04 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Wed, 15 Nov 2017, Deepa Dinamani wrote:
> >> Would this work for everyone?
> >
> > Having extra config switches which are selectable by architectures and
> > removed when everything is converted is definitely the right way to go.
> >
> > That allows you to gradually convert stuff w/o inflicting wreckage all over
> > the place.
> 
> The CONFIG_64BIT_TIME would do that nicely for the new stuff like
> the conditional definition of __kernel_timespec, this one would get
> removed after we convert all architectures.
> 
> A second issue is how to control the compilation of the compat syscalls.
> CONFIG_COMPAT_32BIT_TIME handles that and could be defined
> in Kconfig as 'def_bool (!64BIT && CONFIG_64BIT_TIME) || COMPAT',
> this is then just a more readable way of expressing exactly when the
> functions should be built.
> 
> For completeness, there may be a third category, depending on how
> we handle things like sys_nanosleep(): Here, we want the native
> sys_nanosleep on 64-bit architectures, and compat_sys_nanosleep()
> to handle the 32-bit time_t variant on both 32-bit and 64-bit targets,
> but our plan is to not have a native 32-bit sys_nanosleep on 32-bit
> architectures any more, as new glibc should call clock_nanosleep()
> with a new syscall number instead. Should we then enclose

Isn't that going to break existing userspace?

Thanks,

tglx