Re: [PATCH V2] x86/Hyper-V: don't allocate clockevent device when synthetic timer is unavailable

2020-03-31 Thread Vitaly Kuznetsov
ltyker...@gmail.com writes:

> From: Tianyu Lan 
>
> Current code initializes the clock event data structure for the synthetic
> timer even when it's unavailable. Fix it.
>
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
>   Update title and commit log. 
>
>  drivers/hv/hv.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 632d25674e7f..2e893768fc76 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -212,13 +212,16 @@ int hv_synic_alloc(void)
>   tasklet_init(&hv_cpu->msg_dpc,
>        vmbus_on_msg_dpc, (unsigned long) hv_cpu);
>  
> - hv_cpu->clk_evt = kzalloc(sizeof(struct clock_event_device),
> -   GFP_KERNEL);
> - if (hv_cpu->clk_evt == NULL) {
> - pr_err("Unable to allocate clock event device\n");
> - goto err;
> + if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> + hv_cpu->clk_evt =
> + kzalloc(sizeof(struct clock_event_device),
> +   GFP_KERNEL);
> + if (hv_cpu->clk_evt == NULL) {
> + pr_err("Unable to allocate clock event device\n");
> + goto err;
> + }
> + hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
>   }
> - hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
>  
>   hv_cpu->synic_message_page =
>   (void *)get_zeroed_page(GFP_ATOMIC);

Thank you for fixing the subject! I had one more question on the
previous version which still stands: which tree is this patch for?
Upstream, clockevent allocation has moved to
drivers/clocksource/hyperv_timer.c and the code there is different.

Is this intended for some stable tree?

-- 
Vitaly

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [Update PATCH] x86/Hyper-V: Initialize Syn timer clock when it's

2020-03-30 Thread Vitaly Kuznetsov
ltyker...@gmail.com writes:

> From: Tianyu Lan 
>
> Current code initializes the clock event data structure for the synthetic
> timer even when it's not available. Fix it.
>
> Signed-off-by: Tianyu Lan 
> ---
> - Fix the wrong title.

The new one is ... weird too :-)

I think it was supposed to be something like "x86/Hyper-V: don't
allocate clockevent device when synthetic timer is unavailable"

>  
>  drivers/hv/hv.c | 15 +--

Which tree is this patch for? Upstream clockevent allocation has moved
to drivers/clocksource/hyperv_timer.c 

>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 632d25674e7f..2e893768fc76 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -212,13 +212,16 @@ int hv_synic_alloc(void)
>   tasklet_init(&hv_cpu->msg_dpc,
>        vmbus_on_msg_dpc, (unsigned long) hv_cpu);
>  
> - hv_cpu->clk_evt = kzalloc(sizeof(struct clock_event_device),
> -   GFP_KERNEL);
> - if (hv_cpu->clk_evt == NULL) {
> - pr_err("Unable to allocate clock event device\n");
> - goto err;
> + if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> + hv_cpu->clk_evt =
> + kzalloc(sizeof(struct clock_event_device),
> +   GFP_KERNEL);
> + if (hv_cpu->clk_evt == NULL) {
> + pr_err("Unable to allocate clock event device\n");
> + goto err;
> + }
> + hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
>   }
> - hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
>  
>   hv_cpu->synic_message_page =
>   (void *)get_zeroed_page(GFP_ATOMIC);

-- 
Vitaly



Re: general protection fault in __apic_accept_irq

2019-09-05 Thread Vitaly Kuznetsov
Wanpeng Li  writes:

> On Thu, 5 Sep 2019 at 16:53, syzbot
>  wrote:
>>
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
>> git tree:   upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=124af12a60
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
>> dashboard link: https://syzkaller.appspot.com/bug?extid=dff25ee91f0c7d5c1695
>> compiler:   clang version 9.0.0 (/home/glider/llvm/clang
>> 80fee25776c2fb61e74c1ecb1a523375c2500b69)
>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1095467660
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1752fe0a60
>>
>> The bug was bisected to:
>>
>> commit 0aa67255f54df192d29aec7ac6abb1249d45bda7
>> Author: Vitaly Kuznetsov 
>> Date:   Mon Nov 26 15:47:29 2018 +
>>
>>  x86/hyper-v: move synic/stimer control structures definitions to
>> hyperv-tlfs.h
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=156128c160
>> console output: https://syzkaller.appspot.com/x/log.txt?x=136128c160
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+dff25ee91f0c7d5c1...@syzkaller.appspotmail.com
>> Fixes: 0aa67255f54d ("x86/hyper-v: move synic/stimer control structures
>> definitions to hyperv-tlfs.h")
>>
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4004 data
>> 0x94
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4004 data
>> 0x48c
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4004 data
>> 0x4ac
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4005 data
>> 0x1520
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4006 data
>> 0x15d4
>> kvm [9347]: vcpu0, guest rIP: 0xcc Hyper-V uhandled wrmsr: 0x4007 data
>> 0x15c4
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault:  [#1] PREEMPT SMP KASAN
>> CPU: 0 PID: 9347 Comm: syz-executor665 Not tainted 5.3.0-rc7+ #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029
>
> Thanks for the report, I found the root cause, will send a patch soon.
>

I'm really interested in how any issue can be caused by 0aa67255f54d, as
we just moved some definitions from a C file to a common header... (OK,
we did more than that - some structures gained '__packed' - but it all
still seems legitimate to me, and I can't recall any problems with
genuine Hyper-V...)

-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Vitaly Kuznetsov
Peter Zijlstra  writes:

>
> I know you probably can't change the HV interface, but I'm thinking its
> rather daft you have to specify a CPU at all for this. The HV can just
> pick one and send the notification there, who cares.

Generally speaking, the hypervisor can't know whether a CPU is offline (or
e.g. 'isolated') from the guest's perspective, so I think having an option
to specify the affinity for the reenlightenment notification is a good
thing, not a bad one.

(Actually, I don't remember if I tried specifying 'HV_ANY' (U32_MAX-1)
here to see what happens. But then I doubt it would notice the fact that we
offlined some CPU, so we might get a totally unexpected IRQ there.)

-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Vitaly Kuznetsov
Dmitry Safonov  writes:

> On 6/14/19 11:08 AM, Vitaly Kuznetsov wrote:
>> Peter Zijlstra  writes:
>> 
>>> @@ -182,7 +182,7 @@ void set_hv_tscchange_cb(void (*cb)(void))
>>> struct hv_reenlightenment_control re_ctrl = {
>>> .vector = HYPERV_REENLIGHTENMENT_VECTOR,
>>> .enabled = 1,
>>> -   .target_vp = hv_vp_index[smp_processor_id()]
>>> +   .target_vp = hv_vp_index[raw_smp_processor_id()]
>>> };
>>> struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>>>  
>> 
>> Yes, this should do, thanks! I'd also suggest to leave a comment like
>>  /*
>>   * This function can get preempted and migrate to a different CPU
>>   * but this doesn't matter. We just need to assign
>>   * reenlightenment notification to some online CPU. In case this
>>   * CPU goes offline, hv_cpu_die() will re-assign it to some
>>   * other online CPU.
>>   */
>
> What if the cpu goes down just before wrmsrl()?
> I mean, hv_cpu_die() will reassign another cpu, but this thread will be
> resumed on some other cpu and will write cpu number which is at that
> moment already down?
>

Right you are, we need to guarantee wrmsr() happens before the CPU gets
a chance to go offline: we don't save the cpu number anywhere, we just
read it with rdmsr() in hv_cpu_die().

>
> And I presume it's guaranteed that during hv_cpu_die() no other cpu may
> go down:
> : new_cpu = cpumask_any_but(cpu_online_mask, cpu);
> : re_ctrl.target_vp = hv_vp_index[new_cpu];
> : wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));

I *think* I got convinced that CPUs don't go offline simultaneously when
I was writing this.

-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Vitaly Kuznetsov
Peter Zijlstra  writes:

> @@ -182,7 +182,7 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   struct hv_reenlightenment_control re_ctrl = {
>   .vector = HYPERV_REENLIGHTENMENT_VECTOR,
>   .enabled = 1,
> - .target_vp = hv_vp_index[smp_processor_id()]
> + .target_vp = hv_vp_index[raw_smp_processor_id()]
>   };
>   struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>  

Yes, this should do, thanks! I'd also suggest to leave a comment like
/*
 * This function can get preempted and migrate to a different CPU
 * but this doesn't matter. We just need to assign
 * reenlightenment notification to some online CPU. In case this
 * CPU goes offline, hv_cpu_die() will re-assign it to some
 * other online CPU.
 */
-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Vitaly Kuznetsov
Thomas Gleixner  writes:

> On Wed, 12 Jun 2019, Vitaly Kuznetsov wrote:
>> Dmitry Safonov  writes:
>> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> > index 1608050e9df9..0bdd79ecbff8 100644
>> > --- a/arch/x86/hyperv/hv_init.c
>> > +++ b/arch/x86/hyperv/hv_init.c
>> > @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
>> >  static int hv_cpu_init(unsigned int cpu)
>> >  {
>> >u64 msr_vp_index;
>> > -  struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> > +  struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>> >void **input_arg;
>> >struct page *pg;
>> >  
>> > @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
>> >  
>> >hv_get_vp_index(msr_vp_index);
>> >  
>> > -  hv_vp_index[smp_processor_id()] = msr_vp_index;
>> > +  hv_vp_index[cpu] = msr_vp_index;
>> >  
>> >if (msr_vp_index > hv_max_vp_index)
>> >hv_max_vp_index = msr_vp_index;
>> 
>> The above is unrelated cleanup (as cpu == smp_processor_id() for
>> CPUHP_AP_ONLINE_DYN callbacks), right? As I'm pretty sure these can't be
>> preempted.
>
> They can be preempted, but they are guaranteed to run on the upcoming CPU,
> i.e. smp_processor_id() is allowed even in preemptible context as the task
> cannot migrate.
>

Ah, right, thanks! The guarantee that they don't migrate should be enough.

-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-12 Thread Vitaly Kuznetsov
Peter Zijlstra  writes:

> On Tue, Jun 11, 2019 at 10:20:03PM +0100, Dmitry Safonov wrote:
>> KVM support may be compiled as dynamic module, which triggers the
>> following splat on modprobe:
>> 
>>  KVM: vmx: using Hyper-V Enlightened VMCS
>>  BUG: using smp_processor_id() in preemptible [] code: modprobe/466 
>> caller is debug_smp_processor_id+0x17/0x19
>>  CPU: 0 PID: 466 Comm: modprobe Kdump: loaded Not tainted 4.19.43 #1
>>  Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
>> 090007  06/02/2017
>>  Call Trace:
>>   dump_stack+0x61/0x7e
>>   check_preemption_disabled+0xd4/0xe6
>>   debug_smp_processor_id+0x17/0x19
>>   set_hv_tscchange_cb+0x1b/0x89
>>   kvm_arch_init+0x14a/0x163 [kvm]
>>   kvm_init+0x30/0x259 [kvm]
>>   vmx_init+0xed/0x3db [kvm_intel]
>>   do_one_initcall+0x89/0x1bc
>>   do_init_module+0x5f/0x207
>>   load_module+0x1b34/0x209b
>>   __ia32_sys_init_module+0x17/0x19
>>   do_fast_syscall_32+0x121/0x1fa
>>   entry_SYSENTER_compat+0x7f/0x91
>> 
>> The easiest solution seems to be disabling preemption while setting up
>> reenlightenment MSRs. While at it, fix hv_cpu_*() callbacks.
>> 
>> Fixes: 93286261de1b4 ("x86/hyperv: Reenlightenment notifications
>> support")
>> 
>> Cc: Andy Lutomirski 
>> Cc: Borislav Petkov 
>> Cc: Cathy Avery 
>> Cc: Haiyang Zhang 
>> Cc: "H. Peter Anvin" 
>> Cc: Ingo Molnar 
>> Cc: "K. Y. Srinivasan" 
>> Cc: "Michael Kelley (EOSG)" 
>> Cc: Mohammed Gamal 
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: Roman Kagan 
>> Cc: Sasha Levin 
>> Cc: Stephen Hemminger 
>> Cc: Thomas Gleixner 
>> Cc: Vitaly Kuznetsov 
>> 
>> Cc: de...@linuxdriverproject.org
>> Cc: k...@vger.kernel.org
>> Cc: linux-hyp...@vger.kernel.org
>> Cc: x...@kernel.org
>> Reported-by: Prasanna Panchamukhi 
>> Signed-off-by: Dmitry Safonov 
>> ---
>>  arch/x86/hyperv/hv_init.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 1608050e9df9..0bdd79ecbff8 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
>>  static int hv_cpu_init(unsigned int cpu)
>>  {
>>  u64 msr_vp_index;
>> -struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> +struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>>  void **input_arg;
>>  struct page *pg;
>>  
>> @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
>>  
>>  hv_get_vp_index(msr_vp_index);
>>  
>> -hv_vp_index[smp_processor_id()] = msr_vp_index;
>> +hv_vp_index[cpu] = msr_vp_index;
>>  
>>  if (msr_vp_index > hv_max_vp_index)
>>  hv_max_vp_index = msr_vp_index;
>> @@ -182,7 +182,6 @@ void set_hv_tscchange_cb(void (*cb)(void))
>>  struct hv_reenlightenment_control re_ctrl = {
>>  .vector = HYPERV_REENLIGHTENMENT_VECTOR,
>>  .enabled = 1,
>> -.target_vp = hv_vp_index[smp_processor_id()]
>>  };
>>  struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>>  
>> @@ -196,7 +195,11 @@ void set_hv_tscchange_cb(void (*cb)(void))
>>  /* Make sure callback is registered before we write to MSRs */
>>  wmb();
>>  
>> +preempt_disable();
>> +re_ctrl.target_vp = hv_vp_index[smp_processor_id()];
>>  wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
>> +preempt_enable();
>> +
>>  wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
>>  }
>>  EXPORT_SYMBOL_GPL(set_hv_tscchange_cb);
>
> This looks bogus, MSRs are a per-cpu resource, you had better know what
> CPUs you're on and be stuck to it when you do wrmsr. This just fudges
> the code to make the warning go away and doesn't fix the actual problem
> afaict.

Actually, we don't care which CPU will receive the reenlightenment
notification, and TSC emulation in Hyper-V is, of course, global. We have
code which re-assigns the notification to some other CPU in case the
one it's currently assigned to goes away (see hv_cpu_die()).

-- 
Vitaly


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-12 Thread Vitaly Kuznetsov
Dmitry Safonov  writes:

> KVM support may be compiled as dynamic module, which triggers the
> following splat on modprobe:
>
>  KVM: vmx: using Hyper-V Enlightened VMCS
>  BUG: using smp_processor_id() in preemptible [] code: modprobe/466 
> caller is debug_smp_processor_id+0x17/0x19
>  CPU: 0 PID: 466 Comm: modprobe Kdump: loaded Not tainted 4.19.43 #1
>  Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
> 090007  06/02/2017
>  Call Trace:
>   dump_stack+0x61/0x7e
>   check_preemption_disabled+0xd4/0xe6
>   debug_smp_processor_id+0x17/0x19
>   set_hv_tscchange_cb+0x1b/0x89
>   kvm_arch_init+0x14a/0x163 [kvm]
>   kvm_init+0x30/0x259 [kvm]
>   vmx_init+0xed/0x3db [kvm_intel]
>   do_one_initcall+0x89/0x1bc
>   do_init_module+0x5f/0x207
>   load_module+0x1b34/0x209b
>   __ia32_sys_init_module+0x17/0x19
>   do_fast_syscall_32+0x121/0x1fa
>   entry_SYSENTER_compat+0x7f/0x91

Hm, I never noticed this one; you probably need something like
CONFIG_PREEMPT enabled to see it.

>
> The easiest solution seems to be disabling preemption while setting up
> reenlightenment MSRs. While at it, fix hv_cpu_*() callbacks.
>
> Fixes: 93286261de1b4 ("x86/hyperv: Reenlightenment notifications
> support")
>
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Cathy Avery 
> Cc: Haiyang Zhang 
> Cc: "H. Peter Anvin" 
> Cc: Ingo Molnar 
> Cc: "K. Y. Srinivasan" 
> Cc: "Michael Kelley (EOSG)" 
> Cc: Mohammed Gamal 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Roman Kagan 
> Cc: Sasha Levin 
> Cc: Stephen Hemminger 
> Cc: Thomas Gleixner 
> Cc: Vitaly Kuznetsov 
>
> Cc: de...@linuxdriverproject.org
> Cc: k...@vger.kernel.org
> Cc: linux-hyp...@vger.kernel.org
> Cc: x...@kernel.org
> Reported-by: Prasanna Panchamukhi 
> Signed-off-by: Dmitry Safonov 
> ---
>  arch/x86/hyperv/hv_init.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 1608050e9df9..0bdd79ecbff8 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
>  static int hv_cpu_init(unsigned int cpu)
>  {
>   u64 msr_vp_index;
> - struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> + struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>   void **input_arg;
>   struct page *pg;
>  
> @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
>  
>   hv_get_vp_index(msr_vp_index);
>  
> - hv_vp_index[smp_processor_id()] = msr_vp_index;
> + hv_vp_index[cpu] = msr_vp_index;
>  
>   if (msr_vp_index > hv_max_vp_index)
>   hv_max_vp_index = msr_vp_index;

The above is unrelated cleanup (as cpu == smp_processor_id() for
CPUHP_AP_ONLINE_DYN callbacks), right? As I'm pretty sure these can't be
preempted.

> @@ -182,7 +182,6 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   struct hv_reenlightenment_control re_ctrl = {
>   .vector = HYPERV_REENLIGHTENMENT_VECTOR,
>   .enabled = 1,
> - .target_vp = hv_vp_index[smp_processor_id()]
>   };
>   struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>  
> @@ -196,7 +195,11 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   /* Make sure callback is registered before we write to MSRs */
>   wmb();
>  
> + preempt_disable();
> + re_ctrl.target_vp = hv_vp_index[smp_processor_id()];
>   wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> + preempt_enable();
> +

My personal preference would be to do something like
   int cpu = get_cpu();

   ... set things up ...

   put_cpu();

instead, there are no long-running things in the whole function. But
what you've done should work too, so

Reviewed-by: Vitaly Kuznetsov 

>   wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
>  }
>  EXPORT_SYMBOL_GPL(set_hv_tscchange_cb);
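The get_cpu()-based variant suggested above might look like the sketch below (kernel-only code against the quoted patch, not compilable standalone; elided setup kept as '...'):

```c
void set_hv_tscchange_cb(void (*cb)(void))
{
	struct hv_reenlightenment_control re_ctrl = {
		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
		.enabled = 1,
	};
	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
	int cpu;

	...

	/* Make sure callback is registered before we write to MSRs */
	wmb();

	/* get_cpu() disables preemption until the matching put_cpu(),
	 * so the VP index we read matches the CPU doing the wrmsrl() */
	cpu = get_cpu();
	re_ctrl.target_vp = hv_vp_index[cpu];
	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
	put_cpu();

	wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
}
```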

-- 
Vitaly


Re: [PATCH v3 1/2] PCI: hv: Replace hv_vp_set with hv_vpset

2019-02-27 Thread Vitaly Kuznetsov
Maya Nakamura  writes:

> Remove a duplicate definition of VP set (hv_vp_set) and use the common
> definition (hv_vpset) that is used in other places.
>
> Change the order of the members in struct hv_pcibus_device so that the
> declaration of retarget_msi_interrupt_params is the last member. Struct
> hv_vpset, which contains a flexible array, is nested two levels deep in
> struct hv_pcibus_device via retarget_msi_interrupt_params.
>
> Add a comment that retarget_msi_interrupt_params should be the last member
> of struct hv_pcibus_device.
>
> Signed-off-by: Maya Nakamura 
> ---
> Change in v3:
> - Correct the v2 change log.
>
> Change in v2:
> - Update the commit message.
>
>  drivers/pci/controller/pci-hyperv.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c 
> b/drivers/pci/controller/pci-hyperv.c
> index 9ba4d12c179c..da8b58d8630d 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -393,12 +393,6 @@ struct hv_interrupt_entry {
>  
>  #define HV_VP_SET_BANK_COUNT_MAX 5 /* current implementation limit */
>  
> -struct hv_vp_set {
> - u64 format; /* 0 (HvGenericSetSparse4k) */
> - u64 valid_banks;
> - u64 masks[HV_VP_SET_BANK_COUNT_MAX];
> -};
> -
>  /*
>   * flags for hv_device_interrupt_target.flags
>   */
> @@ -410,7 +404,7 @@ struct hv_device_interrupt_target {
>   u32 flags;
>   union {
>   u64  vp_mask;
> - struct hv_vp_set vp_set;
> + struct hv_vpset vp_set;
>   };
>  };
>  
> @@ -460,12 +454,16 @@ struct hv_pcibus_device {
>   struct msi_controller msi_chip;
>   struct irq_domain *irq_domain;
>  
> - /* hypercall arg, must not cross page boundary */
> - struct retarget_msi_interrupt retarget_msi_interrupt_params;
> -
>   spinlock_t retarget_msi_interrupt_lock;
>  
>   struct workqueue_struct *wq;
> +
> + /* hypercall arg, must not cross page boundary */
> + struct retarget_msi_interrupt retarget_msi_interrupt_params;
> +

One more thing here (and again sorry for being late): this structure is
being used as a Hyper-V hypercall argument, and these have special
requirements on alignment (8 bytes). struct retarget_msi_interrupt is
packed, and depending on your environment/compiler you may or may not see
the issue, but I was able to get HVCALL_RETARGET_INTERRUPT failing with
the return value of 4 (HV_STATUS_INVALID_ALIGNMENT).

I suggest we add __aligned(8) to 'struct retarget_msi_interrupt'
definition (I haven't tested this but should work). This is not a new
issue, it existed before your patch, but the code movement you do may
change the exposure.


> + /*
> +  * Don't put anything here: retarget_msi_interrupt_params must be last
> +  */
>  };
>  
>  /*
> @@ -955,12 +953,13 @@ static void hv_irq_unmask(struct irq_data *data)
>*/
>   params->int_target.flags |=
>   HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> - params->int_target.vp_set.valid_banks =
> + params->int_target.vp_set.valid_bank_mask =
>   (1ull << HV_VP_SET_BANK_COUNT_MAX) - 1;
>  
>   /*
>* var-sized hypercall, var-size starts after vp_mask (thus
> -  * vp_set.format does not count, but vp_set.valid_banks does).
> +  * vp_set.format does not count, but vp_set.valid_bank_mask
> +  * does).
>*/
>   var_size = 1 + HV_VP_SET_BANK_COUNT_MAX;
>  
> @@ -974,7 +973,7 @@ static void hv_irq_unmask(struct irq_data *data)
>   goto exit_unlock;
>   }
>  
> - params->int_target.vp_set.masks[cpu_vmbus / 64] |=
> + params->int_target.vp_set.bank_contents[cpu_vmbus / 64] |=
>   (1ULL << (cpu_vmbus & 63));
>   }
>   } else {

-- 
Vitaly


Re: [PATCH v3 2/2] PCI: hv: Refactor hv_irq_unmask() to use cpumask_to_vpset()

2019-02-27 Thread Vitaly Kuznetsov
Maya Nakamura  writes:

> Remove the duplicate implementation of cpumask_to_vpset() and use the
> shared implementation. Export hv_max_vp_index, which is required by
> cpumask_to_vpset().
>
> Apply changes to hv_irq_unmask() based on feedback.
>

I just noticed an issue with this patch - sorry I missed it before. I
don't see the commit in Linus' tree; not sure if we should amend this
one or whether a follow-up patch is needed.

> Signed-off-by: Maya Nakamura 
> ---
> Changes in v3:
>  - Modify to catch all failures from cpumask_to_vpset().
>  - Correct the v2 change log about the commit message.
>
> Changes in v2:
>  - Remove unnecessary nr_bank initialization.
>  - Delete two unnecessary dev_err()'s.
>  - Unlock before returning.
>  - Update the commit message.
>
>  arch/x86/hyperv/hv_init.c   |  1 +
>  drivers/pci/controller/pci-hyperv.c | 38 +
>  2 files changed, 18 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..7f2eed1fc81b 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -96,6 +96,7 @@ void  __percpu **hyperv_pcpu_input_arg;
>  EXPORT_SYMBOL_GPL(hyperv_pcpu_input_arg);
>  
>  u32 hv_max_vp_index;
> +EXPORT_SYMBOL_GPL(hv_max_vp_index);
>  
>  static int hv_cpu_init(unsigned int cpu)
>  {
> diff --git a/drivers/pci/controller/pci-hyperv.c 
> b/drivers/pci/controller/pci-hyperv.c
> index da8b58d8630d..a78def332bbc 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -391,8 +391,6 @@ struct hv_interrupt_entry {
>   u32 data;
>  };
>  
> -#define HV_VP_SET_BANK_COUNT_MAX 5 /* current implementation limit */
> -
>  /*
>   * flags for hv_device_interrupt_target.flags
>   */
> @@ -908,12 +906,12 @@ static void hv_irq_unmask(struct irq_data *data)
>   struct retarget_msi_interrupt *params;
>   struct hv_pcibus_device *hbus;
>   struct cpumask *dest;
> + cpumask_var_t tmp;
>   struct pci_bus *pbus;
>   struct pci_dev *pdev;
>   unsigned long flags;
>   u32 var_size = 0;
> - int cpu_vmbus;
> - int cpu;
> + int cpu, nr_bank;
>   u64 res;
>  
>   dest = irq_data_get_effective_affinity_mask(data);
> @@ -953,29 +951,27 @@ static void hv_irq_unmask(struct irq_data *data)
>*/
>   params->int_target.flags |=
>   HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> - params->int_target.vp_set.valid_bank_mask =
> - (1ull << HV_VP_SET_BANK_COUNT_MAX) - 1;
> +
> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL)) {

We can't use GFP_KERNEL here: this is happening under
hbus->retarget_msi_interrupt_lock spinlock, we should use GFP_ATOMIC
instead. It may, however, make sense to add the cpumask to a
pre-allocated structure (e.g. struct hv_pcibus_device) to make sure the
allocation never fails.
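A sketch of the pre-allocation idea (the retarget_cpumask field name is hypothetical; it is placed before retarget_msi_interrupt_params so the 'must be last' rule still holds, and no allocation happens under the spinlock):

```c
struct hv_pcibus_device {
	...
	spinlock_t retarget_msi_interrupt_lock;

	/* scratch cpumask for hv_irq_unmask(); protected by
	 * retarget_msi_interrupt_lock (field name is hypothetical) */
	struct cpumask retarget_cpumask;

	/* hypercall arg, must not cross page boundary */
	struct retarget_msi_interrupt retarget_msi_interrupt_params;
	/* Don't put anything here: retarget_msi_interrupt_params must be last */
};

/* in hv_irq_unmask(), under retarget_msi_interrupt_lock: */
cpumask_and(&hbus->retarget_cpumask, dest, cpu_online_mask);
nr_bank = cpumask_to_vpset(&params->int_target.vp_set,
			   &hbus->retarget_cpumask);
if (nr_bank <= 0) {
	res = 1;
	goto exit_unlock;
}
```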

> + res = 1;
> + goto exit_unlock;
> + }
> +
> + cpumask_and(tmp, dest, cpu_online_mask);
> + nr_bank = cpumask_to_vpset(&params->int_target.vp_set, tmp);
> + free_cpumask_var(tmp);
> +
> + if (nr_bank <= 0) {
> + res = 1;
> + goto exit_unlock;
> + }
>  
>   /*
>* var-sized hypercall, var-size starts after vp_mask (thus
>* vp_set.format does not count, but vp_set.valid_bank_mask
>* does).
>*/
> - var_size = 1 + HV_VP_SET_BANK_COUNT_MAX;
> -
> - for_each_cpu_and(cpu, dest, cpu_online_mask) {
> - cpu_vmbus = hv_cpu_number_to_vp_number(cpu);
> -
> - if (cpu_vmbus >= HV_VP_SET_BANK_COUNT_MAX * 64) {
> - dev_err(&hbus->hdev->device,
> - "too high CPU %d", cpu_vmbus);
> - res = 1;
> - goto exit_unlock;
> - }
> -
> - params->int_target.vp_set.bank_contents[cpu_vmbus / 64] |=
> - (1ULL << (cpu_vmbus & 63));
> - }
> + var_size = 1 + nr_bank;
>   } else {
>   for_each_cpu_and(cpu, dest, cpu_online_mask) {
>   params->int_target.vp_mask |=

-- 
Vitaly


Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-02-26 Thread Vitaly Kuznetsov
Kairui Song  writes:

> When hypercalls are used for sending IPIs, kexec will fail with a kernel
> panic like this:
>
> kexec_core: Starting new kernel
> BUG: unable to handle kernel NULL pointer dereference at 
> PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
> Oops: 0002 [#1] SMP PTI
> CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
> Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
> Hyper-V UEFI Release v3.0 03/02/2018
> RIP: 0010:0xc901d000
> Code: Bad RIP value.
> RSP: 0018:c9000495bcf0 EFLAGS: 00010046
> RAX:  RBX: c901d000 RCX: 00020015
> RDX: 7f553000 RSI:  RDI: c9000495bd28
> RBP: 0002 R08:  R09: 8238aaf8
> R10: 8238aae0 R11:  R12: 88007f553008
> R13: 0001 R14: 8800ff553000 R15: 
> FS:  7ff5c0e67b80() GS:880078e0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
> Call Trace:
>  ? __send_ipi_mask+0x1c6/0x2d0
>  ? hv_send_ipi_mask_allbutself+0x6d/0xb0
>  ? mp_save_irq+0x70/0x70
>  ? __ioapic_read_entry+0x32/0x50
>  ? ioapic_read_entry+0x39/0x50
>  ? clear_IO_APIC_pin+0xb8/0x110
>  ? native_stop_other_cpus+0x6e/0x170
>  ? native_machine_shutdown+0x22/0x40
>  ? kernel_kexec+0x136/0x156
>  ? __do_sys_reboot+0x1be/0x210
>  ? kmem_cache_free+0x1b1/0x1e0
>  ? __dentry_kill+0x10b/0x160
>  ? _cond_resched+0x15/0x30
>  ? dentry_kill+0x47/0x170
>  ? dput.part.34+0xc6/0x100
>  ? __fput+0x147/0x220
>  ? _cond_resched+0x15/0x30
>  ? task_work_run+0x38/0xa0
>  ? do_syscall_64+0x5b/0x160
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
> ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
> ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security 
> nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter 
> ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
> intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw 
> scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel 
> hv_vmbus
>
> That's because Hyper-V's machine_ops.shutdown allows registering a hook to
> be called upon shutdown, and hv_vmbus invalidates the hypercall page
> using this hook. But hv_hypercall_pg still points to this invalid
> page, so any hypercall-based operation will panic the kernel. And the
> kexec process will send IPIs to stop CPUs.
>
> Fix this by simply resetting hv_hypercall_pg to NULL when the page is
> revoked, to avoid any misuse. IPI sending will then fall back to the
> non-hypercall-based method. This only happens on kexec / kdump, so
> setting it to NULL should be good enough.
>
> Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
> Signed-off-by: Kairui Song 
>
> ---
>
> I'm not sure about the details of what happened after the
>
> wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>
> But this fix should be valid, please let me know if I get anything
> wrong, thanks.
>
>  arch/x86/hyperv/hv_init.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..92291c18d716 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>  
> + /* Cleanup page reference before reset the page */
> + hv_hypercall_pg = NULL;
> + wmb();
> +
>   /* Reset the hypercall page */
>   hypercall_msr.as_uint64 = 0;
>   wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

This all looks correct to me. We need to reset HV_X64_MSR_HYPERCALL as
the hypercall page will remain 'special' otherwise, so dumping it may
have undesired consequences. (Though I think that the last time I checked,
it was possible to read from this page without issues - and this should
be enough for kdump. But I'd rather keep things as they are, as one
additional wrmsr doesn't hurt.)

-- 
Vitaly


Re: [PATCH 1/2] PCI: hv: Replace hv_vp_set with hv_vpset

2019-01-24 Thread Vitaly Kuznetsov
Maya Nakamura  writes:

> Remove a duplicate definition of VP set (hv_vp_set) and use the common
> definition (hv_vpset) that is used in other places.
>
> Signed-off-by: Maya Nakamura 
> ---
>  drivers/pci/controller/pci-hyperv.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 9ba4d12c179c..da8b58d8630d 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -393,12 +393,6 @@ struct hv_interrupt_entry {
>  
>  #define HV_VP_SET_BANK_COUNT_MAX 5 /* current implementation limit */
>  
> -struct hv_vp_set {
> - u64 format; /* 0 (HvGenericSetSparse4k) */
> - u64 valid_banks;
> - u64 masks[HV_VP_SET_BANK_COUNT_MAX];
> -};
> -
>  /*
>   * flags for hv_device_interrupt_target.flags
>   */
> @@ -410,7 +404,7 @@ struct hv_device_interrupt_target {
>   u32 flags;
>   union {
>   u64  vp_mask;
> - struct hv_vp_set vp_set;
> + struct hv_vpset vp_set;
>   };
>  };
>  
> @@ -460,12 +454,16 @@ struct hv_pcibus_device {
>   struct msi_controller msi_chip;
>   struct irq_domain *irq_domain;
>  
> - /* hypercall arg, must not cross page boundary */
> - struct retarget_msi_interrupt retarget_msi_interrupt_params;
> -
>   spinlock_t retarget_msi_interrupt_lock;
>  
>   struct workqueue_struct *wq;
> +
> + /* hypercall arg, must not cross page boundary */
> + struct retarget_msi_interrupt retarget_msi_interrupt_params;
> +
> + /*
> +  * Don't put anything here: retarget_msi_interrupt_params must be last
> +  */


This change seems to be unrelated and is not described anywhere in the
commit message - or did I miss something?

>  };
>  
>  /*
> @@ -955,12 +953,13 @@ static void hv_irq_unmask(struct irq_data *data)
>*/
>   params->int_target.flags |=
>   HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> - params->int_target.vp_set.valid_banks =
> + params->int_target.vp_set.valid_bank_mask =
>   (1ull << HV_VP_SET_BANK_COUNT_MAX) - 1;
>  
>   /*
>* var-sized hypercall, var-size starts after vp_mask (thus
> -  * vp_set.format does not count, but vp_set.valid_banks does).
> +  * vp_set.format does not count, but vp_set.valid_bank_mask
> +  * does).
>*/
>   var_size = 1 + HV_VP_SET_BANK_COUNT_MAX;
>  
> @@ -974,7 +973,7 @@ static void hv_irq_unmask(struct irq_data *data)
>   goto exit_unlock;
>   }
>  
> - params->int_target.vp_set.masks[cpu_vmbus / 64] |=
> + params->int_target.vp_set.bank_contents[cpu_vmbus / 64] |=
>   (1ULL << (cpu_vmbus & 63));
>   }
>   } else {

-- 
Vitaly


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-08 Thread Vitaly Kuznetsov
Dan Carpenter  writes:

> On Mon, Jan 07, 2019 at 07:38:20PM +0100, Vitaly Kuznetsov wrote:
>> (I remember Greg disliked when people were tagging patches for stable@
>> themselves, he prefered maintainers deciding if the particular commit
>> deserves stable@ or not - but as you have a tree now we may as well have
>> different rules :-)
>
> You're thinking of Dave Miller.  So far as I know he's the only
> maintainer who handles stable that way.

This may actually be the case: you think "one of the mighty Linux
overlords" but your fingers type "Greg" instead. Sorry for the
confusion :-)

-- 
Vitaly


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-08 Thread Vitaly Kuznetsov
David Hildenbrand  writes:

> On 07.01.19 14:44, Vitaly Kuznetsov wrote:
>> David Hildenbrand  writes:
>> 
...
>>> On 04.01.19 15:19, Vitaly Kuznetsov wrote:
>>>>if (start_pfn > has->start_pfn &&
>>>> -  !PageReserved(pfn_to_page(start_pfn - 1)))
>>>> +  online_section_nr(pfn_to_section_nr(start_pfn)))
>>>>hv_bring_pgs_online(has, start_pfn, pgs_ol);
>>>>  
>>>>}
>>>>
>>>
>>> I wonder if you should use pfn_to_online_page() and check for PageOffline().
>>>
>>> (I guess online_section_nr() should also do the trick)
>> 
>> I'm worried a bit about racing with mm code here as we're not doing
>> mem_hotplug_begin()/done() so I'd slightly prefer keeping
>> online_section_nr() (pfn_to_online_page() also uses it but then it gets
>> to the particular struct page). Moreover, with pfn_to_online_page() we
will be looking at some other pfn - because the start_pfn is definitely
>> offline (pre-patch we were looking at start_pfn-1). Just looking at the
>> whole section seems cleaner.
>
> Fine with me. I guess the section can never be offlined as it still
> contains reserved pages if not fully "fake-onlined" by HV code already.
>
> But we could still consider mem_hotplug_begin()/done() as we could have
> a online section although online_pages() has not completed yet. So we
> could actually touch some "semi onlined section".

Yes, exactly, if we race with section onlining here we may never online
the tail so it will remain 'semi onlined'. I'm going to propose
exporting mem_hotplug_begin()/done() to modules to fix this (in a
separate patch because I anticipate some pushback :-)

>
> Acked-by: David Hildenbrand 
>

Thanks!

-- 
Vitaly


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-07 Thread Vitaly Kuznetsov
Sasha Levin  writes:

> On Mon, Jan 07, 2019 at 02:44:30PM +0100, Vitaly Kuznetsov wrote:
>>P.S. I still think about bringing mem_hotplug_begin()/done() to
>>hv_balloon but that's going to be a separate discussion, here I want to
>>have a small fix backportable to stable.
>
> This should probably be marked for stable then :)
>

Yes, please :-)

I didn't test old kernels but this seems to be an old issue,
e.g. DEFERRED_STRUCT_PAGE_INIT appeared in 4.2 - not sure if memory
hotplug code was using it back then but, oh well, it's all history now.

(I remember Greg disliked when people were tagging patches for stable@
themselves, he prefered maintainers deciding if the particular commit
deserves stable@ or not - but as you have a tree now we may as well have
different rules :-)

-- 
Vitaly


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-07 Thread Vitaly Kuznetsov
David Hildenbrand  writes:

> On 04.01.19 15:19, Vitaly Kuznetsov wrote:
>> Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use
>> 128M. To deal with it we implement partial section onlining by registering
>> custom page onlining callback (hv_online_page()). Later, when more memory
>> arrives we try to online the 'tail' (see hv_bring_pgs_online()).
>> 
>> It was found that in some cases this 'tail' onlining causes issues:
>> 
>>  BUG: Bad page state in process kworker/0:2  pfn:109e3a
>>  page:e08344278e80 count:0 mapcount:1 mapping: index:0x0
>>  flags: 0xf8000()
>>  raw: 000f8000 dead0100 dead0200 
>>  raw:    
>>  page dumped because: nonzero mapcount
>>  ...
>>  Workqueue: events hot_add_req [hv_balloon]
>>  Call Trace:
>>   dump_stack+0x5c/0x80
>>   bad_page.cold.112+0x7f/0xb2
>>   free_pcppages_bulk+0x4b8/0x690
>>   free_unref_page+0x54/0x70
>>   hv_page_online_one+0x5c/0x80 [hv_balloon]
>>   hot_add_req.cold.24+0x182/0x835 [hv_balloon]
>>   ...
>> 
>> Turns out that we now have deferred struct page initialization for memory
>> hotplug so e.g. memory_block_action() in drivers/base/memory.c does
>> pages_correctly_probed() check and in that check it avoids inspecting
>> struct pages and checks sections instead. But in Hyper-V balloon driver we
>> do PageReserved(pfn_to_page()) check and this is now wrong.
>> 
>> Switch to checking online_section_nr() instead.
>> 
>> Signed-off-by: Vitaly Kuznetsov 
>> ---
>>  drivers/hv/hv_balloon.c | 10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
>> index 5301fef16c31..7c6349a50ef1 100644
>> --- a/drivers/hv/hv_balloon.c
>> +++ b/drivers/hv/hv_balloon.c
>> @@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long pg_start,
>>  pfn_cnt -= pgs_ol;
>>  /*
>>   * Check if the corresponding memory block is already
>> - * online by checking its last previously backed page.
>> - * In case it is we need to bring rest (which was not
>> - * backed previously) online too.
>> + * online. It is possible to observe struct pages still
>> + * being uninitialized here so check section instead.
>> + * In case the section is online we need to bring the
>> + * rest of pfns (which were not backed previously)
>> + * online too.
>>   */
>>  if (start_pfn > has->start_pfn &&
>> -!PageReserved(pfn_to_page(start_pfn - 1)))
>> +online_section_nr(pfn_to_section_nr(start_pfn)))
>>  hv_bring_pgs_online(has, start_pfn, pgs_ol);
>>  
>>  }
>> 
>
> I wonder if you should use pfn_to_online_page() and check for PageOffline().
>
> (I guess online_section_nr() should also do the trick)

I'm worried a bit about racing with mm code here as we're not doing
mem_hotplug_begin()/done() so I'd slightly prefer keeping
online_section_nr() (pfn_to_online_page() also uses it but then it gets
to the particular struct page). Moreover, with pfn_to_online_page() we
will be looking at some other pfn - because the start_pfn is definitely
offline (pre-patch we were looking at start_pfn-1). Just looking at the
whole section seems cleaner.

P.S. I still think about bringing mem_hotplug_begin()/done() to
hv_balloon but that's going to be a separate discussion, here I want to
have a small fix backportable to stable.
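A sketch of the pfn-to-section arithmetic behind online_section_nr(pfn_to_section_nr(...)), assuming the x86_64 defaults of 128M sections and 4K pages (SECTION_SIZE_BITS and PAGE_SHIFT values as in mainline; this is a userspace mock, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* x86_64 defaults assumed: 128M sparsemem sections, 4K pages. */
#define SECTION_SIZE_BITS 27
#define PAGE_SHIFT        12
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)   /* 15 */

/* 32768 pfns per 128M section; the balloon driver's check works on
 * this section number rather than on an individual struct page. */
static uint64_t pfn_to_section_nr(uint64_t pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}
```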

Thanks,

-- 
Vitaly


[PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-04 Thread Vitaly Kuznetsov
Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use
128M. To deal with it we implement partial section onlining by registering
custom page onlining callback (hv_online_page()). Later, when more memory
arrives we try to online the 'tail' (see hv_bring_pgs_online()).

It was found that in some cases this 'tail' onlining causes issues:

 BUG: Bad page state in process kworker/0:2  pfn:109e3a
 page:e08344278e80 count:0 mapcount:1 mapping: index:0x0
 flags: 0xf8000()
 raw: 000f8000 dead0100 dead0200 
 raw:    
 page dumped because: nonzero mapcount
 ...
 Workqueue: events hot_add_req [hv_balloon]
 Call Trace:
  dump_stack+0x5c/0x80
  bad_page.cold.112+0x7f/0xb2
  free_pcppages_bulk+0x4b8/0x690
  free_unref_page+0x54/0x70
  hv_page_online_one+0x5c/0x80 [hv_balloon]
  hot_add_req.cold.24+0x182/0x835 [hv_balloon]
  ...

Turns out that we now have deferred struct page initialization for memory
hotplug so e.g. memory_block_action() in drivers/base/memory.c does
pages_correctly_probed() check and in that check it avoids inspecting
struct pages and checks sections instead. But in Hyper-V balloon driver we
do PageReserved(pfn_to_page()) check and this is now wrong.

Switch to checking online_section_nr() instead.

Signed-off-by: Vitaly Kuznetsov 
---
 drivers/hv/hv_balloon.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 5301fef16c31..7c6349a50ef1 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long pg_start,
pfn_cnt -= pgs_ol;
/*
 * Check if the corresponding memory block is already
-* online by checking its last previously backed page.
-* In case it is we need to bring rest (which was not
-* backed previously) online too.
+* online. It is possible to observe struct pages still
+* being uninitialized here so check section instead.
+* In case the section is online we need to bring the
+* rest of pfns (which were not backed previously)
+* online too.
 */
if (start_pfn > has->start_pfn &&
-   !PageReserved(pfn_to_page(start_pfn - 1)))
+   online_section_nr(pfn_to_section_nr(start_pfn)))
hv_bring_pgs_online(has, start_pfn, pgs_ol);
 
}
-- 
2.20.1



Re: [PATCH 1/4] arm64: hyperv: Add core Hyper-V include files

2018-12-10 Thread Vitaly Kuznetsov
k...@linuxonhyperv.com writes:

> +
> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> +u64 *cur_tsc)
> +{
> + u64 scale, offset;
> + u32 sequence;
> +
> + /*
> +  * The protocol for reading Hyper-V TSC page is specified in Hypervisor
> +  * Top-Level Functional Specification.  To get the reference time we
> +  * must do the following:
> +  * - READ ReferenceTscSequence
> +  *   A special '0' value indicates the time source is unreliable and we
> +  *   need to use something else.
> +  * - ReferenceTime =
> +  * ((HWclock val) * ReferenceTscScale) >> 64) + ReferenceTscOffset
> +  * - READ ReferenceTscSequence again. In case its value has changed
> +  *   since our first reading we need to discard ReferenceTime and repeat
> +  *   the whole sequence as the hypervisor was updating the page in
> +  *   between.
> +  */
> + do {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);

(sorry if this was already discussed before)

Current x86 code doing this actually checks for '0' here (note the
comment about this special value above):

sequence = READ_ONCE(tsc_pg->tsc_sequence);
if (!sequence)
return U64_MAX;

Was it removed intentionally (and we need to fix the comment then)? 
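For reference, the seqcount-style read protocol described in the quoted comment - including the special '0' check being asked about - can be sketched in userspace like this (mocked structure layout; GCC's __int128 stands in for the kernel's 128-bit multiply helper):

```c
#include <assert.h>
#include <stdint.h>

struct ms_hyperv_tsc_page {          /* minimal mock of the real layout */
    volatile uint32_t tsc_sequence;  /* 0 => time source unreliable */
    uint64_t tsc_scale;
    int64_t  tsc_offset;
};

/* Seqcount-style read: retry if the hypervisor updated the page while
 * we were reading; bail out on the special '0' sequence value. */
static uint64_t read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg,
                              uint64_t (*rdtsc)(void))
{
    uint32_t sequence;
    uint64_t scale, tsc, ref;
    int64_t offset;

    do {
        sequence = tsc_pg->tsc_sequence;
        if (!sequence)
            return UINT64_MAX;       /* caller must fall back (e.g. MSR read) */
        tsc = rdtsc();
        scale = tsc_pg->tsc_scale;
        offset = tsc_pg->tsc_offset;
        /* ReferenceTime = ((tsc * scale) >> 64) + offset */
        ref = (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
    } while (tsc_pg->tsc_sequence != sequence);

    return ref;
}
```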

-- 
Vitaly


RE: [PATCH net-next] hv_netvsc: fix vf serial matching with pci slot info

2018-10-12 Thread Vitaly Kuznetsov
Haiyang Zhang  writes:

>> -Original Message-
>> From: Haiyang Zhang 
>> Sent: Thursday, October 11, 2018 4:15 PM
>> To: da...@davemloft.net; net...@vger.kernel.org
>> Cc: Haiyang Zhang ; KY Srinivasan
>> ; Stephen Hemminger ;
>> o...@aepfle.de; vkuznets ;
>> de...@linuxdriverproject.org; linux-ker...@vger.kernel.org
>> Subject: [PATCH net-next] hv_netvsc: fix vf serial matching with pci slot info
>> 
>> From: Haiyang Zhang 
>> 
>> The VF device's serial number is saved as a string in the PCI slot's
>> kobj name, not in slot->number. This patch corrects the netvsc driver,
>> so the VF device can be successfully paired with the synthetic NIC.
>> 
>> Fixes: 00d7ddba1143 ("hv_netvsc: pair VF based on serial number")
>> Signed-off-by: Haiyang Zhang 
>
> Thanks Stephen for the reminder -- I added the "reported-by" here:
>
> Reported-by: Vitaly Kuznetsov 

Thanks) 

The difference in the hack I sent to Stephen was that instead of using
kstrtou32() and checking the return value I opted for snprintf() and
doing strncmp().
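The two matching strategies mentioned - parsing the slot name into an integer versus formatting the serial and comparing strings - can be sketched in userspace as follows (hypothetical helpers, not the driver's actual code; strtoul stands in for the kernel's kstrtou32):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Approach 1 (the patch): parse the slot kobj name into an integer
 * and compare numerically, rejecting trailing garbage. */
static int serial_matches_parse(const char *slot_name, uint32_t serial)
{
    char *end;
    unsigned long val = strtoul(slot_name, &end, 10);
    return *end == '\0' && val == serial;
}

/* Approach 2 (the hack mentioned above): format the serial into a
 * string and compare the strings directly. */
static int serial_matches_format(const char *slot_name, uint32_t serial)
{
    char buf[16];
    snprintf(buf, sizeof(buf), "%u", serial);
    return strncmp(slot_name, buf, sizeof(buf)) == 0;
}
```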

-- 
Vitaly


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Vitaly Kuznetsov
Andy Lutomirski  writes:

> On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti  wrote:
>>
>> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
>> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti  
>> > wrote:
>> > >
>> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
>> > > > Hi Vitaly, Paolo, Radim, etc.,
>> > > >
>> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  
>> > > > wrote:
>> > > > >
>> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>> > > > > implementation, which extended the clockid switch case and added yet
>> > > > > another slightly different copy of the same code.
>> > > > >
>> > > > > Especially the extended switch case is problematic as the compiler 
>> > > > > tends to
>> > > > > generate a jump table which then requires to use retpolines. If jump 
>> > > > > tables
>> > > > > are disabled it adds yet another conditional to the existing maze.
>> > > > >
>> > > > > This series takes a different approach by consolidating the almost
>> > > > > identical functions into one implementation for high resolution 
>> > > > > clocks and
>> > > > > one for the coarse grained clock ids by storing the base data for 
>> > > > > each
>> > > > > clock id in an array which is indexed by the clock id.
>> > > > >
>> > > >
>> > > > I was trying to understand more of the implications of this patch
>> > > > series, and I was again reminded that there is an entire extra copy of
>> > > > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
>> > > > that code is very, very opaque.
>> > > >
>> > > > Can one of you explain what the code is even doing?  From a couple of
>> > > > attempts to read through it, it's a whole bunch of
>> > > > probably-extremely-buggy code that,
>> > >
>> > > Yes, probably.
>> > >
>> > > > drumroll please, tries to atomically read the TSC value and the time.  
>> > > > And decide whether the
>> > > > result is "based on the TSC".
>> > >
>> > > I think "based on the TSC" refers to whether TSC clocksource is being
>> > > used.
>> > >
>> > > > And then synthesizes a TSC-to-ns
>> > > > multiplier and shift, based on *something other than the actual
>> > > > multiply and shift used*.
>> > > >
>> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
>> > > > private arch clocksource data intended for the vDSO, uses a poorly
>> > > > maintained copy of the vDSO code to read the time (instead of doing
>> > > > the sane thing and using the kernel interfaces for this), and
>> > > > propagates a totally made up copy to the guest.
>> > >
>> > > I posted kernel interfaces for this, and it was suggested to
>> > > instead write a "in-kernel user of pvclock data".
>> > >
>> > > If you can get kernel interfaces to replace that, go for it. I prefer
>> > > kernel interfaces as well.
>> > >
>> > > >  And gets it entirely
>> > > > wrong when doing nested virt, since, unless there's some secret in
>> > > > this maze, it doesn't acutlaly use the scaling factor from the host
>> > > > when it tells the guest what to do.
>> > > >
>> > > > I am really, seriously tempted to send a patch to simply delete all
>> > > > this code.
>> > >
>> > > If your patch which deletes the code gets the necessary features right,
>> > > sure, go for it.
>> > >
>> > > > The correct way to do it is to hook
>> > >
>> > > Can you expand on the correct way to do it?
>> > >
>> > > > And I don't see how it's even possible to pass kvmclock correctly to
>> > > > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
>> > > > L1 isn't notified when the data structure changes, so how the heck is
>> > > > it supposed to update the kvmclock structure?
>> > >
>> > > I don't parse your question.
>> >
>> > Let me ask it more intelligently: when the "reenlightenment" IRQ
>> > happens, what tells KVM to do its own update for its guests?
>>
>> Update of what, and why it needs to update anything from IRQ?
>>
>> The update i can think of is from host kernel clocksource,
>> which there is a notifier for.
>>
>>
>
> Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
> So we have the following sequence of events:
>
>  - L0 migrates the whole VM.  Starting now, RDTSC is emulated to match
> the old host, which applies in L1 and L2.
>
>  - An IRQ is queued to L1.
>
>  - L1 acknowledges that it noticed the TSC change.

Before the acknowledgement we actually pause all guests so they don't
notice the change 

>  RDTSC stops being emulated for L1 and L2.

 and right after that we update all kvmclocks for all L2s and
unpause them so all their readings are still correct (see
kvm_hyperv_tsc_notifier()).

>
>  - L2 reads the TSC.  It has no idea that anything changed, and it
> gets the wrong answer.

I have to admit I forgot what happens if L2 uses raw TSC. I *think* that
we actually adjust TSC offset along with adjusting kvmclocks so the
reading is still correct. I'll have to check this.

All bets are off in case L2 was using 

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Vitaly Kuznetsov
Marcelo Tosatti  writes:

> On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
>> 
>> There is a very long history of different (hardware) issues Marcelo was
>> fighting with, and the current code is the surviving Frankenstein.
>
> Right, the code has to handle different TSC modes.
>
>>  E.g. it
>> is very, very unclear what "catchup", "always catchup" and
>> masterclock-less mode in general are and if we still need them.
>
> Catchup means handling exposed (to guest) TSC frequency smaller than
> HW TSC frequency.
>
> That is "frankenstein" code, could be removed.
>
>> That said I'm all for simplification. I'm not sure if we still need to
>> care about buggy hardware though.
>
> What simplification is that again? 
>

I was hoping to hear this from you :-) If I am to suggest how we can
move forward I'd propose:
- Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
is supported).
- Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
clocksource is a single page for the whole VM, not a per-cpu thing. Can
we think that all the buggy hardware is already gone?

-- 
Vitaly


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Dave Hansen  writes:

> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>> It is more than just memmaps (e.g. forking udev process doing memory
>> onlining also needs memory) but yes, the main idea is to make the
>> onlining synchronous with hotplug.
>
> That's a good theoretical concern.
>
> But, is it a problem we need to solve in practice?

Yes, unfortunately. It was previously discovered that when we try to
hotplug tons of memory to a low memory system (this is a common scenario
with VMs) we end up with OOM because for all new memory blocks we need
to allocate page tables, struct pages, ... and we need memory to do
that. The userspace program doing memory onlining also needs memory to
run, and in case it prefers to fork to handle hundreds of notifications
... well, it may get OOM-killed before it manages to online anything.

Allocating all kernel objects from the newly hotplugged blocks would
definitely help to manage the situation but as I said this won't solve
the 'forking udev' problem completely (it will likely remain in
'extreme' cases only. We can probably work around it by onlining with a
dedicated process which doesn't do memory allocation).

-- 
Vitaly


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand  writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism effecively offloading decision-making to
>> userspace. We'll of couse need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs as it makes no sense to
>> online part of DIMM as normal and the rest as movable as it's going to
>> be impossible to unplug such DIMM anyways).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said some of the memory cannot be onlined without further steps
> (e.g. when it is standby as David called it) and then I fail to see how
> eBPF help in any way.

and also, we can fight till the end of days here trying to come up with
an onlining solution which would work for everyone; eBPF would move
this decision to the distro level.

-- 
Vitaly


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
David Hildenbrand  writes:

> On 02/10/2018 15:47, Michal Hocko wrote:
...
>> 
>> Why do you need a generic hotplug rule in the first place? Why don't you
>> simply provide different set of rules for different usecases? Let users
>> decide which usecase they prefer rather than try to be clever which
>> almost always hits weird corner cases.
>> 
>
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right. But yes, we
> should definitely allow to make modifications.

Last time I was thinking about the imperfectness of the auto-online
solution we have and any other solution we're able to suggest an idea
came to my mind - what if we add an eBPF attach point to the
auto-onlining mechanism, effectively offloading decision-making to
userspace. We'll of course need to provide all required data (e.g. how
memory blocks are aligned with physical DIMMs as it makes no sense to
online part of DIMM as normal and the rest as movable as it's going to
be impossible to unplug such DIMM anyways).

-- 
Vitaly


Re: [PATCH 02/16] hv_balloon: Replace spin_is_locked() with lockdep

2018-10-03 Thread Vitaly Kuznetsov
Lance Roy  writes:

> lockdep_assert_held() is better suited to checking locking requirements,
> since it won't get confused when someone else holds the lock. This is
> also a step towards possibly removing spin_is_locked().
>
> Signed-off-by: Lance Roy 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: 
> ---
>  drivers/hv/hv_balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index b1b788082793..41631512ae97 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -689,7 +689,7 @@ static void hv_page_online_one(struct hv_hotadd_state 
> *has, struct page *pg)
>   __online_page_increment_counters(pg);
>   __online_page_free(pg);
>  
> - WARN_ON_ONCE(!spin_is_locked(&dm_device.ha_lock));
> + lockdep_assert_held(&dm_device.ha_lock);
>   dm_device.num_pages_onlined++;
>  }

Reviewed-by: Vitaly Kuznetsov 

However,

lockdep_assert_held() is a no-op when !CONFIG_LOCKDEP but this doesn't
really matter: hv_page_online_one() is static and it has only two call
sites, both taking the dm_device.ha_lock lock - so the warning may just
go away.
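A toy userspace illustration of why lockdep_assert_held() is the better fit here: a spin_is_locked()-style check passes whenever *anyone* holds the lock, while the lockdep assertion checks that the *current* context is the holder (all names below are mocks, not kernel APIs):

```c
#include <assert.h>

/* Stand-in for 'current'; a real kernel tracks this per task. */
static int current_ctx = 1;

struct mock_lock { int owner; };       /* 0 = unlocked */

static void mock_lock(struct mock_lock *l)   { l->owner = current_ctx; }
static void mock_unlock(struct mock_lock *l) { l->owner = 0; }

/* spin_is_locked() analogue: true no matter who holds the lock. */
static int mock_is_locked(const struct mock_lock *l)
{
    return l->owner != 0;
}

/* lockdep_assert_held() analogue: the current context must hold it. */
static int mock_held_by_me(const struct mock_lock *l)
{
    return l->owner == current_ctx;
}
```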

-- 
Vitaly


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Vitaly Kuznetsov
Andy Lutomirski  writes:

>> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov  wrote:
>> 
>> Andy Lutomirski  writes:
>> 
>>> Hi Vitaly, Paolo, Radim, etc.,
>>> 
>> The notification you're talking about exists, it is called
Reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
>> reenlightenment"). When TSC page changes (and this only happens when L1
>> is migrated to a different host with a different TSC frequency and TSC
>> scaling is not supported by the CPU) we receive an interrupt in L1 (at
>> this moment all TSC accesses are emulated which guarantees the
>> correctness of the readings), pause all L2 guests, update their kvmclock
>> structures with new data (we already know the new TSC frequency) and
>> then tell L0 that we're done and it can stop emulating TSC accesses.
>
> That’s delightful!  Does the emulation magic also work for L1 user
> mode?

As far as I understand - yes, all rdtsc* calls will trap into L0.

>  If so, couldn’t we drop the HyperV vclock entirely and just
> fold the adjustment into the core timekeeping data?  (Preferably the
> actual core data, which would require core changes, but it could
> plausibly be done in arch code, too.)

Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
not mistaken, you need to enable nesting for the VM to get the feature -
and most VMs don't have this) so I think we'll have to keep Hyper-V
vclock for the time being.

-- 
Vitaly


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Vitaly Kuznetsov
Andy Lutomirski  writes:

> Hi Vitaly, Paolo, Radim, etc.,
>
> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  wrote:
>>
>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>> implementation, which extended the clockid switch case and added yet
>> another slightly different copy of the same code.
>>
>> Especially the extended switch case is problematic as the compiler tends to
>> generate a jump table which then requires to use retpolines. If jump tables
>> are disabled it adds yet another conditional to the existing maze.
>>
>> This series takes a different approach by consolidating the almost
>> identical functions into one implementation for high resolution clocks and
>> one for the coarse grained clock ids by storing the base data for each
>> clock id in an array which is indexed by the clock id.
>>
>
> I was trying to understand more of the implications of this patch
> series, and I was again reminded that there is an entire extra copy of
> the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> that code is very, very opaque.
>
> Can one of you explain what the code is even doing?  From a couple of
> attempts to read through it, it's a whole bunch of
> probably-extremely-buggy code that, drumroll please, tries to
> atomically read the TSC value and the time.  And decide whether the
> result is "based on the TSC".  And then synthesizes a TSC-to-ns
> multiplier and shift, based on *something other than the actual
> multiply and shift used*.
>
> IOW, unless I'm totally misunderstanding it, the code digs into the
> private arch clocksource data intended for the vDSO, uses a poorly
> maintained copy of the vDSO code to read the time (instead of doing
> the sane thing and using the kernel interfaces for this), and
> propagates a totally made up copy to the guest.  And gets it entirely
> wrong when doing nested virt, since, unless there's some secret in
> this maze, it doesn't actually use the scaling factor from the host
> when it tells the guest what to do.
>
> I am really, seriously tempted to send a patch to simply delete all
> this code.  The correct way to do it is to hook

"I have discovered a truly marvelous proof of this, which this margin is
too narrow to contain" :-)

There is a very long history of different (hardware) issues Marcelo was
fighting with and the current code is the surviving Frankenstein. E.g. it
is very, very unclear what "catchup", "always catchup" and
masterclock-less mode in general are and if we still need them.

That said I'm all for simplification. I'm not sure if we still need to
care about buggy hardware though.

>
> And I don't see how it's even possible to pass kvmclock correctly to
> the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> L1 isn't notified when the data structure changes, so how the heck is
> it supposed to update the kvmclock structure?

Well, this kind of works in the following way:
L1's clocksource is 'tsc_page' which is, basically, a complement to TSC:
two numbers provided by L0, offset and scale, and KVM was taught to treat
this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable
clocksource to guests when running nested on Hyper-V").

The notification you're talking about exists, it is called
Reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
reenlightenment"). When TSC page changes (and this only happens when L1
is migrated to a different host with a different TSC frequency and TSC
scaling is not supported by the CPU) we receive an interrupt in L1 (at
this moment all TSC accesses are emulated which guarantees the
correctness of the readings), pause all L2 guests, update their kvmclock
structures with new data (we already know the new TSC frequency) and
then tell L0 that we're done and it can stop emulating TSC accesses.

(Nothing like this exists for KVM-on-KVM, by the way, when L1's
clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)

-- 
Vitaly
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [patch 02/11] x86/time: Implement clocksource_arch_init()

2018-09-14 Thread Vitaly Kuznetsov
Thomas Gleixner  writes:

> Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
> VCLOCK if invalid, which disables the VDSO but keeps the system running.
>
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/Kconfig   |1 +
>  arch/x86/kernel/time.c |   16 
>  2 files changed, 17 insertions(+)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -48,6 +48,7 @@ config X86
>   select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
>   select ANON_INODES
>   select ARCH_CLOCKSOURCE_DATA
> + select ARCH_CLOCKSOURCE_INIT
>   select ARCH_DISCARD_MEMBLOCK
>   select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
>   select ARCH_HAS_DEBUG_VIRTUAL
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -10,6 +10,7 @@
>   *
>   */
>
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -105,3 +106,18 @@ void __init time_init(void)
>  {
>   late_time_init = x86_late_time_init;
>  }
> +
> +/*
> + * Sanity check the vdso related archdata content.
> + */
> +void clocksource_arch_init(struct clocksource *cs)
> +{
> + if (cs->archdata.vclock_mode == VCLOCK_NONE)
> + return;
> +
> + if (cs->archdata.vclock_mode >= VCLOCK_MAX) {

It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3

> + pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
> + cs->name, cs->archdata.vclock_mode);
> + cs->archdata.vclock_mode = VCLOCK_NONE;
> + }
> +}

-- 
  Vitaly


Re: [patch 10/11] x86/vdso: Move cycle_last handling into the caller

2018-09-14 Thread Vitaly Kuznetsov
Thomas Gleixner  writes:

> Dereferencing gtod->cycle_last all over the place and doing the cycles <
> last comparison in the vclock read functions generates horrible code. Doing
> it at the call site is much better and gains a few cycles both for TSC and
> pvclock.
>
> Caveat: This adds the comparison to the hyperv vclock as well, but I have
> no way to test that.

While the change looks obviously correct for Hyper-V vclock too, by saying
that you encouraged me to give the whole series a shot. I did and turns
out Hyper-V vclock is broken: simple clock_gettime(CLOCK_REALTIME, )
test goes from 70 cpu cycles to 1000 (meaning that we're falling back to
syscall I guess). Bisecting now, stay tuned!

-- 
  Vitaly


[PATCH] tools: hv: fcopy: set 'error' in case an unknown operation was requested

2018-09-04 Thread Vitaly Kuznetsov
'error' variable is left uninitialized in case we see an unknown operation.
As we don't immediately return and proceed to pwrite(), we need to set it
to something; HV_E_FAIL sounds good enough.

Signed-off-by: Vitaly Kuznetsov 
---
 tools/hv/hv_fcopy_daemon.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index d78aed86af09..8ff8cb1a11f4 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -234,6 +234,7 @@ int main(int argc, char *argv[])
break;
 
default:
+   error = HV_E_FAIL;
syslog(LOG_ERR, "Unknown operation: %d",
buffer.hdr.operation);
 
-- 
2.14.4



Re: [PATCH] vmbus: don't return values for uninitalized channels

2018-08-30 Thread Vitaly Kuznetsov
k...@linuxonhyperv.com writes:

> From: Stephen Hemminger 
>
> For unsupported device types, the vmbus channel ringbuffer is never
> initialized, and therefore reading the sysfs files will return garbage
> or cause a kernel OOPS.
>
> Fixes: c2e5df616e1a ("vmbus: add per-channel sysfs info")
>
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: K. Y. Srinivasan 
> Cc:  # 4.15

As this is also marked for stable, any chance we can get it in 4.19? Or
at least in char-misc-next?

> ---
>  drivers/hv/vmbus_drv.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index e6d8fdac6d8b..4bbc420d1213 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1368,6 +1368,9 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
>   if (!attribute->show)
>   return -EIO;
>
> + if (chan->state != CHANNEL_OPENED_STATE)
> + return -EINVAL;
> +
>   return attribute->show(chan, buf);
>  }

Thanks,

-- 
  Vitaly


Re: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-08-28 Thread Vitaly Kuznetsov
"Michael Kelley (EOSG)"  writes:

> From: Vitaly Kuznetsov   Sent: Wednesday, August 1, 2018 2:26 AM
>
>> > I was trying to decide if there are any arguments in favor of one
>> > approach vs. the other:  a per-cpu flag in memory or checking
>> > the synic_control "enable" bit.   Seems like a wash to me, in which
>> > case I have a slight preference for the per-cpu flag in memory vs.
>> > creating another function to return sctrl.enable.  But I'm completely
>> > open to reasons why checking sctrl.enable is better.
>> 
>> Just a few thoughts: reading MSR is definitely slower but we avoid
>> 'shadowing' the state, the reading is always correct. In case there's a
>> chance the SynIC will get disabled from host side we can only find this
>> out by doing MSR read. This is a purely theoretical possibility, I
>> believe, we can go ahead with this patch.
>
> Vitaly -- just to confirm:  you are OK with the patch as is?  (I'll
> check, but I may need to rebase on the latest code.)

Yes, feel free to use my R-b tag.

-- 
  Vitaly


Re: [PATCH] vmbus: don't return values for uninitalized channels

2018-08-13 Thread Vitaly Kuznetsov
Stephen Hemminger  writes:

> For unsupported device types, the vmbus channel ringbuffer is never
> initialized, and therefore reading the sysfs files will return garbage
> or cause a kernel OOPS.
>
> Fixes: c2e5df616e1a ("vmbus: add per-channel sysfs info")
> Signed-off-by: Stephen Hemminger 
> ---
>  drivers/hv/vmbus_drv.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b10fe26c4891..c9a466be7709 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1178,6 +1178,9 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
>   if (!attribute->show)
>   return -EIO;
>
> + if (chan->state != CHANNEL_OPENED_STATE)
> + return -EINVAL;
> +
>   return attribute->show(chan, buf);
>  }

Tested-by: Vitaly Kuznetsov 

-- 
  Vitaly


Re: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-08-01 Thread Vitaly Kuznetsov
"Michael Kelley (EOSG)"  writes:

> From: Vitaly Kuznetsov  Sent: Tuesday, July 31, 2018 4:20 AM
>> 
>> Alternatively, we can get rid of synic_initialized flag altogether:
>> hv_synic_init() never fails in the first place but we can always
>> implement something like:
>> 
>> int hv_synic_is_initialized(void) {
>>  union hv_synic_scontrol sctrl;
>> 
>>  hv_get_synic_state(sctrl.as_uint64);
>> 
>>  return sctrl.enable;
>> }
>> 
>> as it doesn't seem that we need to check synic state on _other_ CPUs.
>> 
>
> I was trying to decide if there are any arguments in favor of one
> approach vs. the other:  a per-cpu flag in memory or checking
> the synic_control "enable" bit.   Seems like a wash to me, in which
> case I have a slight preference for the per-cpu flag in memory vs.
> creating another function to return sctrl.enable.  But I'm completely
> open to reasons why checking sctrl.enable is better.

Just a few thoughts: reading MSR is definitely slower but we avoid
'shadowing' the state, the reading is always correct. In case there's a
chance the SynIC will get disabled from host side we can only find this
out by doing MSR read. This is a purely theoretical possibility, I
believe, we can go ahead with this patch.

-- 
  Vitaly


Re: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-07-31 Thread Vitaly Kuznetsov
mhkelle...@gmail.com writes:

> From: Michael Kelley 
>
> The synic_initialized flag is part of the global hv_context
> structure.  But the Hyper-V synthetic interrupt controller is
> fundamentally a per-cpu device, and other synic related
> fields are in hv_per_cpu_context.  In a multi-CPU system,
> synic_initialized gets set multiple times, making the test in
> hv_synic_cleanup() invalid.  Fix this by moving the flag to
> hv_per_cpu_context and adjusting the references.
>
> Signed-off-by: Michael Kelley 
> ---
>  drivers/hv/hv.c   | 16 +++-
>  drivers/hv/hyperv_vmbus.h |  4 ++--
>  2 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 312fe5e..8d4fe0e 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -33,9 +33,7 @@
>  #include "hyperv_vmbus.h"
>
>  /* The one and only */
> -struct hv_context hv_context = {
> - .synic_initialized  = false,
> -};
> +struct hv_context hv_context;
>
>  /*
>   * If false, we're using the old mechanism for stimer0 interrupts
> @@ -315,7 +313,7 @@ int hv_synic_init(unsigned int cpu)
>
>   hv_set_synic_state(sctrl.as_uint64);
>
> - hv_context.synic_initialized = true;
> + hv_cpu->synic_initialized = true;
>
>   /*
>* Register the per-cpu clockevent source.
> @@ -354,6 +352,8 @@ void hv_synic_clockevents_cleanup(void)
>   */
>  int hv_synic_cleanup(unsigned int cpu)
>  {
> + struct hv_per_cpu_context *hv_cpu
> + = per_cpu_ptr(hv_context.cpu_context, cpu);
>   union hv_synic_sint shared_sint;
>   union hv_synic_simp simp;
>   union hv_synic_siefp siefp;
> @@ -362,7 +362,7 @@ int hv_synic_cleanup(unsigned int cpu)
>   bool channel_found = false;
>   unsigned long flags;
>
> - if (!hv_context.synic_initialized)
> + if (!hv_cpu->synic_initialized)
>   return -EFAULT;
>
>   /*
> @@ -395,12 +395,8 @@ int hv_synic_cleanup(unsigned int cpu)
>
>   /* Turn off clockevent device */
>   if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> - struct hv_per_cpu_context *hv_cpu
> - = this_cpu_ptr(hv_context.cpu_context);
> -
>   clockevents_unbind_device(hv_cpu->clk_evt, cpu);
>   hv_ce_shutdown(hv_cpu->clk_evt);
> - put_cpu_ptr(hv_cpu);
>   }
>
>   hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
> @@ -428,5 +424,7 @@ int hv_synic_cleanup(unsigned int cpu)
>   sctrl.enable = 0;
>   hv_set_synic_state(sctrl.as_uint64);
>
> + hv_cpu->synic_initialized = false;
> +
>   return 0;
>  }
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 72eaba3..eadd3df 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -202,6 +202,8 @@ enum {
>  struct hv_per_cpu_context {
>   void *synic_message_page;
>   void *synic_event_page;
> + bool synic_initialized;
> +
>   /*
>* buffer to post messages to the host.
>*/
> @@ -230,8 +232,6 @@ struct hv_context {
>
>   void *tsc_page;
>
> - bool synic_initialized;
> -
>   struct hv_per_cpu_context __percpu *cpu_context;
>
>   /*

Reviewed-by: Vitaly Kuznetsov 

Alternatively, we can get rid of synic_initialized flag altogether:
hv_synic_init() never fails in the first place but we can always
implement something like:

int hv_synic_is_initialized(void) {
union hv_synic_scontrol sctrl;

hv_get_synic_state(sctrl.as_uint64);

return sctrl.enable;
}

as it doesn't seem that we need to check synic state on _other_ CPUs.

-- 
  Vitaly


Re: [PATCH v3] hv_netvsc: Add per-cpu ethtool stats for netvsc

2018-07-24 Thread Vitaly Kuznetsov
Yidong Ren  writes:

>> From: Yidong Ren 
>> Sent: Monday, July 23, 2018 6:26 PM
>> +pcpu_sum = kvmalloc(sizeof(struct netvsc_ethtool_pcpu_stats) *
>> +num_present_cpus(), GFP_KERNEL);
>
> Since there is no plan for CPU hotplug in Hyper-V in short term, it is fine 
> to use num_present_cpus for now. We can move to debugfs later if necessary.

While you do for_each_present_cpu() in netvsc_get_ethtool_stats(),
netvsc_get_pcpu_stats() does for_each_possible_cpu(). This looks
inconsistent.

The allocation you're doing here is short-lived so I would suggest you
use possible_cpus everywhere. Even knowing there's no CPU hotplug on
Hyper-V at this moment, it can appear later and we'll get a hard-to-find
issue. Moreover, we may consider using netvsc driver on e.g. KVM with
Hyper-V enlightenments and KVM has CPU hotplug already.

-- 
  Vitaly


[PATCH 2/2] x86/hyper-v: check for VP_INVAL in hyperv_flush_tlb_others()

2018-07-09 Thread Vitaly Kuznetsov
Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
 enlightenment") pre-filled hv_vp_index with VP_INVAL so it is now
(theoretically) possible to observe hv_cpu_number_to_vp_number()
returning VP_INVAL. We need to check for that in hyperv_flush_tlb_others().

Not checking for VP_INVAL on the first call site where we do

 if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
goto do_ex_hypercall;

is OK: in case we're eligible for the non-ex hypercall we'll catch the
issue later in the for_each_cpu() cycle, and in case we'll be doing the ex-
hypercall, cpumask_to_vpset() will fail.

It would be nice to change hv_cpu_number_to_vp_number() return
value's type to 'u32' but this will likely be a bigger change as
all call sites need to be checked first.

Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment")
Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/mmu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 453d2355cd61..1147e1fed7ff 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -111,6 +111,11 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 
for_each_cpu(cpu, cpus) {
vcpu = hv_cpu_number_to_vp_number(cpu);
+   if (vcpu == VP_INVAL) {
+   local_irq_restore(flags);
+   goto do_native;
+   }
+
if (vcpu >= 64)
goto do_ex_hypercall;
 
-- 
2.14.4



[PATCH 0/2] x86/hyper-v: cope with VP_INVAL in PV TLB flush code

2018-07-09 Thread Vitaly Kuznetsov
Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
 enlightenment") made it possible to observe VP_INVAL returned from 
hv_cpu_number_to_vp_number() and cpumask_to_vpset() and PV TLB flush
code needs to be adjusted.

The window when VP_INVAL is observable is very short, I'm not even sure
we do TLB flushes during this period (secondary CPUs bringup on boot; there
is no CPU hotplug on Hyper-V yet). This, however, may change in future so
let's fix this now.

Thomas, Ingo: these patches are for 'tip/x86/hyperv'. I don't think we have
a real issue in 4.18 but I can definitely prepare fixes for it if you think
this is needed.

Vitaly Kuznetsov (2):
  x86/hyper-v: check cpumask_to_vpset() return value in
hyperv_flush_tlb_others_ex()
  x86/hyper-v: check for VP_INVAL in hyperv_flush_tlb_others()

 arch/x86/hyperv/mmu.c | 7 +++
 1 file changed, 7 insertions(+)

-- 
2.14.4



[PATCH 1/2] x86/hyper-v: check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()

2018-07-09 Thread Vitaly Kuznetsov
Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
enlightenment") made cpumask_to_vpset() return '-1' when there is a CPU
with unknown VP index in the supplied set. This needs to be checked before
we pass 'nr_bank' to hypercall.

Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment")
Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/mmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 0d90e515ec98..453d2355cd61 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -186,6 +186,8 @@ static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
 
flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
nr_bank = cpumask_to_vpset(&(flush->hv_vp_set), cpus);
+   if (nr_bank < 0)
+   return U64_MAX;
 
/*
 * We can flush not more than max_gvas with one hypercall. Flush the
-- 
2.14.4



Re: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-04 Thread Vitaly Kuznetsov
k...@linuxonhyperv.com writes:

> From: "K. Y. Srinivasan" 
>
> The IPI hypercalls depend on being able to map the Linux notion of CPU ID
> to the hypervisor's notion of the CPU ID. The array hv_vp_index[] provides
> this mapping. Code for populating this array depends on the IPI functionality.
> Break this circular dependency.
>
> Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
>
> Signed-off-by: K. Y. Srinivasan 
> Tested-by: Michael Kelley 
> ---
>  arch/x86/hyperv/hv_apic.c   | 5 +
>  arch/x86/hyperv/hv_init.c   | 5 -
>  arch/x86/include/asm/mshyperv.h | 2 ++
>  3 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index f68855499391..63d7c196739f 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -114,6 +114,8 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector)
>   ipi_arg->vp_set.format = HV_GENERIC_SET_SPARSE_4K;
>   nr_bank = cpumask_to_vpset(&(ipi_arg->vp_set), mask);
>   }
> + if (nr_bank == -1)
> + goto ipi_mask_ex_done;
>   if (!nr_bank)
>   ipi_arg->vp_set.format = HV_GENERIC_SET_ALL;
>
> @@ -158,6 +160,9 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>
>   for_each_cpu(cur_cpu, mask) {
>   vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> + if (vcpu == -1)
> + goto ipi_mask_done;
> +

Nit: hv_vp_index is u32 *, it would make sense to use U32_MAX instead of
'-1' everywhere in this patch.

>   /*
>* This particular version of the IPI hypercall can
>* only target upto 64 CPUs.
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 4c431e1c1eff..04159893702e 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -265,7 +265,7 @@ void __init hyperv_init(void)
>  {
>   u64 guest_id, required_msrs;
>   union hv_x64_msr_hypercall_contents hypercall_msr;
> - int cpuhp;
> + int cpuhp, i;
>
>   if (x86_hyper_type != X86_HYPER_MS_HYPERV)
>   return;
> @@ -293,6 +293,9 @@ void __init hyperv_init(void)
>   if (!hv_vp_index)
>   return;
>
> + for (i = 0; i < num_possible_cpus(); i++)
> + hv_vp_index[i] = -1;
> +
>   hv_vp_assist_page = kcalloc(num_possible_cpus(),
>   sizeof(*hv_vp_assist_page), GFP_KERNEL);
>   if (!hv_vp_assist_page) {
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 3cd14311edfa..dee3f7347253 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -281,6 +281,8 @@ static inline int cpumask_to_vpset(struct hv_vpset *vpset,
>*/
>   for_each_cpu(cpu, cpus) {
>   vcpu = hv_cpu_number_to_vp_number(cpu);
> + if (vcpu == -1)
> + return -1;
>   vcpu_bank = vcpu / 64;
>   vcpu_offset = vcpu % 64;
>   __set_bit(vcpu_offset, (unsigned long *)

-- 
  Vitaly


Re: [PATCH 0/4] x86/hyper-v: optimize PV IPIs

2018-06-28 Thread Vitaly Kuznetsov
Vitaly Kuznetsov  writes:

> Wanpeng Li  writes:
>
>> Hi Vitaly, (fix my reply mess this time)
>> On Sat, 23 Jun 2018 at 01:09, Vitaly Kuznetsov  wrote:
>>>
>>> When reviewing my "x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_
>>> {LIST,SPACE} hypercalls when possible" patch Michael suggested to apply the
>>> same idea to PV IPIs. Here we go!
>>>
>>> Despite what Hyper-V TLFS says about HVCALL_SEND_IPI hypercall, it can
>>> actually be 'fast' (passing parameters through registers). Use that too.
>>>
>>> This series can collide with my "KVM: x86: hyperv: PV IPI support for
>>> Windows guests" series as I rename ipi_arg_non_ex/ipi_arg_ex structures
>>> there. Depending on which one gets in first we may need to do tiny
>>> adjustments.
>>
>> As hyperv PV TLB flush has already been merged, is there any other
>> obvious multicast IPIs scenarios? qemu supports interrupt remapping
>> since two years ago, I think windows guest can switch to cluster mode
>> after entering x2APIC, so sending IPI per cluster. In addition, you
>> can also post the benchmark result for this PV IPI optimization,
>> although it also fixes the bug which you mentioned above.
>
> I got confused, which of my patch series are you actually looking at?
> :-)
>
> This particular one ("x86/hyper-v: optimize PV IPIs") is not about
> KVM/qemu, it is for Linux running on top on real Hyper-V server. We
> already support PV IPIs and here I'm just trying to optimize the way how
> we send them by switching to a cheaper hypercall (and using 'fast'
> version of it) when possible. I don't actually have a good benchmark
> (and I don't remember seeing one when K.Y. posted PV IPI support) but
> this can be arranged I guess: I can write a dump 'IPI sender' in kernel
> and send e.g. 1000 IPIs.

So I used the IPI benchmark (https://lkml.org/lkml/2017/12/19/141,
thanks for the tip!) on this series. On a 16 vCPU guest (WS2016) I'm
getting the following:

Before:
Dry-run:0   203110
Self-IPI:   6167430 11645550
Normal IPI: 380479300   475881820
Broadcast IPI:  0   2557371420

After:
Dry-run:0   214280 (not interesting)
Self-IPI:   5706210 10697640   (- 8%)
Normal IPI: 379330010   450158830  (- 5%)
Broadcast IPI:  0   2340427160 (- 8%)

-- 
  Vitaly


Re: [PATCH 0/4] x86/hyper-v: optimize PV IPIs

2018-06-27 Thread Vitaly Kuznetsov
Wanpeng Li  writes:

> Hi Vitaly, (fix my reply mess this time)
> On Sat, 23 Jun 2018 at 01:09, Vitaly Kuznetsov  wrote:
>>
>> When reviewing my "x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_
>> {LIST,SPACE} hypercalls when possible" patch Michael suggested to apply the
>> same idea to PV IPIs. Here we go!
>>
>> Despite what Hyper-V TLFS says about HVCALL_SEND_IPI hypercall, it can
>> actually be 'fast' (passing parameters through registers). Use that too.
>>
>> This series can collide with my "KVM: x86: hyperv: PV IPI support for
>> Windows guests" series as I rename ipi_arg_non_ex/ipi_arg_ex structures
>> there. Depending on which one gets in first we may need to do tiny
>> adjustments.
>
> As hyperv PV TLB flush has already been merged, is there any other
> obvious multicast IPIs scenarios? qemu supports interrupt remapping
> since two years ago, I think windows guest can switch to cluster mode
> after entering x2APIC, so sending IPI per cluster. In addition, you
> can also post the benchmark result for this PV IPI optimization,
> although it also fixes the bug which you mentioned above.

I got confused, which of my patch series are you actually looking at?
:-)

This particular one ("x86/hyper-v: optimize PV IPIs") is not about
KVM/qemu, it is for Linux running on top on real Hyper-V server. We
already support PV IPIs and here I'm just trying to optimize the way how
we send them by switching to a cheaper hypercall (and using 'fast'
version of it) when possible. I don't actually have a good benchmark
(and I don't remember seeing one when K.Y. posted PV IPI support) but
this can be arranged I guess: I can write a dumb 'IPI sender' in kernel
and send e.g. 1000 IPIs.

-- 
  Vitaly


Re: [PATCH 0/4] x86/hyper-v: optimize PV IPIs

2018-06-27 Thread Vitaly Kuznetsov
Wanpeng Li  writes:

> Hi Vitaly, (fix my reply mess this time)
> On Sat, 23 Jun 2018 at 01:09, Vitaly Kuznetsov  wrote:
>>
>> When reviewing my "x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_
>> {LIST,SPACE} hypercalls when possible" patch Michael suggested to apply the
>> same idea to PV IPIs. Here we go!
>>
>> Despite what Hyper-V TLFS says about HVCALL_SEND_IPI hypercall, it can
>> actually be 'fast' (passing parameters through registers). Use that too.
>>
>> This series can collide with my "KVM: x86: hyperv: PV IPI support for
>> Windows guests" series as I rename ipi_arg_non_ex/ipi_arg_ex structures
>> there. Depending on which one gets in first we may need to do tiny
>> adjustments.
>
> As hyperv PV TLB flush has already been merged, is there any other
> obvious multicast IPIs scenarios? qemu supports interrupt remapping
> since two years ago, I think windows guest can switch to cluster mode
> after entering x2APIC, so sending IPI per cluster. 

When we manifest ourselves as Hyper-V Windows 'forgets' about x2apic
mode: Hyper-V has a concept of 'Synthetic interrupt controller' - an
xapic extension which we also support in KVM. I don't really know any
obvious scenarios for mass IPIs in Windows besides TLB flush but I'm
worried they may exist. Without PV IPIs any such attempt will likely
lead to a crash.

In general, I do care more about completeness and correctness of our
Hyper-V emulation at this point: Windows is only being tested on 'real'
Hyper-Vs so when we emulate a subset of enlightenments we're on our own
when something is not working. It is also very helpful for
Linux-on-Hyper-V development as we can see how Windows-on-Hyper-V
behaves :-)

> In addition, you
> can also post the benchmark result for this PV IPI optimization,
> although it also fixes the bug which you mentioned above.

I'd love to get to know how to trigger mass IPIs in Windows so a
benchmark can be performed...

> I can post one variant for Linux guest PV IPI if it also makes
> sense. :)

With x2apic support I'm actually not sure. Maybe configurations with
a very large number of vCPUs and IPIs going to > 256 vCPUs can benefit
from a 'single hypercall' solution.

-- 
  Vitaly


[PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when possible

2018-06-22 Thread Vitaly Kuznetsov
When there is no need to send an IPI to a CPU with VP number > 64
we can do the job with fast HVCALL_SEND_IPI hypercall.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_apic.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 90055f89223b..ee962784d25b 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -99,6 +99,9 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector)
int nr_bank = 0;
int ret = 1;
 
+   if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
+   return false;
+
local_irq_save(flags);
arg = (struct ipi_arg_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
 
@@ -140,8 +143,18 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
return false;
 
-   if ((ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
-   return __send_ipi_mask_ex(mask, vector);
+   /*
+* From the supplied CPU set we need to figure out if we can get away
+* with cheaper HVCALL_SEND_IPI hypercall. This is possible when the
+* highest VP number in the set is < 64. As VP numbers are usually in
+* ascending order and match Linux CPU ids, here is an optimization:
+* we check the VP number for the highest bit in the supplied set first
+* so we can quickly find out if using HVCALL_SEND_IPI_EX hypercall is
+* a must. We will also check all VP numbers when walking the supplied
+* CPU set to remain correct in all cases.
+*/
+   if (hv_cpu_number_to_vp_number(cpumask_last(mask)) >= 64)
+   goto do_ex_hypercall;
 
ipi_arg.vector = vector;
ipi_arg.cpu_mask = 0;
@@ -153,16 +166,17 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
 * only target upto 64 CPUs.
 */
if (vcpu >= 64)
-   goto ipi_mask_done;
+   goto do_ex_hypercall;
 
	__set_bit(vcpu, (unsigned long *)&ipi_arg.cpu_mask);
}
 
ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, ipi_arg.vector,
 ipi_arg.cpu_mask);
-
-ipi_mask_done:
return ((ret == 0) ? true : false);
+
+do_ex_hypercall:
+   return __send_ipi_mask_ex(mask, vector);
 }
 
 static bool __send_ipi_one(int cpu, int vector)
@@ -218,10 +232,7 @@ static void hv_send_ipi_self(int vector)
 void __init hv_apic_init(void)
 {
if (ms_hyperv.hints & HV_X64_CLUSTER_IPI_RECOMMENDED) {
-   if ((ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
-   pr_info("Hyper-V: Using ext hypercalls for IPI\n");
-   else
-   pr_info("Hyper-V: Using IPI hypercalls\n");
+   pr_info("Hyper-V: Using IPI hypercalls\n");
/*
 * Set the IPI entry points.
 */
-- 
2.14.4



[PATCH 4/4] x86/hyper-v: trace PV IPI send

2018-06-22 Thread Vitaly Kuznetsov
Trace Hyper-V PV IPIs the same way we do PV TLB flush.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_apic.c   |  4 
 arch/x86/include/asm/trace/hyperv.h | 15 +++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index ee962784d25b..657a2b8c738a 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -31,6 +31,8 @@
 #include 
 #include 
 
+#include 
+
 static struct apic orig_apic;
 
 static u64 hv_apic_icr_read(void)
@@ -134,6 +136,8 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
struct ipi_arg_non_ex ipi_arg;
int ret = 1;
 
+   trace_hyperv_send_ipi_mask(mask, vector);
+
if (cpumask_empty(mask))
return true;
 
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index 4253bca99989..9c0d4b588e3f 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -28,6 +28,21 @@ TRACE_EVENT(hyperv_mmu_flush_tlb_others,
  __entry->addr, __entry->end)
);
 
+TRACE_EVENT(hyperv_send_ipi_mask,
+   TP_PROTO(const struct cpumask *cpus,
+int vector),
+   TP_ARGS(cpus, vector),
+   TP_STRUCT__entry(
+   __field(unsigned int, ncpus)
+   __field(int, vector)
+   ),
+   TP_fast_assign(__entry->ncpus = cpumask_weight(cpus);
+  __entry->vector = vector;
+   ),
+   TP_printk("ncpus %d vector %x",
+ __entry->ncpus, __entry->vector)
+   );
+
 #endif /* CONFIG_HYPERV */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.14.4



[PATCH 0/4] x86/hyper-v: optimize PV IPIs

2018-06-22 Thread Vitaly Kuznetsov
When reviewing my "x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_
{LIST,SPACE} hypercalls when possible" patch Michael suggested to apply the
same idea to PV IPIs. Here we go!

Despite what Hyper-V TLFS says about HVCALL_SEND_IPI hypercall, it can
actually be 'fast' (passing parameters through registers). Use that too.

This series can collide with my "KVM: x86: hyperv: PV IPI support for
Windows guests" series as I rename ipi_arg_non_ex/ipi_arg_ex structures
there. Depending on which one gets in first we may need to do tiny
adjustments.

Vitaly Kuznetsov (4):
  x86/hyper-v: implement hv_do_fast_hypercall16
  x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI
  x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when possible
  x86/hyper-v: trace PV IPI send

 arch/x86/hyperv/hv_apic.c   | 57 -
 arch/x86/include/asm/mshyperv.h | 34 ++
 arch/x86/include/asm/trace/hyperv.h | 15 ++
 3 files changed, 80 insertions(+), 26 deletions(-)

-- 
2.14.4



[PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI

2018-06-22 Thread Vitaly Kuznetsov
Current Hyper-V TLFS (v5.0b) claims that the HvCallSendSyntheticClusterIpi
hypercall can't be 'fast' (passing parameters through registers), but
apparently this is not true: Windows always uses the 'fast' version. We
can do the same in Linux too.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_apic.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index f68855499391..90055f89223b 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -128,10 +128,8 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector)
 static bool __send_ipi_mask(const struct cpumask *mask, int vector)
 {
int cur_cpu, vcpu;
-   struct ipi_arg_non_ex **arg;
-   struct ipi_arg_non_ex *ipi_arg;
+   struct ipi_arg_non_ex ipi_arg;
int ret = 1;
-   unsigned long flags;
 
if (cpumask_empty(mask))
return true;
@@ -145,16 +143,8 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
if ((ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
return __send_ipi_mask_ex(mask, vector);
 
-   local_irq_save(flags);
-   arg = (struct ipi_arg_non_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
-
-   ipi_arg = *arg;
-   if (unlikely(!ipi_arg))
-   goto ipi_mask_done;
-
-   ipi_arg->vector = vector;
-   ipi_arg->reserved = 0;
-   ipi_arg->cpu_mask = 0;
+   ipi_arg.vector = vector;
+   ipi_arg.cpu_mask = 0;
 
for_each_cpu(cur_cpu, mask) {
vcpu = hv_cpu_number_to_vp_number(cur_cpu);
@@ -165,13 +155,13 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
if (vcpu >= 64)
goto ipi_mask_done;
 
-   __set_bit(vcpu, (unsigned long *)&ipi_arg->cpu_mask);
+   __set_bit(vcpu, (unsigned long *)_arg.cpu_mask);
}
 
-   ret = hv_do_hypercall(HVCALL_SEND_IPI, ipi_arg, NULL);
+   ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, ipi_arg.vector,
+ipi_arg.cpu_mask);
 
 ipi_mask_done:
-   local_irq_restore(flags);
return ((ret == 0) ? true : false);
 }
 
-- 
2.14.4



[PATCH 1/4] x86/hyper-v: implement hv_do_fast_hypercall16

2018-06-22 Thread Vitaly Kuznetsov
Implement a 'fast' hypercall with two 64-bit input parameters. This is
going to be used for the HvCallSendSyntheticClusterIpi hypercall.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/include/asm/mshyperv.h | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 3cd14311edfa..da25642940d3 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -193,6 +193,40 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
return hv_status;
 }
 
+/* Fast hypercall with 16 bytes of input */
+static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
+{
+   u64 hv_status, control = (u64)code | HV_HYPERCALL_FAST_BIT;
+
+#ifdef CONFIG_X86_64
+   {
+   __asm__ __volatile__("mov %4, %%r8\n"
+CALL_NOSPEC
+: "=a" (hv_status), ASM_CALL_CONSTRAINT,
+  "+c" (control), "+d" (input1)
+: "r" (input2),
+  THUNK_TARGET(hv_hypercall_pg)
+: "cc", "r8", "r9", "r10", "r11");
+   }
+#else
+   {
+   u32 input1_hi = upper_32_bits(input1);
+   u32 input1_lo = lower_32_bits(input1);
+   u32 input2_hi = upper_32_bits(input2);
+   u32 input2_lo = lower_32_bits(input2);
+
+   __asm__ __volatile__ (CALL_NOSPEC
+ : "=A"(hv_status),
+   "+c"(input1_lo), ASM_CALL_CONSTRAINT
+ : "A" (control), "b" (input1_hi),
+   "D"(input2_hi), "S"(input2_lo),
+   THUNK_TARGET(hv_hypercall_pg)
+ : "cc");
+   }
+#endif
+   return hv_status;
+}
+
 /*
  * Rep hypercalls. Callers of this functions are supposed to ensure that
  * rep_count and varhead_size comply with Hyper-V hypercall definition.
-- 
2.14.4



Re: Will the name of hyperv_clocksource_tsc_page or hyperv_clocksource pages change?

2018-06-22 Thread Vitaly Kuznetsov
"Alma Eyre (Sonata Software North America)" 
writes:

> Hello,
>
> This is Alma supporting Azure for Japanese customers. I had a question from a 
> customer that I could not find the answers for. I saw this 
> github(https://github.com/torvalds/linux/commit/88c9281a9fba67636ab26c1fd6afbc78a632374f)
>  page, and I was
> wondering if someone on this list might be able to answer the question.
>
> Will the name of hyperv_clocksource_tsc_page or hyperv_clocksource pages 
> change?
>
> Background:
>
> The customer is experiencing "tsc: Fast TSC calibration failed" error
> on their CentOS 7.4(3.10.0-693.11.6.el7) VM.

Hi Alma,

I think the following upstream commit would help:

commit 71c2a2d0a81f096a2932fccb39a500116fece554
Author: Vitaly Kuznetsov 
Date:   Thu Jun 22 18:07:30 2017 +0800

x86/hyperv: Read TSC frequency from a synthetic MSR

>
> My research:
>
> Although I could find information that both of these pages are maintained by 
> Microsoft, I could not find any information about whether these pages are 
> subject to name change.
>
> Regarding both
>
> Here(https://github.com/torvalds/linux/commit/88c9281a9fba67636ab26c1fd6afbc78a632374f)
>  it says "On Hyper-V platform there
>
> are two good clocksources: MSR-based hyperv_clocksource and
>
> recently introduced TSC page."
>
> Regarding hyperv_clocksource_tsc_page
>
> The mechanism is detailed 
> here(https://opensource.com/article/17/6/timekeeping-linux-vms) but whether 
> or not this page will ever change names is not noted. It also says "Microsoft 
> reinvented the pv_clock protocol with their own TSC page protocol, "
>
> Here(https://lists.linuxfoundation.org/pipermail/virtualization/2017-February/034235.html)
>  it says that the TSC page is documented, but I cannot find the documentation.
>

TSC page clocksource is documented in TLFS:
https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf
(12.6.2 Format of the Reference TSC Page)

But to be honest I didn't get your question. In case you're asking if
'hyperv_clocksource_tsc_page' name is stable then the answer is: there
is no guarantee. Nobody is likely to change the name just for the sake
of change, but it can be changed for a reason.

-- 
  Vitaly


[PATCH v2] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-21 Thread Vitaly Kuznetsov
While working on Hyper-V style PV TLB flush support in KVM I noticed that
real Windows guests use TLB flush hypercall in a somewhat smarter way: when
the flush needs to be performed on a subset of first 64 vCPUs or on all
present vCPUs Windows avoids more expensive hypercalls which support
sparse CPU sets and uses their 'cheap' counterparts. This means that
HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX
hypercalls (which support sparse CPU sets) are "available", not
"recommended". This makes sense as they are actually harder to parse.

Nothing stops us from being equally 'smart' in Linux too. Switch to
doing cheaper hypercalls whenever possible.

Signed-off-by: Vitaly Kuznetsov 
Reviewed-by: Michael Kelley 
---
Changes since v1:
- Re-worded comment about the optimization in hyperv_flush_tlb_others()
  [Thomas Gleixner ]
---
 arch/x86/hyperv/mmu.c | 73 ++-
 1 file changed, 32 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index de27615c51ea..0d90e515ec98 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -16,6 +16,8 @@
 /* Each gva in gva_list encodes up to 4096 pages to flush */
 #define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
 
+static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info);
 
 /*
  * Fills in gva_list starting from offset. Returns the number of items added.
@@ -93,10 +95,24 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
if (cpumask_equal(cpus, cpu_present_mask)) {
flush->flags |= HV_FLUSH_ALL_PROCESSORS;
} else {
+   /*
+* From the supplied CPU set we need to figure out if we can get
+* away with cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
+* hypercalls. This is possible when the highest VP number in
+* the set is < 64. As VP numbers are usually in ascending order
+* and match Linux CPU ids, here is an optimization: we check
+* the VP number for the highest bit in the supplied set first
+* so we can quickly find out if using *_EX hypercalls is a
+* must. We will also check all VP numbers when walking the
+* supplied CPU set to remain correct in all cases.
+*/
+   if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
+   goto do_ex_hypercall;
+
for_each_cpu(cpu, cpus) {
vcpu = hv_cpu_number_to_vp_number(cpu);
if (vcpu >= 64)
-   goto do_native;
+   goto do_ex_hypercall;
 
__set_bit(vcpu, (unsigned long *)
  &flush->processor_mask);
@@ -123,7 +139,12 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
status = hv_do_rep_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST,
 gva_n, 0, flush, NULL);
}
+   goto check_status;
+
+do_ex_hypercall:
+   status = hyperv_flush_tlb_others_ex(cpus, info);
 
+check_status:
local_irq_restore(flags);
 
if (!(status & HV_HYPERCALL_RESULT_MASK))
@@ -132,35 +153,22 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
native_flush_tlb_others(cpus, info);
 }
 
-static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
-  const struct flush_tlb_info *info)
+static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
 {
int nr_bank = 0, max_gvas, gva_n;
struct hv_tlb_flush_ex **flush_pcpu;
struct hv_tlb_flush_ex *flush;
-   u64 status = U64_MAX;
-   unsigned long flags;
-
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   u64 status;
 
-   if (!hv_hypercall_pg)
-   goto do_native;
-
-   if (cpumask_empty(cpus))
-   return;
-
-   local_irq_save(flags);
+   if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
+   return U64_MAX;
 
flush_pcpu = (struct hv_tlb_flush_ex **)
 this_cpu_ptr(hyperv_pcpu_input_arg);
 
flush = *flush_pcpu;
 
-   if (unlikely(!flush)) {
-   local_irq_restore(flags);
-   goto do_native;
-   }
-
if (info->mm) {
/*
 * AddressSpace argument must match the CR3 with PCID bits
@@ -176,15 +184,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
 
flush->hv_vp_set.valid_bank_mask = 0;
 
-   if (!cpumask_equal(cpus, cpu_present_mask)) {
-   flush->hv_vp_set.format

Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-20 Thread Vitaly Kuznetsov
KY Srinivasan  writes:

>> -Original Message-
>> From: Michael Kelley (EOSG)
>> Sent: Tuesday, June 19, 2018 10:57 AM
>> To: Vitaly Kuznetsov ; x...@kernel.org
>> Cc: de...@linuxdriverproject.org; linux-ker...@vger.kernel.org; KY
>> Srinivasan ; Haiyang Zhang
>> ; Stephen Hemminger
>> ; Thomas Gleixner ; Ingo
>> Molnar ; H. Peter Anvin ; Tianyu Lan
>> 
>> Subject: RE: [PATCH] x86/hyper-v: use cheaper
>> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
>> 
>> ...
>>>
>> This is a good idea.  We should probably do the same with the hypercalls for
>> sending
>> IPIs -- try the simpler version first and move to the more complex _EX
>> version only
>> if necessary.
> I am not sure if this would work correctly. When I was developing the IPI 
> enlightenment, 
> what I remember was that the guest is expected to use the API recommended by 
> the Hypervisor.
>

I was under the same impression when I implemented PV TLB flush. Turns
out HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED is a misnomer, or at least
Windows treats it as HV_X64_EX_PROCESSOR_MASKS_AVAILABLE, using the EX
hypercalls only when needed.

My guess would be that the situation with IPI is the same. In any case I
can try to implement Hyper-V style PV IPIs for Windows in KVM and we'll
see how Windows uses these hypercalls :-)

-- 
  Vitaly


Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-20 Thread Vitaly Kuznetsov
"Michael Kelley (EOSG)"  writes:

>> -Original Message-
>> From: linux-kernel-ow...@vger.kernel.org 
>>  On Behalf
>> Of Vitaly Kuznetsov
>> Sent: Friday, June 15, 2018 9:30 AM
>> To: x...@kernel.org
>> Cc: de...@linuxdriverproject.org; linux-ker...@vger.kernel.org; KY Srinivasan
>> ; Haiyang Zhang ; Stephen 
>> Hemminger
>> ; Thomas Gleixner ; Ingo Molnar
>> ; H. Peter Anvin ; Tianyu Lan
>> 
>> Subject: [PATCH] x86/hyper-v: use cheaper 
>> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
>> hypercalls when possible
>> 
>> While working on Hyper-V style PV TLB flush support in KVM I noticed that
>> real Windows guests use TLB flush hypercall in a somewhat smarter way: when
>> the flush needs to be performed on a subset of first 64 vCPUs or on all
>> present vCPUs Windows avoids more expensive hypercalls which support
>> sparse CPU sets and uses their 'cheap' counterparts. This means that
>> HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX
>> hypercalls (which support sparse CPU sets) are "available", not
>> "recommended". This makes sense as they are actually harder to parse.
>> 
>> Nothing stops us from being equally 'smart' in Linux too. Switch to
>> doing cheaper hypercalls whenever possible.
>> 
>> Signed-off-by: Vitaly Kuznetsov 
>> ---
>
> This is a good idea.  We should probably do the same with the hypercalls for 
> sending
> IPIs -- try the simpler version first and move to the more complex _EX 
> version only
> if necessary.
>
> A complication:  We've recently found a problem with the code for doing IPI
> hypercalls, and the bug affects the TLB flush code as well.  As secondary CPUs
> are started, there's a window of time where the hv_vp_index entry for a
> secondary CPU is uninitialized.  We are seeing IPIs happening in that window, 
> and
> the IPI hypercall code uses the uninitialized hv_vp_index entry.   Same thing 
> could
> happen with the TLB flush hypercall code.  I didn't actually see any 
> occurrences of
> the TLB case in my tracing, but we should fix it anyway in case a TLB flush 
> gets
> added at some point in the future.
>
> KY has a patch coming.  In the patch, hv_cpu_number_to_vp_number()
> and cpumask_to_vpset() can both return U32_MAX if they encounter an
> uninitialized hv_vp_index entry, and the code needs to be able to bail out to
> the native functions for that particular IPI or TLB flush operation.  Once the
> initialization of secondary CPUs is complete, the uninitialized situation 
> won't
> happen again, and the hypercall path will always be used.

Sure,

with TLB flush we can always fall back to doing it natively (by sending
IPIs).

>
> We'll need to coordinate on these patches.  Be aware that the IPI flavor of 
> the
> bug is currently causing random failures when booting 4.18 RC1 on Hyper-V VMs
> with large vCPU counts.

Thanks for the heads up! This particular patch is just an optimization
so there's no rush, IPI fix is definitely more important.

>
> Reviewed-by:  Michael Kelley 

Thanks!

-- 
  Vitaly


Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-19 Thread Vitaly Kuznetsov
Thomas Gleixner  writes:

> On Tue, 19 Jun 2018, Vitaly Kuznetsov wrote:
>> Thomas Gleixner  writes:
>> 
>> > On Fri, 15 Jun 2018, Vitaly Kuznetsov wrote:
>> >>   * Fills in gva_list starting from offset. Returns the number of items 
>> >> added.
>> >> @@ -93,10 +95,19 @@ static void hyperv_flush_tlb_others(const struct 
>> >> cpumask *cpus,
>> >>   if (cpumask_equal(cpus, cpu_present_mask)) {
>> >>   flush->flags |= HV_FLUSH_ALL_PROCESSORS;
>> >>   } else {
>> >> + /*
>> >> +  * It is highly likely that VP ids are in ascending order
>> >> +  * matching Linux CPU ids; Check VP index for the highest CPU
>> >> +  * in the supplied set to see if EX hypercall is required.
>> >> +  * This is just a best guess but should work most of the time.
>> >
>> > TLB flushing based on 'best guess' and 'should work most of the time' is
>> > not a brilliant approach.
>> >
>> 
>> Oh no no no, that's not what I meant :-)
>> 
>> We have the following problem: from the supplied CPU set we need to
>> figure out if we can get away with HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,
>> SPACE} hypercalls which are cheaper or if we need to use more expensive
>> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE}_EX ones. The dividing line is
>> the highest VP_INDEX of the supplied CPU set: in case it is < 64 cheaper
>> hypercalls are OK. Now how do we check that? In the patch I have the
>> following approach:
>> 1) Check VP number for the highest CPU in the supplied set. In case it
>> is >= 64 we for sure need more expensive hypercalls. This is the "guess"
>> which works most of the time because Linux CPU ids usually match
>> VP_INDEXes.
>> 
>> 2) In case the answer to the previous question was negative we start
>> preparing input for the cheaper hypercall. However, if while walking the
>> CPU set we meet a CPU with VP_INDEX higher than 64 we'll discard the
>> prepared input and switch to the more expensive hypercall.
>> 
>> Said that the 'guess' here is just an optimization to avoid walking the
>> whole CPU set when we find the required answer quickly by looking at the
>> highest bit. This will help big systems with hundreds of CPUs.
>
> Care to fix the comment to avoid the offending words?
>

Sure, will re-word in v2.

-- 
  Vitaly


Re: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-19 Thread Vitaly Kuznetsov
Thomas Gleixner  writes:

> On Fri, 15 Jun 2018, Vitaly Kuznetsov wrote:
>>   * Fills in gva_list starting from offset. Returns the number of items 
>> added.
>> @@ -93,10 +95,19 @@ static void hyperv_flush_tlb_others(const struct cpumask 
>> *cpus,
>>  if (cpumask_equal(cpus, cpu_present_mask)) {
>>  flush->flags |= HV_FLUSH_ALL_PROCESSORS;
>>  } else {
>> +/*
>> + * It is highly likely that VP ids are in ascending order
>> + * matching Linux CPU ids; Check VP index for the highest CPU
>> + * in the supplied set to see if EX hypercall is required.
>> + * This is just a best guess but should work most of the time.
>
> TLB flushing based on 'best guess' and 'should work most of the time' is
> not a brilliant approach.
>

Oh no no no, that's not what I meant :-)

We have the following problem: from the supplied CPU set we need to
figure out if we can get away with HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,
SPACE} hypercalls which are cheaper or if we need to use more expensive
HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE}_EX ones. The dividing line is
the highest VP_INDEX of the supplied CPU set: in case it is < 64 cheaper
hypercalls are OK. Now how do we check that? In the patch I have the
following approach:
1) Check VP number for the highest CPU in the supplied set. In case it
is >= 64 we for sure need more expensive hypercalls. This is the "guess"
which works most of the time because Linux CPU ids usually match
VP_INDEXes.

2) In case the answer to the previous question was negative we start
preparing input for the cheaper hypercall. However, if while walking the
CPU set we meet a CPU with VP_INDEX higher than 64 we'll discard the
prepared input and switch to the more expensive hypercall.

Said that the 'guess' here is just an optimization to avoid walking the
whole CPU set when we find the required answer quickly by looking at the
highest bit. This will help big systems with hundreds of CPUs.

-- 
  Vitaly


[PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST, SPACE} hypercalls when possible

2018-06-15 Thread Vitaly Kuznetsov
While working on Hyper-V style PV TLB flush support in KVM I noticed that
real Windows guests use TLB flush hypercall in a somewhat smarter way: when
the flush needs to be performed on a subset of first 64 vCPUs or on all
present vCPUs Windows avoids more expensive hypercalls which support
sparse CPU sets and uses their 'cheap' counterparts. This means that
HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX
hypercalls (which support sparse CPU sets) are "available", not
"recommended". This makes sense as they are actually harder to parse.

Nothing stops us from being equally 'smart' in Linux too. Switch to
doing cheaper hypercalls whenever possible.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/mmu.c | 68 ---
 1 file changed, 27 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index de27615c51ea..7519a44b462e 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -16,6 +16,8 @@
 /* Each gva in gva_list encodes up to 4096 pages to flush */
 #define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
 
+static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info);
 
 /*
  * Fills in gva_list starting from offset. Returns the number of items added.
@@ -93,10 +95,19 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
if (cpumask_equal(cpus, cpu_present_mask)) {
flush->flags |= HV_FLUSH_ALL_PROCESSORS;
} else {
+   /*
+* It is highly likely that VP ids are in ascending order
+* matching Linux CPU ids; Check VP index for the highest CPU
+* in the supplied set to see if EX hypercall is required.
+* This is just a best guess but should work most of the time.
+*/
+   if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
+   goto do_ex_hypercall;
+
for_each_cpu(cpu, cpus) {
vcpu = hv_cpu_number_to_vp_number(cpu);
if (vcpu >= 64)
-   goto do_native;
+   goto do_ex_hypercall;
 
__set_bit(vcpu, (unsigned long *)
  &flush->processor_mask);
@@ -123,7 +134,12 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
status = hv_do_rep_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST,
 gva_n, 0, flush, NULL);
}
+   goto check_status;
+
+do_ex_hypercall:
+   status = hyperv_flush_tlb_others_ex(cpus, info);
 
+check_status:
local_irq_restore(flags);
 
if (!(status & HV_HYPERCALL_RESULT_MASK))
@@ -132,35 +148,22 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
native_flush_tlb_others(cpus, info);
 }
 
-static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
-  const struct flush_tlb_info *info)
+static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
 {
int nr_bank = 0, max_gvas, gva_n;
struct hv_tlb_flush_ex **flush_pcpu;
struct hv_tlb_flush_ex *flush;
-   u64 status = U64_MAX;
-   unsigned long flags;
-
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   u64 status;
 
-   if (!hv_hypercall_pg)
-   goto do_native;
-
-   if (cpumask_empty(cpus))
-   return;
-
-   local_irq_save(flags);
+   if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
+   return U64_MAX;
 
flush_pcpu = (struct hv_tlb_flush_ex **)
 this_cpu_ptr(hyperv_pcpu_input_arg);
 
flush = *flush_pcpu;
 
-   if (unlikely(!flush)) {
-   local_irq_restore(flags);
-   goto do_native;
-   }
-
if (info->mm) {
/*
 * AddressSpace argument must match the CR3 with PCID bits
@@ -176,15 +179,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
 
flush->hv_vp_set.valid_bank_mask = 0;
 
-   if (!cpumask_equal(cpus, cpu_present_mask)) {
-   flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
-   nr_bank = cpumask_to_vpset(&(flush->hv_vp_set), cpus);
-   }
-
-   if (!nr_bank) {
-   flush->hv_vp_set.format = HV_GENERIC_SET_ALL;
-   flush->flags |= HV_FLUSH_ALL_PROCESSORS;
-   }
+   flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
+   nr_bank = cpumask_to_vpset(&(flush->hv_vp_set), cpus);
 
/*
 * We can flush not more than max_gvas with one hypercall. Flush the
@@ -213,12 +209,7 @@ static void hyperv_flush_

Re: [PATCH v2] tools: hv: update lsvmbus to be compatible with python3

2018-05-23 Thread Vitaly Kuznetsov
Olaf Hering  writes:

> Python 3 changed the way 'print' works.
> Adjust the code to a syntax that is understood by both Python 2 and Python 3.
>
> Signed-off-by: Olaf Hering 

What are the odds that we decide to send the same patch on the same day?
:-)

http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-May/121144.html

-- 
  Vitaly


[PATCH] Tools: hv: vss: fix loop device detection

2018-05-22 Thread Vitaly Kuznetsov
Commit ea81fdf0981d ("Tools: hv: vss: Skip freezing filesystems backed by
loop") added a skip for filesystems backed by a loop device. However, it
seems the detection of such cases is incomplete.

It was found that with 'devicemapper' storage driver docker creates the
following chain:

NAME                          MAJ:MIN
loop0                           7:0
└─docker-8:4-8473394-pool   253:0
  └─docker-8:4-8473394-eac...   253:1

so when we're looking at the mounted device we see major '253' and not '7'.

Solve the issue by walking /sys/dev/block/*/slaves chain and checking if
there's a loop device somewhere.

Other than that, don't skip mountpoints silently when stat() fails. In
case e.g. SELinux is failing stat() we don't want to skip freezing
everything without letting the user know about the failure.

Fixes: ea81fdf0981d ("Tools: hv: vss: Skip freezing filesystems backed by loop")
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 65 +---
 1 file changed, 61 insertions(+), 4 deletions(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 34031a297f02..b13300172762 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -36,6 +36,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /* Don't use syslog() in the function since that can cause write to disk */
 static int vss_do_freeze(char *dir, unsigned int cmd)
@@ -68,6 +70,55 @@ static int vss_do_freeze(char *dir, unsigned int cmd)
return !!ret;
 }
 
+static bool is_dev_loop(const char *blkname)
+{
+   char *buffer;
+   DIR *dir;
+   struct dirent *entry;
+   bool ret = false;
+
+   buffer = malloc(PATH_MAX);
+   if (!buffer) {
+   syslog(LOG_ERR, "Can't allocate memory!");
+   exit(1);
+   }
+
+   snprintf(buffer, PATH_MAX, "%s/loop", blkname);
+   if (!access(buffer, R_OK | X_OK)) {
+   ret = true;
+   goto free_buffer;
+   } else if (errno != ENOENT) {
+   syslog(LOG_ERR, "Can't access: %s; error:%d %s!",
+  buffer, errno, strerror(errno));
+   }
+
+   snprintf(buffer, PATH_MAX, "%s/slaves", blkname);
+   dir = opendir(buffer);
+   if (!dir) {
+   if (errno != ENOENT)
+   syslog(LOG_ERR, "Can't opendir: %s; error:%d %s!",
+  buffer, errno, strerror(errno));
+   goto free_buffer;
+   }
+
+   while ((entry = readdir(dir)) != NULL) {
+   if (strcmp(entry->d_name, ".") == 0 ||
+   strcmp(entry->d_name, "..") == 0)
+   continue;
+
+   snprintf(buffer, PATH_MAX, "%s/slaves/%s", blkname,
+entry->d_name);
+   if (is_dev_loop(buffer)) {
+   ret = true;
+   break;
+   }
+   }
+   closedir(dir);
+free_buffer:
+   free(buffer);
+   return ret;
+}
+
 static int vss_operate(int operation)
 {
char match[] = "/dev/";
@@ -75,6 +126,7 @@ static int vss_operate(int operation)
struct mntent *ent;
struct stat sb;
char errdir[1024] = {0};
+   char blkdir[23]; /* /sys/dev/block/XXX:XXX */
unsigned int cmd;
int error = 0, root_seen = 0, save_errno = 0;
 
@@ -96,10 +148,15 @@ static int vss_operate(int operation)
while ((ent = getmntent(mounts))) {
if (strncmp(ent->mnt_fsname, match, strlen(match)))
continue;
-   if (stat(ent->mnt_fsname, &sb) == -1)
-   continue;
-   if (S_ISBLK(sb.st_mode) && major(sb.st_rdev) == LOOP_MAJOR)
-   continue;
+   if (stat(ent->mnt_fsname, &sb)) {
+   syslog(LOG_ERR, "Can't stat: %s; error:%d %s!",
+  ent->mnt_fsname, errno, strerror(errno));
+   } else {
+   sprintf(blkdir, "/sys/dev/block/%d:%d",
+   major(sb.st_rdev), minor(sb.st_rdev));
+   if (is_dev_loop(blkdir))
+   continue;
+   }
if (hasmntopt(ent, MNTOPT_RO) != NULL)
continue;
if (strcmp(ent->mnt_type, "vfat") == 0)
-- 
2.14.3



[PATCH] tools: hv: lsvmbus: convert to Python3

2018-05-22 Thread Vitaly Kuznetsov
Use the '2to3' tool to make lsvmbus work with both Python 2 and Python 3.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/lsvmbus | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/hv/lsvmbus b/tools/hv/lsvmbus
index 353e56768df8..c184aac33d5c 100644
--- a/tools/hv/lsvmbus
+++ b/tools/hv/lsvmbus
@@ -17,7 +17,7 @@ if options.verbose is not None:
 
 vmbus_sys_path = '/sys/bus/vmbus/devices'
 if not os.path.isdir(vmbus_sys_path):
-   print "%s doesn't exist: exiting..." % vmbus_sys_path
+   print("%s doesn't exist: exiting..." % vmbus_sys_path)
exit(-1)
 
 vmbus_dev_dict = {
@@ -93,11 +93,11 @@ format2 = '%2s: Class_ID = %s - %s\n\tDevice_ID = %s\n\tSysfs path: %s\n%s'
 
 for d in vmbus_dev_list:
if verbose == 0:
-   print ('VMBUS ID ' + format0) % (d.vmbus_id, d.dev_desc)
+   print(('VMBUS ID ' + format0) % (d.vmbus_id, d.dev_desc))
elif verbose == 1:
-   print ('VMBUS ID ' + format1) % \
-   (d.vmbus_id, d.class_id, d.dev_desc, d.chn_vp_mapping)
+   print(('VMBUS ID ' + format1) % \
+   (d.vmbus_id, d.class_id, d.dev_desc, d.chn_vp_mapping))
else:
-   print ('VMBUS ID ' + format2) % \
+   print(('VMBUS ID ' + format2) % \
(d.vmbus_id, d.class_id, d.dev_desc, \
-   d.device_id, d.sysfs_path, d.chn_vp_mapping)
+   d.device_id, d.sysfs_path, d.chn_vp_mapping))
-- 
2.14.3



Re: [PATCH net-next, 2/2] hv_netvsc: Add range checking for rx packet offset and length

2018-03-23 Thread Vitaly Kuznetsov
Haiyang Zhang  writes:

> From: Haiyang Zhang 
>
> This patch adds range checking for rx packet offset and length.
> It may only happen if there is a host side bug.
>
> Signed-off-by: Haiyang Zhang 
> ---
>  drivers/net/hyperv/hyperv_net.h |  1 +
>  drivers/net/hyperv/netvsc.c | 17 +++--
>  2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 0db3bd1ea06f..49c05ac894e5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -793,6 +793,7 @@ struct netvsc_device {
>
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + u32 recv_buf_size; /* allocated bytes */
>   u32 recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
>   u32 recv_section_size;
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 1ddb2c39b6e4..a6700d65f206 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -289,6 +289,8 @@ static int netvsc_init_buf(struct hv_device *device,
>   goto cleanup;
>   }
>
> + net_device->recv_buf_size = buf_size;
> +
>   /*
>* Establish the gpadl handle for this buffer on this
>* channel.  Note: This call uses the vmbus connection rather
> @@ -1095,11 +1097,22 @@ static int netvsc_receive(struct net_device *ndev,
>
>   /* Each range represents 1 RNDIS pkt that contains 1 ethernet frame */
>   for (i = 0; i < count; i++) {
> - void *data = recv_buf
> - + vmxferpage_packet->ranges[i].byte_offset;
> + u32 offset = vmxferpage_packet->ranges[i].byte_offset;
>   u32 buflen = vmxferpage_packet->ranges[i].byte_count;
> + void *data;
>   int ret;
>
> + if (unlikely(offset + buflen > net_device->recv_buf_size)) {
> + status = NVSP_STAT_FAIL;
> + netif_err(net_device_ctx, rx_err, ndev,
> +   "Packet offset:%u + len:%u too big\n",
> +   offset, buflen);

This shouldn't happen, of course, but I'd rather ratelimit this error or
even use something like netdev_WARN_ONCE().

> +
> + continue;
> + }
> +
> + data = recv_buf + offset;
> +
>   trace_rndis_recv(ndev, q_idx, data);
>
>   /* Pass it to the upper layer */

-- 
  Vitaly
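The range check Haiyang adds above can be exercised in isolation. This is a hypothetical Python model (names invented) of the per-range validation netvsc_receive() now performs before touching the host-shared receive buffer:

```python
RECV_BUF_SIZE = 16 * 1024 * 1024  # example buffer size, not netvsc's actual default

def range_is_valid(offset, byte_count, buf_size=RECV_BUF_SIZE):
    """Mirror the kernel check: reject any xfer-page range that would
    read past the end of the receive buffer (offset + buflen > size)."""
    return offset + byte_count <= buf_size

# A well-formed range ending exactly at the buffer boundary is accepted...
ok = range_is_valid(RECV_BUF_SIZE - 1500, 1500)
# ...while a range overflowing by a single byte is dropped.
bad = range_is_valid(RECV_BUF_SIZE - 1500, 1501)
```

The kernel version continues to the next range on failure rather than aborting, since only the offending packet is untrustworthy.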


[PATCH RESEND 0/4] hv_balloon: fixes for num_pages_onlined accounting and misc improvements

2018-02-07 Thread Vitaly Kuznetsov
I just noticed that this series got lost; the initial submission was on
Nov 8 but nothing happened after that. Resending.

Original description:

While doing routine code review I noticed that commit 6df8d9aaf3af
("Drivers: hv: balloon: Correctly update onlined page count") introduced
an issue with num_pages_onlined accounting on memory offlining. A deeper
look showed that the accounting was always buggy. This is fixed in PATCH3.
PATCHes 1 and 2 are preparatory cleanups, PATCH4 adds a tracepoint to
post_status so it's now possible to see what's being sent to the host and
where the data comes from.

Vitaly Kuznetsov (4):
  hv_balloon: fix printk loglevel
  hv_balloon: simplify hv_online_page()/hv_page_online_one()
  hv_balloon: fix bugs in num_pages_onlined accounting
  hv_balloon: trace post_status

 drivers/hv/Makefile   |   1 +
 drivers/hv/hv_balloon.c   | 121 +-
 drivers/hv/hv_trace_balloon.h |  48 +
 3 files changed, 132 insertions(+), 38 deletions(-)
 create mode 100644 drivers/hv/hv_trace_balloon.h

-- 
2.14.3



[PATCH RESEND 3/4] hv_balloon: fix bugs in num_pages_onlined accounting

2018-02-07 Thread Vitaly Kuznetsov
Our num_pages_onlined accounting is buggy:
1) In case we're offlining a memory block which was present at boot (e.g.
   when there was no hotplug at all) we subtract 32k from 0 and as
   num_pages_onlined is unsigned get a very big positive number.

2) Commit 6df8d9aaf3af ("Drivers: hv: balloon: Correctly update onlined
   page count") made num_pages_onlined counter accurate on onlining but
   totally incorrect on offlining for partly populated regions: no matter
   how many pages were onlined and what was actually added to
   num_pages_onlined counter we always subtract the full region (32k) so
   again, num_pages_onlined can wrap around zero. By onlining/offlining
   the same partly populated region multiple times we can make the
   situation worse.

Solve these issues by doing accurate accounting on offlining: walk HAS
list, check for covered range and gaps.
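The offlining walk described above — find the hot-add regions covering each pfn and count only backed pages, skipping gaps — can be sketched as a simplified userspace model of hv_page_offline_check() (plain dicts and tuples stand in for the kernel's HAS and gap structures):

```python
def pfn_is_backed(has, pfn):
    """Backed iff inside the covered range and not inside any gap."""
    if pfn < has['covered_start'] or pfn >= has['covered_end']:
        return False
    return not any(g[0] <= pfn < g[1] for g in has['gaps'])

def page_offline_check(ha_list, start_pfn, nr_pages):
    """Count how many pfns in [start_pfn, start_pfn + nr_pages) are backed
    by some hot-add region; pfns outside every region (e.g. memory present
    at boot) are not counted at all."""
    count = 0
    for pfn in range(start_pfn, start_pfn + nr_pages):
        for has in ha_list:
            if has['start'] <= pfn < has['end']:
                if pfn_is_backed(has, pfn):
                    count += 1
                break
    return count

# One region for pfns [100, 300), populated up to 200, with a gap at [150, 160):
has = {'start': 100, 'end': 300,
       'covered_start': 100, 'covered_end': 200, 'gaps': [(150, 160)]}
onlined = page_offline_check([has], 100, 100)  # 100 pages minus the 10-page gap
```

Subtracting this count, instead of the full block size, is what keeps num_pages_onlined from wrapping below zero on partly populated regions.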

Fixes: 6df8d9aaf3af ("Drivers: hv: balloon: Correctly update onlined page count")
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/hv_balloon.c | 82 +
 1 file changed, 69 insertions(+), 13 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 5b8e1ad1bcfe..7514a32d0de4 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -576,11 +576,65 @@ static struct hv_dynmem_device dm_device;
 static void post_status(struct hv_dynmem_device *dm);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static inline bool has_pfn_is_backed(struct hv_hotadd_state *has,
+unsigned long pfn)
+{
+   struct hv_hotadd_gap *gap;
+
+   /* The page is not backed. */
+   if ((pfn < has->covered_start_pfn) || (pfn >= has->covered_end_pfn))
+   return false;
+
+   /* Check for gaps. */
+   list_for_each_entry(gap, &has->gap_list, list) {
+   if ((pfn >= gap->start_pfn) && (pfn < gap->end_pfn))
+   return false;
+   }
+
+   return true;
+}
+
+static unsigned long hv_page_offline_check(unsigned long start_pfn,
+  unsigned long nr_pages)
+{
+   unsigned long pfn = start_pfn, count = 0;
+   struct hv_hotadd_state *has;
+   bool found;
+
+   while (pfn < start_pfn + nr_pages) {
+   /*
+* Search for HAS which covers the pfn and when we find one
+* count how many consequitive PFNs are covered.
+*/
+   found = false;
+   list_for_each_entry(has, &dm_device.ha_region_list, list) {
+   while ((pfn >= has->start_pfn) &&
+  (pfn < has->end_pfn) &&
+  (pfn < start_pfn + nr_pages)) {
+   found = true;
+   if (has_pfn_is_backed(has, pfn))
+   count++;
+   pfn++;
+   }
+   }
+
+   /*
+* This PFN is not in any HAS (e.g. we're offlining a region
+* which was present at boot), no need to account for it. Go
+* to the next one.
+*/
+   if (!found)
+   pfn++;
+   }
+
+   return count;
+}
+
 static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
  void *v)
 {
struct memory_notify *mem = (struct memory_notify *)v;
-   unsigned long flags;
+   unsigned long flags, pfn_count;
 
switch (val) {
case MEM_ONLINE:
@@ -593,7 +647,19 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 
case MEM_OFFLINE:
spin_lock_irqsave(&dm_device.ha_lock, flags);
-   dm_device.num_pages_onlined -= mem->nr_pages;
+   pfn_count = hv_page_offline_check(mem->start_pfn,
+ mem->nr_pages);
+   if (pfn_count <= dm_device.num_pages_onlined) {
+   dm_device.num_pages_onlined -= pfn_count;
+   } else {
+   /*
+* We're offlining more pages than we managed to online.
+* This is unexpected. In any case don't let
+* num_pages_onlined wrap around zero.
+*/
+   WARN_ON_ONCE(1);
+   dm_device.num_pages_onlined = 0;
+   }
spin_unlock_irqrestore(&dm_device.ha_lock, flags);
break;
case MEM_GOING_ONLINE:
@@ -612,19 +678,9 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *p

[PATCH RESEND 1/4] hv_balloon: fix printk loglevel

2018-02-07 Thread Vitaly Kuznetsov
We have a mix of different ideas of which loglevel should be used. Unify
on the following:
- pr_info() for normal operation
- pr_warn() for 'strange' host behavior
- pr_err() for all errors.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/hv_balloon.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index db0e6652d7ef..1aece72da9ba 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -691,7 +691,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
(HA_CHUNK << PAGE_SHIFT));
 
if (ret) {
-   pr_warn("hot_add memory failed error is %d\n", ret);
+   pr_err("hot_add memory failed error is %d\n", ret);
if (ret == -EEXIST) {
/*
 * This error indicates that the error
@@ -1014,7 +1014,7 @@ static void hot_add_req(struct work_struct *dummy)
resp.result = 0;
 
if (!do_hot_add || (resp.page_count == 0))
-   pr_info("Memory hot add failed\n");
+   pr_err("Memory hot add failed\n");
 
dm->state = DM_INITIALIZED;
resp.hdr.trans_id = atomic_inc_return(&trans_id);
@@ -1041,7 +1041,7 @@ static void process_info(struct hv_dynmem_device *dm, struct dm_info_msg *msg)
 
break;
default:
-   pr_info("Received Unknown type: %d\n", info_hdr->type);
+   pr_warn("Received Unknown type: %d\n", info_hdr->type);
}
 }
 
@@ -1290,7 +1290,7 @@ static void balloon_up(struct work_struct *dummy)
/*
 * Free up the memory we allocatted.
 */
-   pr_info("Balloon response failed\n");
+   pr_err("Balloon response failed\n");
 
for (i = 0; i < bl_resp->range_count; i++)
free_balloon_pages(&dm_device,
@@ -1421,7 +1421,7 @@ static void cap_resp(struct hv_dynmem_device *dm,
struct dm_capabilities_resp_msg *cap_resp)
 {
if (!cap_resp->is_accepted) {
-   pr_info("Capabilities not accepted by host\n");
+   pr_err("Capabilities not accepted by host\n");
dm->state = DM_INIT_ERROR;
}
complete(&dm->host_event);
@@ -1508,7 +1508,7 @@ static void balloon_onchannelcallback(void *context)
break;
 
default:
-   pr_err("Unhandled message: type: %d\n", dm_hdr->type);
+   pr_warn("Unhandled message: type: %d\n", dm_hdr->type);
 
}
}
-- 
2.14.3



[PATCH RESEND 2/4] hv_balloon: simplify hv_online_page()/hv_page_online_one()

2018-02-07 Thread Vitaly Kuznetsov
Instead of doing pfn_to_page() and continuously casting page to unsigned
long just cache the pfn of the page with page_to_pfn().

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/hv_balloon.c | 27 +--
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 1aece72da9ba..5b8e1ad1bcfe 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -612,28 +612,17 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   unsigned long cur_start_pgp;
-   unsigned long cur_end_pgp;
struct hv_hotadd_gap *gap;
-
-   cur_start_pgp = (unsigned long)pfn_to_page(has->covered_start_pfn);
-   cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn);
+   unsigned long pfn = page_to_pfn(pg);
 
/* The page is not backed. */
-   if (((unsigned long)pg < cur_start_pgp) ||
-   ((unsigned long)pg >= cur_end_pgp))
+   if ((pfn < has->covered_start_pfn) || (pfn >= has->covered_end_pfn))
return;
 
/* Check for gaps. */
list_for_each_entry(gap, &has->gap_list, list) {
-   cur_start_pgp = (unsigned long)
-   pfn_to_page(gap->start_pfn);
-   cur_end_pgp = (unsigned long)
-   pfn_to_page(gap->end_pfn);
-   if (((unsigned long)pg >= cur_start_pgp) &&
-   ((unsigned long)pg < cur_end_pgp)) {
+   if ((pfn >= gap->start_pfn) && (pfn < gap->end_pfn))
return;
-   }
}
 
/* This frame is currently backed; online the page. */
@@ -726,19 +715,13 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 static void hv_online_page(struct page *pg)
 {
struct hv_hotadd_state *has;
-   unsigned long cur_start_pgp;
-   unsigned long cur_end_pgp;
unsigned long flags;
+   unsigned long pfn = page_to_pfn(pg);
 
spin_lock_irqsave(&dm_device.ha_lock, flags);
list_for_each_entry(has, &dm_device.ha_region_list, list) {
-   cur_start_pgp = (unsigned long)
-   pfn_to_page(has->start_pfn);
-   cur_end_pgp = (unsigned long)pfn_to_page(has->end_pfn);
-
/* The page belongs to a different HAS. */
-   if (((unsigned long)pg < cur_start_pgp) ||
-   ((unsigned long)pg >= cur_end_pgp))
+   if ((pfn < has->start_pfn) || (pfn >= has->end_pfn))
continue;
 
hv_page_online_one(has, pg);
-- 
2.14.3



[PATCH RESEND 4/4] hv_balloon: trace post_status

2018-02-07 Thread Vitaly Kuznetsov
Hyper-V balloon driver makes non-trivial calculations to convert Linux's
representation of free/used memory to what Hyper-V host expects to see. Add
a tracepoint to see what's being sent and where the data comes from.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/Makefile   |  1 +
 drivers/hv/hv_balloon.c   |  6 ++
 drivers/hv/hv_trace_balloon.h | 48 +++
 3 files changed, 55 insertions(+)
 create mode 100644 drivers/hv/hv_trace_balloon.h

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 14c22786b519..a1eec7177c2d 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_HYPERV_UTILS)  += hv_utils.o
 obj-$(CONFIG_HYPERV_BALLOON)   += hv_balloon.o
 
 CFLAGS_hv_trace.o = -I$(src)
+CFLAGS_hv_balloon.o = -I$(src)
 
 hv_vmbus-y := vmbus_drv.o \
 hv.o connection.o channel.o \
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 7514a32d0de4..b3e9f13f8bc3 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -34,6 +34,9 @@
 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include "hv_trace_balloon.h"
+
 /*
  * We begin with definitions supporting the Dynamic Memory protocol
  * with the host.
@@ -1159,6 +1162,9 @@ static void post_status(struct hv_dynmem_device *dm)
 dm->num_pages_added - dm->num_pages_onlined : 0) +
compute_balloon_floor();
 
+   trace_balloon_status(status.num_avail, status.num_committed,
+vm_memory_committed(), dm->num_pages_ballooned,
+dm->num_pages_added, dm->num_pages_onlined);
/*
 * If our transaction ID is no longer current, just don't
 * send the status. This can happen if we were interrupted
diff --git a/drivers/hv/hv_trace_balloon.h b/drivers/hv/hv_trace_balloon.h
new file mode 100644
index ..93082888aec3
--- /dev/null
+++ b/drivers/hv/hv_trace_balloon.h
@@ -0,0 +1,48 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hyperv
+
+#if !defined(_HV_TRACE_BALLOON_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _HV_TRACE_BALLOON_H
+
+#include 
+
+TRACE_EVENT(balloon_status,
+   TP_PROTO(u64 available, u64 committed,
+unsigned long vm_memory_committed,
+unsigned long pages_ballooned,
+unsigned long pages_added,
+unsigned long pages_onlined),
+   TP_ARGS(available, committed, vm_memory_committed,
+   pages_ballooned, pages_added, pages_onlined),
+   TP_STRUCT__entry(
+   __field(u64, available)
+   __field(u64, committed)
+   __field(unsigned long, vm_memory_committed)
+   __field(unsigned long, pages_ballooned)
+   __field(unsigned long, pages_added)
+   __field(unsigned long, pages_onlined)
+   ),
+   TP_fast_assign(
+   __entry->available = available;
+   __entry->committed = committed;
+   __entry->vm_memory_committed = vm_memory_committed;
+   __entry->pages_ballooned = pages_ballooned;
+   __entry->pages_added = pages_added;
+   __entry->pages_onlined = pages_onlined;
+   ),
+   TP_printk("available %lld, committed %lld; vm_memory_committed %ld;"
+ " pages_ballooned %ld, pages_added %ld, pages_onlined %ld",
+ __entry->available, __entry->committed,
+ __entry->vm_memory_committed, __entry->pages_ballooned,
+ __entry->pages_added, __entry->pages_onlined
+   )
+   );
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE hv_trace_balloon
+#endif /* _HV_TRACE_BALLOON_H */
+
+/* This part must be outside protection */
+#include 
-- 
2.14.3



[PATCH v4 4/7] x86/hyper-v: redirect reenlightment notifications on CPU offlining

2018-01-24 Thread Vitaly Kuznetsov
It is very unlikely for CPUs to get offlined when we run on Hyper-V as
we have a protection in the vmbus module which prevents it when we have any
VMBus devices assigned. This, however, may change in the future if an option
to reassign an already active channel is added. It is also possible to run
without any Hyper-V devices or to have a CPU with no assigned channels.

Reassign reenlightenment notifications to some other active CPU when
the CPU which is assigned to get them goes offline.
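The reassignment step is essentially cpumask_any_but(): pick any online CPU other than the one going away, then rewrite the MSR's target VP. A Python model of that selection (the rdmsrl/wrmsrl round trip itself is omitted; dict fields stand in for the MSR bitfields):

```python
def any_cpu_but(online_cpus, dying_cpu):
    """Return some online CPU other than dying_cpu, mirroring
    cpumask_any_but(cpu_online_mask, cpu); None if no candidate exists."""
    for cpu in sorted(online_cpus):
        if cpu != dying_cpu:
            return cpu
    return None

def maybe_retarget(re_ctrl, vp_index, online_cpus, dying_cpu):
    """If notifications target the dying CPU's VP, move them to another
    online CPU's VP, as hv_cpu_die() does."""
    if re_ctrl['target_vp'] == vp_index[dying_cpu]:
        new_cpu = any_cpu_but(online_cpus, dying_cpu)
        re_ctrl['target_vp'] = vp_index[new_cpu]
    return re_ctrl

ctrl = {'target_vp': 3}
vp_index = {0: 0, 1: 1, 2: 2, 3: 3}  # CPU -> Hyper-V VP index
ctrl = maybe_retarget(ctrl, vp_index, {0, 1, 2, 3}, dying_cpu=3)
```

Note the kernel picks by cpumask iteration order, so "which" other CPU is chosen is not significant, only that it is online.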

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/hyperv/hv_init.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 712ac40081f7..e4377e2f2a10 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -191,6 +191,26 @@ void clear_hv_tscchange_cb(void)
 }
 EXPORT_SYMBOL_GPL(clear_hv_tscchange_cb);
 
+static int hv_cpu_die(unsigned int cpu)
+{
+   struct hv_reenlightenment_control re_ctrl;
+   unsigned int new_cpu;
+
+   if (hv_reenlightenment_cb == NULL)
+   return 0;
+
+   rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+   if (re_ctrl.target_vp == hv_vp_index[cpu]) {
+   /* Reassign to some other online CPU */
+   new_cpu = cpumask_any_but(cpu_online_mask, cpu);
+
+   re_ctrl.target_vp = hv_vp_index[new_cpu];
+   wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+   }
+
+   return 0;
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
@@ -220,7 +240,7 @@ void hyperv_init(void)
return;
 
if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
- hv_cpu_init, NULL) < 0)
+ hv_cpu_init, hv_cpu_die) < 0)
goto free_vp_index;
 
/*
-- 
2.14.3



[PATCH v4 7/7] x86/kvm: support Hyper-V reenlightenment

2018-01-24 Thread Vitaly Kuznetsov
When we run nested KVM on Hyper-V guests we need to update masterclocks for
all guests when L1 migrates to a host with different TSC frequency.
Implement the procedure in the following way:
- Pause all guests.
- Tell our host (Hyper-V) to stop emulating TSC accesses.
- Update our gtod copy, recompute clocks.
- Unpause all guests.

This is somewhat similar to cpufreq but we have two important differences:
we can only disable TSC emulation globally (on all CPUs) and we don't know
the new TSC frequency until we turn the emulation off, so we can't
'prepare' ourselves for the event.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
v3 -> v4:
  add static to kvm_hyperv_tsc_notifier() [kbuild robot]
---
 arch/x86/kvm/x86.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b1ce368a07af..879a99987401 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -5932,6 +5933,43 @@ static void tsc_khz_changed(void *data)
__this_cpu_write(cpu_tsc_khz, khz);
 }
 
+static void kvm_hyperv_tsc_notifier(void)
+{
+#ifdef CONFIG_X86_64
+   struct kvm *kvm;
+   struct kvm_vcpu *vcpu;
+   int cpu;
+
+   spin_lock(&kvm_lock);
+   list_for_each_entry(kvm, &vm_list, vm_list)
+   kvm_make_mclock_inprogress_request(kvm);
+
+   hyperv_stop_tsc_emulation();
+
+   /* TSC frequency always matches when on Hyper-V */
+   for_each_present_cpu(cpu)
+   per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
+   kvm_max_guest_tsc_khz = tsc_khz;
+
+   list_for_each_entry(kvm, &vm_list, vm_list) {
+   struct kvm_arch *ka = &kvm->arch;
+
+   spin_lock(&ka->pvclock_gtod_sync_lock);
+
+   pvclock_update_vm_gtod_copy(kvm);
+
+   kvm_for_each_vcpu(cpu, vcpu, kvm)
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+   kvm_for_each_vcpu(cpu, vcpu, kvm)
+   kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+
+   spin_unlock(&ka->pvclock_gtod_sync_lock);
+   }
+   spin_unlock(&kvm_lock);
+#endif
+}
+
static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 void *data)
 {
@@ -6217,6 +6255,9 @@ int kvm_arch_init(void *opaque)
kvm_lapic_init();
 #ifdef CONFIG_X86_64
pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
+   if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+   set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
 #endif
 
return 0;
@@ -6229,6 +6270,10 @@ int kvm_arch_init(void *opaque)
 
 void kvm_arch_exit(void)
 {
+#ifdef CONFIG_X86_64
+   if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+   clear_hv_tscchange_cb();
+#endif
kvm_lapic_exit();
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
-- 
2.14.3



[PATCH v4 6/7] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V

2018-01-24 Thread Vitaly Kuznetsov
Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource we use on the host
is TSC. When running nested on Hyper-V we normally use a different one:
TSC page, which is resistant to TSC frequency changes on events like L1
migration. Add support for it in KVM.

The only non-trivial change in the patch is in vgettsc(): when updating
our gtod copy we now need to get both the clock read and the TSC value.
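The read side in do_monotonic_boot() is a seqcount retry loop: sample the sequence, read all the fields, and retry if the sequence moved underneath. A Python model of that pattern with invented field names (here the snapshot is static, so the loop converges on the first pass):

```python
def seqcount_read(snapshot):
    """Retry loop: keep re-reading until the writer's sequence number is
    unchanged (and even), guaranteeing a consistent view of the fields."""
    while True:
        seq = snapshot['seq']
        if seq & 1:          # odd sequence: writer in progress, try again
            continue
        ns = snapshot['nsec_base'] + snapshot['delta_ns']
        if snapshot['seq'] == seq:   # nothing changed underneath us
            return ns

# Static snapshot standing in for pvclock_gtod_data:
gtod = {'seq': 2, 'nsec_base': 1_000_000, 'delta_ns': 500}
ns = seqcount_read(gtod)
```

This is why vgettsc() must now return the mode alongside the value: the mode has to be sampled inside the same retry window as the rest of the gtod copy.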

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kvm/x86.c | 93 +++---
 1 file changed, 68 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c53298dfbf50..b1ce368a07af 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -67,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -1377,6 +1378,11 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
return tsc;
 }
 
+static inline int gtod_is_based_on_tsc(int mode)
+{
+   return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
@@ -1396,7 +1402,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 * perform request to enable masterclock.
 */
if (ka->use_master_clock ||
-   (gtod->clock.vclock_mode == VCLOCK_TSC && vcpus_matched))
+   (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -1459,6 +1465,19 @@ static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
vcpu->arch.tsc_offset = offset;
 }
 
+static inline bool kvm_check_tsc_unstable(void)
+{
+#ifdef CONFIG_X86_64
+   /*
+* TSC is marked unstable when we're running on Hyper-V,
+* 'TSC page' clocksource is good.
+*/
+   if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK)
+   return false;
+#endif
+   return check_tsc_unstable();
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct kvm *kvm = vcpu->kvm;
@@ -1504,7 +1523,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
  */
if (synchronizing &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
-   if (!check_tsc_unstable()) {
+   if (!kvm_check_tsc_unstable()) {
offset = kvm->arch.cur_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
@@ -1604,18 +1623,43 @@ static u64 read_tsc(void)
return last;
 }
 
-static inline u64 vgettsc(u64 *cycle_now)
+static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
 {
long v;
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+   u64 tsc_pg_val;
+
+   switch (gtod->clock.vclock_mode) {
+   case VCLOCK_HVCLOCK:
+   tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
+ tsc_timestamp);
+   if (tsc_pg_val != U64_MAX) {
+   /* TSC page valid */
+   *mode = VCLOCK_HVCLOCK;
+   v = (tsc_pg_val - gtod->clock.cycle_last) &
+   gtod->clock.mask;
+   } else {
+   /* TSC page invalid */
+   *mode = VCLOCK_NONE;
+   }
+   break;
+   case VCLOCK_TSC:
+   *mode = VCLOCK_TSC;
+   *tsc_timestamp = read_tsc();
+   v = (*tsc_timestamp - gtod->clock.cycle_last) &
+   gtod->clock.mask;
+   break;
+   default:
+   *mode = VCLOCK_NONE;
+   }
 
-   *cycle_now = read_tsc();
+   if (*mode == VCLOCK_NONE)
+   *tsc_timestamp = v = 0;
 
-   v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
return v * gtod->clock.mult;
 }
 
-static int do_monotonic_boot(s64 *t, u64 *cycle_now)
+static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
 {
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -1624,9 +1668,8 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 
do {
seq = read_seqcount_begin(&gtod->seq);
-   mode = gtod->clock.vclock_mode;
ns = gtod->nsec_base;
-   ns += vgettsc(cycle_now);
+   ns += vgettsc(tsc_timestamp, &mode);
ns >>= gtod->clock.shift;
ns += gtod->boot_ns;
} while (unlikely(read_seqcount_retry(&gtod->seq, seq)))

[PATCH v4 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-24 Thread Vitaly Kuznetsov
Hyper-V supports Live Migration notification. This is supposed to be used
in conjunction with TSC emulation: when we are migrated to a host with a
different TSC frequency, the host emulates our accesses to TSC for a short
period and sends us an interrupt to notify about the event. When we're
done updating everything we can disable TSC emulation and everything will
start working fast again.

We didn't need these notifications before as Hyper-V guests are not
supposed to use TSC as a clocksource: in Linux we even mark it as unstable
on boot. Guests normally use the 'TSC page' clocksource and the host
updates its values on migrations automatically.

Things change when we want to run nested virtualization: even when we pass
through PV clocksources (kvm-clock or tsc page) to our guests we need to
know TSC frequency and when it changes.

Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
the fix is on the way.
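The REENLIGHTENMENT_CONTROL MSR written in this patch is a packed 64-bit value. Going by the struct hv_reenlightenment_control layout the patch adds (vector in bits 0-7, an enabled bit at bit 16, target_vp in the upper 32 bits — verify against the TLFS before relying on this), the rdmsrl/wrmsrl round trip can be modeled as:

```python
def pack_reenlightenment_control(vector, enabled, target_vp):
    """Pack the bitfields into the u64 that wrmsrl() writes:
    vector bits 0-7, enabled bit 16, target_vp bits 32-63;
    reserved bits stay zero."""
    return (vector & 0xff) | ((enabled & 1) << 16) | ((target_vp & 0xffffffff) << 32)

def unpack_target_vp(msr_val):
    """Recover target_vp, as the CPU-offline path does after rdmsrl()."""
    return (msr_val >> 32) & 0xffffffff

msr = pack_reenlightenment_control(vector=0xf0, enabled=1, target_vp=5)
```

In the kernel this packing is done implicitly by casting the bitfield struct to a u64, which is why the rdmsrl/wrmsrl calls take `*((u64 *)&re_ctrl)`.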

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
---
v3 -> v4:
- drop 'include ' (and __visible/_irq_entry from
  hyperv_reenlightenment_intr declaration) from mshyperv.h
  [kbuild robot]
---
 arch/x86/entry/entry_32.S  |  3 ++
 arch/x86/entry/entry_64.S  |  3 ++
 arch/x86/hyperv/hv_init.c  | 89 ++
 arch/x86/include/asm/irq_vectors.h |  7 ++-
 arch/x86/include/asm/mshyperv.h|  9 
 arch/x86/include/uapi/asm/hyperv.h | 27 
 arch/x86/kernel/cpu/mshyperv.c |  6 +++
 7 files changed, 143 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 60c4c342316c..2938cf89d473 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -894,6 +894,9 @@ BUILD_INTERRUPT3(xen_hvm_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
 BUILD_INTERRUPT3(hyperv_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
 hyperv_vector_handler)
 
+BUILD_INTERRUPT3(hyperv_reenlightenment_vector, HYPERV_REENLIGHTENMENT_VECTOR,
+hyperv_reenlightenment_intr)
+
 #endif /* CONFIG_HYPERV */
 
 ENTRY(page_fault)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ff6f8022612c..b04cfd7f6e4c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1244,6 +1244,9 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 #if IS_ENABLED(CONFIG_HYPERV)
 apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
hyperv_callback_vector hyperv_vector_handler
+
+apicinterrupt3 HYPERV_REENLIGHTENMENT_VECTOR \
+   hyperv_reenlightenment_vector hyperv_reenlightenment_intr
 #endif /* CONFIG_HYPERV */
 
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1a6c63f721bc..712ac40081f7 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -18,6 +18,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -102,6 +104,93 @@ static int hv_cpu_init(unsigned int cpu)
return 0;
 }
 
+static void (*hv_reenlightenment_cb)(void);
+
+static void hv_reenlightenment_notify(struct work_struct *dummy)
+{
+   struct hv_tsc_emulation_status emu_status;
+
+   rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+
+   /* Don't issue the callback if TSC accesses are not emulated */
+   if (hv_reenlightenment_cb && emu_status.inprogress)
+   hv_reenlightenment_cb();
+}
+static DECLARE_DELAYED_WORK(hv_reenlightenment_work, hv_reenlightenment_notify);
+
+void hyperv_stop_tsc_emulation(void)
+{
+   u64 freq;
+   struct hv_tsc_emulation_status emu_status;
+
+   rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+   emu_status.inprogress = 0;
+   wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+
+   rdmsrl(HV_X64_MSR_TSC_FREQUENCY, freq);
+   tsc_khz = div64_u64(freq, 1000);
+}
+EXPORT_SYMBOL_GPL(hyperv_stop_tsc_emulation);
+
+static inline bool hv_reenlightenment_available(void)
+{
+   /*
+* Check for required features and priviliges to make TSC frequency
+* change notifications work.
+*/
+   return ms_hyperv.features & HV_X64_ACCESS_FREQUENCY_MSRS &&
+   ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE &&
+   ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT;
+}
+
+__visible void __irq_entry hyperv_reenlightenment_intr(struct pt_regs *regs)
+{
+   entering_ack_irq();
+
+   schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
+
+   exiting_irq();
+}
+
+void set_hv_tscchange_cb(void (*cb)(void))
+{
+   struct hv_reenlightenment_control re_ctrl = {
+   .vector = HYPERV_R

[PATCH v4 2/7] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously

2018-01-24 Thread Vitaly Kuznetsov
This is going to be used from KVM code where we need to get both the
TSC and the TSC page value.

Nobody is supposed to use the function when Hyper-V code is compiled out,
it just BUG()s.
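The reference time hv_read_tsc_page_tsc() computes is a 64.64 fixed-point operation: `mul_u64_u64_shr(cur_tsc, scale, 64) + offset`. With Python's arbitrary-precision integers the same arithmetic is easy to check directly:

```python
def tsc_page_read(tsc, scale, offset):
    """Reference counter value: the high 64 bits of the 128-bit product
    tsc * scale, plus offset -- i.e. mul_u64_u64_shr(tsc, scale, 64) + offset,
    truncated to u64 as in the kernel."""
    return (((tsc * scale) >> 64) + offset) & ((1 << 64) - 1)

# With scale == 2**64 the multiplier is exactly 1.0 in 64.64 fixed point,
# so the result is simply tsc + offset.
ref = tsc_page_read(tsc=1000, scale=1 << 64, offset=25)
```

The seqcount loop around this computation exists because the hypervisor can update scale and offset at any time (e.g. during migration); both must come from the same generation of the page.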

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/hyperv/hv_init.c   |  1 +
 arch/x86/include/asm/mshyperv.h | 23 +++
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 21f9d53d9f00..1a6c63f721bc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -37,6 +37,7 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
return tsc_pg;
 }
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
 
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 8bf450b13d9f..6b1d4ea78270 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -325,9 +325,10 @@ static inline void hyperv_setup_mmu_ops(void) {}
 
 #ifdef CONFIG_HYPERV_TSCPAGE
 struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+  u64 *cur_tsc)
 {
-   u64 scale, offset, cur_tsc;
+   u64 scale, offset;
u32 sequence;
 
/*
@@ -358,7 +359,7 @@ static inline u64 hv_read_tsc_page(const struct 
ms_hyperv_tsc_page *tsc_pg)
 
scale = READ_ONCE(tsc_pg->tsc_scale);
offset = READ_ONCE(tsc_pg->tsc_offset);
-   cur_tsc = rdtsc_ordered();
+   *cur_tsc = rdtsc_ordered();
 
/*
 * Make sure we read sequence after we read all other values
@@ -368,7 +369,14 @@ static inline u64 hv_read_tsc_page(const struct 
ms_hyperv_tsc_page *tsc_pg)
 
} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
 
-   return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+   return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+   u64 cur_tsc;
+
+   return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
 }
 
 #else
@@ -376,5 +384,12 @@ static inline struct ms_hyperv_tsc_page 
*hv_get_tsc_page(void)
 {
return NULL;
 }
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+  u64 *cur_tsc)
+{
+   BUG();
+   return U64_MAX;
+}
 #endif
 #endif
-- 
2.14.3

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-24 Thread Vitaly Kuznetsov
Hyper-V reenlightenment interrupts arrive when the VM is migrated, so we're
not supposed to see many of them. However, it may be important to know
that the event has happened in case we have L2 nested guests.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/hyperv/hv_init.c  | 2 ++
 arch/x86/include/asm/hardirq.h | 3 +++
 arch/x86/kernel/irq.c  | 9 +
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e4377e2f2a10..a3adece392f1 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -147,6 +147,8 @@ __visible void __irq_entry 
hyperv_reenlightenment_intr(struct pt_regs *regs)
 {
entering_ack_irq();
 
+   inc_irq_stat(irq_hv_reenlightenment_count);
+
 schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
 
exiting_irq();
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 51cc979dd364..7c341a74ec8c 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -38,6 +38,9 @@ typedef struct {
 #if IS_ENABLED(CONFIG_HYPERV) || defined(CONFIG_XEN)
unsigned int irq_hv_callback_count;
 #endif
+#if IS_ENABLED(CONFIG_HYPERV)
+   unsigned int irq_hv_reenlightenment_count;
+#endif
 } cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 68e1867cca80..45fb4d2565f8 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -141,6 +141,15 @@ int arch_show_interrupts(struct seq_file *p, int prec)
   irq_stats(j)->irq_hv_callback_count);
seq_puts(p, "  Hypervisor callback interrupts\n");
}
+#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+   if (test_bit(HYPERV_REENLIGHTENMENT_VECTOR, system_vectors)) {
+   seq_printf(p, "%*s: ", prec, "HRE");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+  irq_stats(j)->irq_hv_reenlightenment_count);
+   seq_puts(p, "  Hyper-V reenlightenment interrupts\n");
+   }
 #endif
 seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
-- 
2.14.3



[PATCH v4 0/7] x86/kvm/hyperv: stable clocksource for L2 guests when running nested KVM on Hyper-V

2018-01-24 Thread Vitaly Kuznetsov
Changes since v3:
- PATCH3:
  - drop 'include <linux/interrupt.h>' (and __visible/__irq_entry from
hyperv_reenlightenment_intr declaration -- but not from implementation)
from mshyperv.h [kbuild robot]. I'm retaining Thomas' Reviewed-by tag as
the change is small, hope that's OK.
- PATCH7:
  - add static to kvm_hyperv_tsc_notifier() [kbuild robot]. I'm also
keeping Paolo's Acked-by as the change is trivial.

So currently only PATCH2 of the series doesn't have a Reviewed-by/
Acked-by tag.

Original description:

Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
so called 'masterclock' mode and this is only possible when the clocksource
on the host is TSC. When running nested on Hyper-V we're using a different
clocksource in L1 (Hyper-V TSC Page) which can actually be used for
masterclock. This series brings the required support.

Making KVM work with TSC page clocksource is relatively easy, it is done in
PATCH 6 of the series. All the rest is required to support L1 migration
when TSC frequency changes, we use a special feature from Hyper-V to do
the job.

Vitaly Kuznetsov (7):
  x86/hyper-v: check for required privileges in hyperv_init()
  x86/hyper-v: add a function to read both TSC and TSC page value
simultaneously
  x86/hyper-v: reenlightenment notifications support
  x86/hyper-v: redirect reenlightenment notifications on CPU offlining
  x86/irq: Count Hyper-V reenlightenment interrupts
  x86/kvm: pass stable clocksource to guests when running nested on
Hyper-V
  x86/kvm: support Hyper-V reenlightenment

 arch/x86/entry/entry_32.S  |   3 +
 arch/x86/entry/entry_64.S  |   3 +
 arch/x86/hyperv/hv_init.c  | 123 -
 arch/x86/include/asm/hardirq.h |   3 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/mshyperv.h|  32 +++--
 arch/x86/include/uapi/asm/hyperv.h |  27 
 arch/x86/kernel/cpu/mshyperv.c |   6 ++
 arch/x86/kernel/irq.c  |   9 +++
 arch/x86/kvm/x86.c | 138 ++---
 10 files changed, 319 insertions(+), 32 deletions(-)

-- 
2.14.3



[PATCH v4 1/7] x86/hyper-v: check for required privileges in hyperv_init()

2018-01-24 Thread Vitaly Kuznetsov
In hyperv_init() we presume we always have access to the VP index and
hypercall MSRs while, according to the specification, we should check whether
we're allowed to access the corresponding MSRs before accessing them.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/hyperv/hv_init.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 189a398290db..21f9d53d9f00 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -110,12 +110,19 @@ static int hv_cpu_init(unsigned int cpu)
  */
 void hyperv_init(void)
 {
-   u64 guest_id;
+   u64 guest_id, required_msrs;
union hv_x64_msr_hypercall_contents hypercall_msr;
 
if (x86_hyper_type != X86_HYPER_MS_HYPERV)
return;
 
+   /* Absolutely required MSRs */
+   required_msrs = HV_X64_MSR_HYPERCALL_AVAILABLE |
+   HV_X64_MSR_VP_INDEX_AVAILABLE;
+
+   if ((ms_hyperv.features & required_msrs) != required_msrs)
+   return;
+
/* Allocate percpu VP index */
hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
GFP_KERNEL);
-- 
2.14.3



[PATCH] x86/hyper-v: stop suppressing X86_FEATURE_PCID

2018-01-24 Thread Vitaly Kuznetsov
When hypercall-based TLB flush was enabled for Hyper-V guests, the PCID
feature was deliberately suppressed as a precaution: back then PCID was never
exposed to Hyper-V guests and it wasn't clear what would happen if some day
it became available. The day came and PCID/INVPCID features are already
exposed on certain Hyper-V hosts.

I was asked if we can stop suppressing PCID. From the TLFS (as of 5.0b) it is
unclear how TLB flush hypercalls combine with PCID. In particular, I was
worried about our usage of PCID where it is per-cpu based: the same mm gets
different CR3 values on different CPUs and if hypercall does exact matching
we're screwed. However, this is not the case. David Zhang writes:

"In practice, the AddressSpace argument is ignored on any VM that supports
 PCIDs.

Architecturally, the AddressSpace argument must match the CR3 with PCID
bits stripped out (i.e., the low 12 bits of AddressSpace should be 0 in
long mode). The flush hypercalls flush all PCIDs for the specified
AddressSpace."

With this, PCID can be enabled.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/hyperv/mmu.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 9cc9e1c1e2db..993a5dff1e40 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -137,7 +137,12 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
}
 
if (info->mm) {
-   flush->address_space = virt_to_phys(info->mm->pgd);
+   /*
+* AddressSpace argument must match the CR3 with PCID bits
+* stripped out.
+*/
+   flush->address_space =
+   virt_to_phys(info->mm->pgd) & CR3_ADDR_MASK;
flush->flags = 0;
} else {
flush->address_space = 0;
@@ -219,7 +224,12 @@ static void hyperv_flush_tlb_others_ex(const struct 
cpumask *cpus,
}
 
if (info->mm) {
-   flush->address_space = virt_to_phys(info->mm->pgd);
+   /*
+* AddressSpace argument must match the CR3 with PCID bits
+* stripped out.
+*/
+   flush->address_space =
+   virt_to_phys(info->mm->pgd) & CR3_ADDR_MASK;
flush->flags = 0;
} else {
flush->address_space = 0;
@@ -278,8 +288,6 @@ void hyperv_setup_mmu_ops(void)
if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
return;
 
-   setup_clear_cpu_cap(X86_FEATURE_PCID);
-
if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED)) {
pr_info("Using hypercall for remote TLB flush\n");
pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
-- 
2.14.3



Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-23 Thread Vitaly Kuznetsov
"Michael Kelley (EOSG)" <michael.h.kel...@microsoft.com> writes:

> Vitaly Kuznetsov <vkuzn...@redhat.com> writes:
>> 
>> > "Michael Kelley (EOSG)" <michael.h.kel...@microsoft.com> writes:
>> >
>> >> On Fri, 19 Jan 2018, Thomas Gleixner wrote:
>> >>
>> >>>
>> >>> You added '#include <linux/interrupt.h>' to mshyperv.h which is included 
>> >>> in
>> >>> vclock_gettime.c and pulls in other stuff which fails to expand
>> >>
>> >> Is the declaration of hyperv_reenlightenment_intr() even needed in
>> >> mshyperv.h?  The '#include <linux/interrupt.h>' is there for the 
>> >> __irq_entry
>> >> annotation on that declaration.   There's a declaration of the parallel 
>> >> (and
>> >> unannotated) hyperv_vector_handler(), but that looks like a fossil that
>> >> isn't needed either.
>> >>
>> >
>> > True,
>> >
>> > the only need for the declaration in mshyperv.h is to silence "warning:
>> > no previous prototype for ‘hyperv_reenlightenment_intr’"; I'm not sure
>> > if this actually needs fixing.
>> 
>> It seems we can get away with dropping '__visible' and '__irq_entry'
>> qualifiers from 'hyperv_reenlightenment_intr' declaration in mshyperv.h
>> while keeping them in hv_init.c for the implementation. This way we can
>> avoid the nasty 'no previous prototype' warning and not have to add
>> '#include <linux/interrupt.h>' to mshyperv.h breaking vdso.
>
> Where exactly is the 'no previous prototype' warning being generated?

It is generated while compiling the implementation in hv_init.c, every
non-static function needs a previously defined prototype.

-- 
  Vitaly


Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-23 Thread Vitaly Kuznetsov
Vitaly Kuznetsov <vkuzn...@redhat.com> writes:

> "Michael Kelley (EOSG)" <michael.h.kel...@microsoft.com> writes:
>
>> On Fri, 19 Jan 2018, Thomas Gleixner wrote:
>>
>>> 
>>> You added '#include <linux/interrupt.h>' to mshyperv.h which is included in
>>> vclock_gettime.c and pulls in other stuff which fails to expand
>>
>> Is the declaration of hyperv_reenlightenment_intr() even needed in
>> mshyperv.h?  The '#include <linux/interrupt.h>' is there for the __irq_entry
>> annotation on that declaration.   There's a declaration of the parallel (and
>> unannotated) hyperv_vector_handler(), but that looks like a fossil that
>> isn't needed either.
>>
>
> True,
>
> the only need for the declaration in mshyperv.h is to silence "warning:
> no previous prototype for ‘hyperv_reenlightenment_intr’"; I'm not sure
> if this actually needs fixing.

It seems we can get away with dropping '__visible' and '__irq_entry'
qualifiers from 'hyperv_reenlightenment_intr' declaration in mshyperv.h
while keeping them in hv_init.c for the implementation. This way we can
avoid the nasty 'no previous prototype' warning and not have to add
'#include <linux/interrupt.h>' to mshyperv.h breaking vdso.

I'm inclined to go ahead with this in v4 if nobody objects.

-- 
  Vitaly


Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-22 Thread Vitaly Kuznetsov
"Michael Kelley (EOSG)" <michael.h.kel...@microsoft.com> writes:

> On Fri, 19 Jan 2018, Thomas Gleixner wrote:
>
>> -Original Message-
>> From: Thomas Gleixner [mailto:t...@linutronix.de]
>> Sent: Friday, January 19, 2018 11:48 PM
>> To: Vitaly Kuznetsov <vkuzn...@redhat.com>
>> Cc: kbuild test robot <l...@intel.com>; kbuild-...@01.org; 
>> k...@vger.kernel.org;
>> x...@kernel.org; Stephen Hemminger <sthem...@microsoft.com>; Radim Krčmář
>> <rkrc...@redhat.com>; Haiyang Zhang <haiya...@microsoft.com>; linux-
>> ker...@vger.kernel.org; de...@linuxdriverproject.org; Michael Kelley (EOSG)
>> <michael.h.kel...@microsoft.com>; Ingo Molnar <mi...@redhat.com>; Roman Kagan
>> <rka...@virtuozzo.com>; Andy Lutomirski <l...@kernel.org>; H. Peter Anvin
>> <h...@zytor.com>; Paolo Bonzini <pbonz...@redhat.com>; Mohammed Gamal
>> <mmo...@redhat.com>
>> Subject: Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications 
>> support
>> 
>> On Fri, 19 Jan 2018, Vitaly Kuznetsov wrote:
>> > kbuild test robot <l...@intel.com> writes:
>> >
>> > > Hi Vitaly,
>> > >
>> > > Thank you for the patch! Perhaps something to improve:
>> > >
>> > > [auto build test WARNING on tip/auto-latest]
>> > > [also build test WARNING on v4.15-rc8 next-20180118]
>> > > [cannot apply to tip/x86/core kvm/linux-next]
>> > > [if your patch is applied to the wrong git tree, please drop us a note 
>> > > to help improve the
>> system]
>> > >
>> > > url:
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2F0day-
>> ci%2Flinux%2Fcommits%2FVitaly-Kuznetsov%2Fx86-kvm-hyperv-stable-clocksorce-for-L2-
>> guests-when-running-nested-KVM-on-Hyper-V%2F20180119-
>> 160814=02%7C01%7CMichael.H.Kelley%40microsoft.com%7Ce95c66107da6446826830
>> 8d55fda2c2b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636520313062777623
>> data=kAXl3mLVUdJi%2BsB4Ub0fmUHQfl6NuUDjW%2FAY9%2BFLZE4%3D=0
>> > > config: x86_64-allmodconfig (attached as .config)
>> > > compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
>> > > reproduce:
>> > > # save the attached .config to linux build tree
>> > > make ARCH=x86_64
>> > >
>> > > All warnings (new ones prefixed by >>):
>> > >
>> > >In file included from include/linux/kasan.h:17:0,
>> > > from include/linux/slab.h:129,
>> > > from include/linux/irq.h:26,
>> > > from arch/x86/include/asm/hardirq.h:6,
>> > > from include/linux/hardirq.h:9,
>> > > from include/linux/interrupt.h:13,
>> > > from arch/x86/include/asm/mshyperv.h:8,
>> > > from 
>> > > arch/x86//entry/vdso/vdso32/../vclock_gettime.c:20,
>> > > from arch/x86//entry/vdso/vdso32/vclock_gettime.c:33:
>> > >arch/x86/include/asm/pgtable.h: In function 'clone_pgd_range':
>> > >arch/x86/include/asm/pgtable.h:1129:9: error: implicit declaration of 
>> > > function
>> 'kernel_to_user_pgdp'; did you mean 'u64_to_user_ptr'? 
>> [-Werror=implicit-function-
>> declaration]
>> > >  memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
>> > > ^~~
>> >
>> > Sorry but I'm failing to see how this (and all the rest) is related to
>> > my patch ...
>> 
>> You added '#include <linux/interrupt.h>' to mshyperv.h which is included in
>> vclock_gettime.c and pulls in other stuff which fails to expand
>
> Is the declaration of hyperv_reenlightenment_intr() even needed in
> mshyperv.h?  The '#include <linux/interrupt.h>' is there for the __irq_entry
> annotation on that declaration.   There's a declaration of the parallel (and
> unannotated) hyperv_vector_handler(), but that looks like a fossil that
> isn't needed either.
>

True,

the only need for the declaration in mshyperv.h is to silence "warning:
no previous prototype for ‘hyperv_reenlightenment_intr’"; I'm not sure
if this actually needs fixing.

-- 
  Vitaly


Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-22 Thread Vitaly Kuznetsov
Thomas Gleixner <t...@linutronix.de> writes:

> On Fri, 19 Jan 2018, Vitaly Kuznetsov wrote:
>> kbuild test robot <l...@intel.com> writes:
>> 
>> > Hi Vitaly,
>> >
>> > Thank you for the patch! Perhaps something to improve:
>> >
>> > [auto build test WARNING on tip/auto-latest]
>> > [also build test WARNING on v4.15-rc8 next-20180118]
>> > [cannot apply to tip/x86/core kvm/linux-next]
>> > [if your patch is applied to the wrong git tree, please drop us a note to 
>> > help improve the system]
>> >
>> > url:
>> > https://github.com/0day-ci/linux/commits/Vitaly-Kuznetsov/x86-kvm-hyperv-stable-clocksorce-for-L2-guests-when-running-nested-KVM-on-Hyper-V/20180119-160814
>> > config: x86_64-allmodconfig (attached as .config)
>> > compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
>> > reproduce:
>> > # save the attached .config to linux build tree
>> > make ARCH=x86_64 
>> >
>> > All warnings (new ones prefixed by >>):
>> >
>> >In file included from include/linux/kasan.h:17:0,
>> > from include/linux/slab.h:129,
>> > from include/linux/irq.h:26,
>> > from arch/x86/include/asm/hardirq.h:6,
>> > from include/linux/hardirq.h:9,
>> > from include/linux/interrupt.h:13,
>> > from arch/x86/include/asm/mshyperv.h:8,
>> > from 
>> > arch/x86//entry/vdso/vdso32/../vclock_gettime.c:20,
>> > from arch/x86//entry/vdso/vdso32/vclock_gettime.c:33:
>> >arch/x86/include/asm/pgtable.h: In function 'clone_pgd_range':
>> >arch/x86/include/asm/pgtable.h:1129:9: error: implicit declaration of 
>> > function 'kernel_to_user_pgdp'; did you mean 'u64_to_user_ptr'? 
>> > [-Werror=implicit-function-declaration]
>> >  memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
>> > ^~~
>> 
>> Sorry but I'm failing to see how this (and all the rest) is related to
>> my patch ...
>
> You added '#include <linux/interrupt.h>' to mshyperv.h which is included in
> vclock_gettime.c and pulls in other stuff which fails to expand
>

Oh, right, thanks! I'll see what can be done.

-- 
  Vitaly


Re: [PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-19 Thread Vitaly Kuznetsov
kbuild test robot <l...@intel.com> writes:

> Hi Vitaly,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on tip/auto-latest]
> [also build test WARNING on v4.15-rc8 next-20180118]
> [cannot apply to tip/x86/core kvm/linux-next]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Vitaly-Kuznetsov/x86-kvm-hyperv-stable-clocksorce-for-L2-guests-when-running-nested-KVM-on-Hyper-V/20180119-160814
> config: x86_64-allmodconfig (attached as .config)
> compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
>
> All warnings (new ones prefixed by >>):
>
>In file included from include/linux/kasan.h:17:0,
> from include/linux/slab.h:129,
> from include/linux/irq.h:26,
> from arch/x86/include/asm/hardirq.h:6,
> from include/linux/hardirq.h:9,
> from include/linux/interrupt.h:13,
> from arch/x86/include/asm/mshyperv.h:8,
> from arch/x86//entry/vdso/vdso32/../vclock_gettime.c:20,
> from arch/x86//entry/vdso/vdso32/vclock_gettime.c:33:
>arch/x86/include/asm/pgtable.h: In function 'clone_pgd_range':
>arch/x86/include/asm/pgtable.h:1129:9: error: implicit declaration of 
> function 'kernel_to_user_pgdp'; did you mean 'u64_to_user_ptr'? 
> [-Werror=implicit-function-declaration]
>  memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
> ^~~

Sorry but I'm failing to see how this (and all the rest) is related to
my patch ...

[snip]

-- 
  Vitaly


[PATCH v3 6/7] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V

2018-01-16 Thread Vitaly Kuznetsov
Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource we use on the host
is TSC. When running nested on Hyper-V we normally use a different one:
the TSC page, which is resistant to TSC frequency changes on events like L1
migration. Add support for it in KVM.

The only non-trivial change in the patch is in vgettsc(): when updating
our gtod copy we now need to get both the clock read and the TSC value.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kvm/x86.c | 93 +++---
 1 file changed, 68 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1cec2c62a0b0..f14e0129c8f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -67,6 +67,7 @@
 #include 
 #include 
 #include 
+#include <asm/mshyperv.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -1377,6 +1378,11 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 
kernel_ns)
return tsc;
 }
 
+static inline int gtod_is_based_on_tsc(int mode)
+{
+   return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
@@ -1396,7 +1402,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 * perform request to enable masterclock.
 */
if (ka->use_master_clock ||
-   (gtod->clock.vclock_mode == VCLOCK_TSC && vcpus_matched))
+   (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -1459,6 +1465,19 @@ static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu 
*vcpu, u64 offset)
vcpu->arch.tsc_offset = offset;
 }
 
+static inline bool kvm_check_tsc_unstable(void)
+{
+#ifdef CONFIG_X86_64
+   /*
+* TSC is marked unstable when we're running on Hyper-V,
+* 'TSC page' clocksource is good.
+*/
+   if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK)
+   return false;
+#endif
+   return check_tsc_unstable();
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct kvm *kvm = vcpu->kvm;
@@ -1504,7 +1523,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data 
*msr)
  */
if (synchronizing &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
-   if (!check_tsc_unstable()) {
+   if (!kvm_check_tsc_unstable()) {
offset = kvm->arch.cur_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
@@ -1604,18 +1623,43 @@ static u64 read_tsc(void)
return last;
 }
 
-static inline u64 vgettsc(u64 *cycle_now)
+static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
 {
long v;
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+   u64 tsc_pg_val;
+
+   switch (gtod->clock.vclock_mode) {
+   case VCLOCK_HVCLOCK:
+   tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
+ tsc_timestamp);
+   if (tsc_pg_val != U64_MAX) {
+   /* TSC page valid */
+   *mode = VCLOCK_HVCLOCK;
+   v = (tsc_pg_val - gtod->clock.cycle_last) &
+   gtod->clock.mask;
+   } else {
+   /* TSC page invalid */
+   *mode = VCLOCK_NONE;
+   }
+   break;
+   case VCLOCK_TSC:
+   *mode = VCLOCK_TSC;
+   *tsc_timestamp = read_tsc();
+   v = (*tsc_timestamp - gtod->clock.cycle_last) &
+   gtod->clock.mask;
+   break;
+   default:
+   *mode = VCLOCK_NONE;
+   }
 
-   *cycle_now = read_tsc();
+   if (*mode == VCLOCK_NONE)
+   *tsc_timestamp = v = 0;
 
-   v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
return v * gtod->clock.mult;
 }
 
-static int do_monotonic_boot(s64 *t, u64 *cycle_now)
+static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
 {
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -1624,9 +1668,8 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 
do {
seq = read_seqcount_begin(&gtod->seq);
-   mode = gtod->clock.vclock_mode;
ns = gtod->nsec_base;
-   ns += vgettsc(cycle_now);
+   ns += vgettsc(tsc_timestamp, &mode);
ns >>= gtod->clock.shift;
ns += gtod->boot_ns;
} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));

[PATCH v3 7/7] x86/kvm: support Hyper-V reenlightenment

2018-01-16 Thread Vitaly Kuznetsov
When we run nested KVM on Hyper-V guests we need to update masterclocks for
all guests when L1 migrates to a host with different TSC frequency.
Implement the procedure in the following way:
- Pause all guests.
- Tell our host (Hyper-V) to stop emulating TSC accesses.
- Update our gtod copy, recompute clocks.
- Unpause all guests.

This is somewhat similar to cpufreq but we have two important differences:
we can only disable TSC emulation globally (on all CPUs) and we don't know
the new TSC frequency until we turn the emulation off so we can't
'prepare' ourselves for the event.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kvm/x86.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f14e0129c8f5..94f28e6002b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -5932,6 +5933,43 @@ static void tsc_khz_changed(void *data)
__this_cpu_write(cpu_tsc_khz, khz);
 }
 
+void kvm_hyperv_tsc_notifier(void)
+{
+#ifdef CONFIG_X86_64
+   struct kvm *kvm;
+   struct kvm_vcpu *vcpu;
+   int cpu;
+
spin_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list)
+   kvm_make_mclock_inprogress_request(kvm);
+
+   hyperv_stop_tsc_emulation();
+
+   /* TSC frequency always matches when on Hyper-V */
+   for_each_present_cpu(cpu)
+   per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
+   kvm_max_guest_tsc_khz = tsc_khz;
+
list_for_each_entry(kvm, &vm_list, vm_list) {
struct kvm_arch *ka = &kvm->arch;
+
spin_lock(&ka->pvclock_gtod_sync_lock);
+
+   pvclock_update_vm_gtod_copy(kvm);
+
+   kvm_for_each_vcpu(cpu, vcpu, kvm)
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+   kvm_for_each_vcpu(cpu, vcpu, kvm)
+   kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+
spin_unlock(&ka->pvclock_gtod_sync_lock);
+   }
spin_unlock(&kvm_lock);
+#endif
+}
+
 static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long 
val,
 void *data)
 {
@@ -6217,6 +6255,9 @@ int kvm_arch_init(void *opaque)
kvm_lapic_init();
 #ifdef CONFIG_X86_64
pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
+   if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+   set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
 #endif
 
return 0;
@@ -6229,6 +6270,10 @@ int kvm_arch_init(void *opaque)
 
 void kvm_arch_exit(void)
 {
+#ifdef CONFIG_X86_64
+   if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+   clear_hv_tscchange_cb();
+#endif
kvm_lapic_exit();
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
-- 
2.14.3



[PATCH v3 2/7] x86/hyper-v: add a function to read both TSC and TSC page value simultaneously

2018-01-16 Thread Vitaly Kuznetsov
This is going to be used from KVM code where we need to get both
TSC and TSC page value.

Nobody is supposed to use the function when Hyper-V code is compiled out,
just BUG().

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/hyperv/hv_init.c   |  1 +
 arch/x86/include/asm/mshyperv.h | 23 +++
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 21f9d53d9f00..1a6c63f721bc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -37,6 +37,7 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
return tsc_pg;
 }
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
 
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 8bf450b13d9f..6b1d4ea78270 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -325,9 +325,10 @@ static inline void hyperv_setup_mmu_ops(void) {}
 
 #ifdef CONFIG_HYPERV_TSCPAGE
 struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+  u64 *cur_tsc)
 {
-   u64 scale, offset, cur_tsc;
+   u64 scale, offset;
u32 sequence;
 
/*
@@ -358,7 +359,7 @@ static inline u64 hv_read_tsc_page(const struct 
ms_hyperv_tsc_page *tsc_pg)
 
scale = READ_ONCE(tsc_pg->tsc_scale);
offset = READ_ONCE(tsc_pg->tsc_offset);
-   cur_tsc = rdtsc_ordered();
+   *cur_tsc = rdtsc_ordered();
 
/*
 * Make sure we read sequence after we read all other values
@@ -368,7 +369,14 @@ static inline u64 hv_read_tsc_page(const struct 
ms_hyperv_tsc_page *tsc_pg)
 
} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
 
-   return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+   return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+   u64 cur_tsc;
+
+   return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
 }
 
 #else
@@ -376,5 +384,12 @@ static inline struct ms_hyperv_tsc_page 
*hv_get_tsc_page(void)
 {
return NULL;
 }
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+  u64 *cur_tsc)
+{
+   BUG();
+   return U64_MAX;
+}
 #endif
 #endif
-- 
2.14.3



[PATCH v3 4/7] x86/hyper-v: redirect reenlightenment notifications on CPU offlining

2018-01-16 Thread Vitaly Kuznetsov
It is very unlikely for CPUs to get offlined when we run on Hyper-V as
we have a protection in the vmbus module which prevents it when we have any
VMBus devices assigned. This, however, may change in the future if an option
to reassign an already active channel is added. It is also possible
to run without any Hyper-V devices or to have a CPU with no assigned channels.

Reassign reenlightenment notifications to some other active CPU when
the CPU which is assigned to get them goes offline.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
v2-> v3:
- Drop spinlock protection from hv_cpu_die() as cpu hotplug is already
  serialized [Thomas Gleixner]
- Use cpumask_any_but() in hv_cpu_die() [Thomas Gleixner]
---
 arch/x86/hyperv/hv_init.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 712ac40081f7..e4377e2f2a10 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -191,6 +191,26 @@ void clear_hv_tscchange_cb(void)
 }
 EXPORT_SYMBOL_GPL(clear_hv_tscchange_cb);
 
+static int hv_cpu_die(unsigned int cpu)
+{
+   struct hv_reenlightenment_control re_ctrl;
+   unsigned int new_cpu;
+
+   if (hv_reenlightenment_cb == NULL)
+   return 0;
+
+   rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+   if (re_ctrl.target_vp == hv_vp_index[cpu]) {
+   /* Reassign to some other online CPU */
+   new_cpu = cpumask_any_but(cpu_online_mask, cpu);
+
+   re_ctrl.target_vp = hv_vp_index[new_cpu];
+   wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+   }
+
+   return 0;
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
@@ -220,7 +240,7 @@ void hyperv_init(void)
return;
 
if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
- hv_cpu_init, NULL) < 0)
+ hv_cpu_init, hv_cpu_die) < 0)
goto free_vp_index;
 
/*
-- 
2.14.3



[PATCH v3 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-16 Thread Vitaly Kuznetsov
Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
not supposed to see many of them. However, it may be important to know
that the event has happened in case we have L2 nested guests.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/hyperv/hv_init.c  | 2 ++
 arch/x86/include/asm/hardirq.h | 3 +++
 arch/x86/kernel/irq.c  | 9 +
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e4377e2f2a10..a3adece392f1 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -147,6 +147,8 @@ __visible void __irq_entry hyperv_reenlightenment_intr(struct pt_regs *regs)
 {
entering_ack_irq();
 
+   inc_irq_stat(irq_hv_reenlightenment_count);
+
schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
 
exiting_irq();
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 51cc979dd364..7c341a74ec8c 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -38,6 +38,9 @@ typedef struct {
 #if IS_ENABLED(CONFIG_HYPERV) || defined(CONFIG_XEN)
unsigned int irq_hv_callback_count;
 #endif
+#if IS_ENABLED(CONFIG_HYPERV)
+   unsigned int irq_hv_reenlightenment_count;
+#endif
 } cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 68e1867cca80..45fb4d2565f8 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -141,6 +141,15 @@ int arch_show_interrupts(struct seq_file *p, int prec)
   irq_stats(j)->irq_hv_callback_count);
seq_puts(p, "  Hypervisor callback interrupts\n");
}
+#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+   if (test_bit(HYPERV_REENLIGHTENMENT_VECTOR, system_vectors)) {
+   seq_printf(p, "%*s: ", prec, "HRE");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+  irq_stats(j)->irq_hv_reenlightenment_count);
+   seq_puts(p, "  Hyper-V reenlightenment interrupts\n");
+   }
 #endif
seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
-- 
2.14.3



[PATCH v3 3/7] x86/hyper-v: reenlightenment notifications support

2018-01-16 Thread Vitaly Kuznetsov
Hyper-V supports Live Migration notification. This is supposed to be used
in conjunction with TSC emulation: when we are migrated to a host with a
different TSC frequency, for a short period the host emulates our accesses
to TSC and sends us an interrupt to notify about the event. When we're
done updating everything we can disable TSC emulation and everything will
start working fast again.

We didn't need these notifications before as Hyper-V guests are not
supposed to use TSC as a clocksource: in Linux we even mark it as unstable
on boot. Guests normally use the 'tsc page' clocksource and the host updates its
values on migrations automatically.

Things change when we want to run nested virtualization: even when we pass
through PV clocksources (kvm-clock or tsc page) to our guests we need to
know TSC frequency and when it changes.

Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
the fix is on the way.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
v2 -> v3:
- add __visible and __irq_entry [Thomas Gleixner]
---
 arch/x86/entry/entry_32.S  |  3 ++
 arch/x86/entry/entry_64.S  |  3 ++
 arch/x86/hyperv/hv_init.c  | 89 ++
 arch/x86/include/asm/irq_vectors.h |  7 ++-
 arch/x86/include/asm/mshyperv.h| 10 +
 arch/x86/include/uapi/asm/hyperv.h | 27 
 arch/x86/kernel/cpu/mshyperv.c |  6 +++
 7 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a1f28a54f23a..4041608cd208 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -883,6 +883,9 @@ BUILD_INTERRUPT3(xen_hvm_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
 BUILD_INTERRUPT3(hyperv_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
 hyperv_vector_handler)
 
+BUILD_INTERRUPT3(hyperv_reenlightenment_vector, HYPERV_REENLIGHTENMENT_VECTOR,
+hyperv_reenlightenment_intr)
+
 #endif /* CONFIG_HYPERV */
 
 ENTRY(page_fault)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4f8e1d35a97c..7da977b8d1dd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1233,6 +1233,9 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 #if IS_ENABLED(CONFIG_HYPERV)
 apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
hyperv_callback_vector hyperv_vector_handler
+
+apicinterrupt3 HYPERV_REENLIGHTENMENT_VECTOR \
+   hyperv_reenlightenment_vector hyperv_reenlightenment_intr
 #endif /* CONFIG_HYPERV */
 
 idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1a6c63f721bc..712ac40081f7 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -18,6 +18,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -102,6 +104,93 @@ static int hv_cpu_init(unsigned int cpu)
return 0;
 }
 
+static void (*hv_reenlightenment_cb)(void);
+
+static void hv_reenlightenment_notify(struct work_struct *dummy)
+{
+   struct hv_tsc_emulation_status emu_status;
+
+   rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+
+   /* Don't issue the callback if TSC accesses are not emulated */
+   if (hv_reenlightenment_cb && emu_status.inprogress)
+   hv_reenlightenment_cb();
+}
+static DECLARE_DELAYED_WORK(hv_reenlightenment_work, hv_reenlightenment_notify);
+
+void hyperv_stop_tsc_emulation(void)
+{
+   u64 freq;
+   struct hv_tsc_emulation_status emu_status;
+
+   rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+   emu_status.inprogress = 0;
+   wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+
+   rdmsrl(HV_X64_MSR_TSC_FREQUENCY, freq);
+   tsc_khz = div64_u64(freq, 1000);
+}
+EXPORT_SYMBOL_GPL(hyperv_stop_tsc_emulation);
+
+static inline bool hv_reenlightenment_available(void)
+{
+   /*
+* Check for required features and privileges to make TSC frequency
+* change notifications work.
+*/
+   return ms_hyperv.features & HV_X64_ACCESS_FREQUENCY_MSRS &&
+   ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE &&
+   ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT;
+}
+
+__visible void __irq_entry hyperv_reenlightenment_intr(struct pt_regs *regs)
+{
+   entering_ack_irq();
+
+   schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
+
+   exiting_irq();
+}
+
+void set_hv_tscchange_cb(void (*cb)(void))
+{
+   struct hv_reenlightenment_control re_ctrl = {
+   .vector = HYPERV_REENLIGHTENMENT_VECTOR,
+   .enabled = 1,
+   .target_vp = hv_vp_index[smp_processor_id()]
+   };
+   struct hv_t

[PATCH v3 1/7] x86/hyper-v: check for required privileges in hyperv_init()

2018-01-16 Thread Vitaly Kuznetsov
In hyperv_init() we presume we always have access to VP index and hypercall
MSRs while according to the specification we should check if we're allowed
to access the corresponding MSRs before accessing them.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/hyperv/hv_init.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 189a398290db..21f9d53d9f00 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -110,12 +110,19 @@ static int hv_cpu_init(unsigned int cpu)
  */
 void hyperv_init(void)
 {
-   u64 guest_id;
+   u64 guest_id, required_msrs;
union hv_x64_msr_hypercall_contents hypercall_msr;
 
if (x86_hyper_type != X86_HYPER_MS_HYPERV)
return;
 
+   /* Absolutely required MSRs */
+   required_msrs = HV_X64_MSR_HYPERCALL_AVAILABLE |
+   HV_X64_MSR_VP_INDEX_AVAILABLE;
+
+   if ((ms_hyperv.features & required_msrs) != required_msrs)
+   return;
+
/* Allocate percpu VP index */
hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
GFP_KERNEL);
-- 
2.14.3



[PATCH v3 0/7] x86/kvm/hyperv: stable clocksource for L2 guests when running nested KVM on Hyper-V

2018-01-16 Thread Vitaly Kuznetsov
Changes since v2:
- Add Paolo's Acked-by to PATCH6-7
- Add Thomas' Reviewed-by to PATCH1
- Update the description of PATCH2 to match the reality [Thomas Gleixner]
- Add __visible and __irq_entry annotations to hyperv_reenlightenment_intr()
  [Thomas Gleixner]
- Drop spinlock protection and use cpumask_any_but() in PATCH4
  [Thomas Gleixner]

Original description:

Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
so called 'masterclock' mode and this is only possible when the clocksource
on the host is TSC. When running nested on Hyper-V we're using a different
clocksource in L1 (Hyper-V TSC Page) which can actually be used for
masterclock. This series brings the required support.

Making KVM work with TSC page clocksource is relatively easy, it is done in
PATCH 6 of the series. All the rest is required to support L1 migration
when TSC frequency changes, we use a special feature from Hyper-V to do
the job.

Vitaly Kuznetsov (7):
  x86/hyper-v: check for required privileges in hyperv_init()
  x86/hyper-v: add a function to read both TSC and TSC page value
simultaneously
  x86/hyper-v: reenlightenment notifications support
  x86/hyper-v: redirect reenlightment notifications on CPU offlining
  x86/irq: Count Hyper-V reenlightenment interrupts
  x86/kvm: pass stable clocksource to guests when running nested on
Hyper-V
  x86/kvm: support Hyper-V reenlightenment

 arch/x86/entry/entry_32.S  |   3 +
 arch/x86/entry/entry_64.S  |   3 +
 arch/x86/hyperv/hv_init.c  | 123 -
 arch/x86/include/asm/hardirq.h |   3 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/mshyperv.h|  33 +++--
 arch/x86/include/uapi/asm/hyperv.h |  27 
 arch/x86/kernel/cpu/mshyperv.c |   6 ++
 arch/x86/kernel/irq.c  |   9 +++
 arch/x86/kvm/x86.c | 138 ++---
 10 files changed, 320 insertions(+), 32 deletions(-)

-- 
2.14.3



Re: [PATCH v2 2/7] x86/hyper-v: add a function to read both TSC and TSC page value simultaneously

2018-01-15 Thread Vitaly Kuznetsov
Thomas Gleixner <t...@linutronix.de> writes:

> On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:
>
>> This is going to be used from KVM code where we need to get both
>> TSC and TSC page value.
>> 
>> When Hyper-V code is compiled out just return rdtsc(), this will allow us
>> to avoid ugly ifdefs in non-Hyper-V code.
>
> That's not what the patch implements
>

Not anymore, thanks, will fix the description.

>> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page 
>> *tsc_pg,
>> +   u64 *cur_tsc)
>> +{
>> +BUG();
>> +return U64_MAX;
>> +}

-- 
  Vitaly


Re: [PATCH v2 4/7] x86/hyper-v: redirect reenlightment notifications on CPU offlining

2018-01-15 Thread Vitaly Kuznetsov
Thomas Gleixner <t...@linutronix.de> writes:

> On Wed, 13 Dec 2017, Vitaly Kuznetsov wrote:
>> +static int hv_cpu_die(unsigned int cpu)
>> +{
>> +struct hv_reenlightenment_control re_ctrl;
>> +int i;
>> +static DEFINE_SPINLOCK(lock);
>> +
>> +if (hv_reenlightenment_cb == NULL)
>> +return 0;
>> +
>> +/* Make sure the CPU we migrate to is not going away too */
>> +spin_lock(&lock);
>
> What kind of voodoo is this? CPU hotplug is serialized already...
>

Yes, someone around made a comment 'what happens if some day we'll have
parallel cpu hot[un]plug' and I added this. Not really needed, will drop
in v3.

>> +rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
>> +if (re_ctrl.target_vp == hv_vp_index[cpu]) {
>> +/* Find some other online CPU */
>> +for_each_online_cpu(i) {
>
>   cpu = cpumask_any_but(cpu_online_mask);
>
> Hmm?
>

Cool, thanks)

-- 
  Vitaly


Re: [PATCH v2 0/7] x86/kvm/hyperv: stable clocksource for L2 guests when running nested KVM on Hyper-V

2018-01-03 Thread Vitaly Kuznetsov
Vitaly Kuznetsov <vkuzn...@redhat.com> writes:

> Paolo Bonzini <pbonz...@redhat.com> writes:
>>
>> Patches 5-7 are
>>
>> Acked-by: Paolo Bonzini <pbonz...@redhat.com>
>>
>> I would appreciate if the Hyper-V folks can provide a topic branch to be
>> merged in both HV and KVM trees.
>>
>
> There's no such thing as Hyper-V tree, patches are usually getting
> merged through 'tip' tree when the majority of changes go to arch/x86 or
> Greg's char-misc tree when changes are drivers/hv heavy (+ net, scsi,
> pci, hid, ... trees for individual drivers).
>
> In this particular case the series is x86-heavy and I believe it should
> go through x86 'tip' tree.
>
> Thomas, Ingo, Peter, could you please take a look? Thanks!

Gentle ping after the holidays season :-)

-- 
  Vitaly


Re: [PATCH RFC 0/7] KVM: nVMX: enlightened VMCS initial implementation

2017-12-21 Thread Vitaly Kuznetsov
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 21/12/2017 13:50, Vitaly Kuznetsov wrote:
>> I'm back with (somewhat frustrating) results (E5-2603):
>
> v4 (that would be Broadwell)?
>

Sorry, v3, actually. Haswell. (the first one supporting vmcs shadowing afaiu).

>> 1) Windows on Hyper-V (no nesting): 1350 cycles
>> 
>> 2) Windows on Hyper-V on Hyper-V: 8600
>> 
>> 3) Windows on KVM (no nesting): 1150  cycles
>> 
>> 4) Windows on Hyper-V on KVM (no enlightened VMCS): 18200
>> 
>> 5) Windows on Hyper-V on KVM (enlightened VMCS): 17100
>
> What version were you using for KVM?  There are quite a few nested virt
> optimizations in kvm/queue (which may make enlightened VMCS both more or
> less efficient).

This is kvm/queue and I rebased enlightened VMCS patches to it.

>
> In particular, with latest kvm/queue you could try tracing vmread and
> vmwrite vmexits, and see if you get any.  If you do, that might be an
> easy few hundred cycles savings.

Will do.

-- 
  Vitaly


Re: [PATCH RFC 2/7] KVM: nVMX: modify vmcs12 fields to match Hyper-V enlightened VMCS

2017-12-21 Thread Vitaly Kuznetsov
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 19/12/2017 13:25, Vitaly Kuznetsov wrote:
>> 
>>> At this point in time, I don't think you can just blithely change the
>>> virtual VMCS layout and revision number. Existing VMs using the old
>>> layout and revision number must continue to work on versions of kvm
>>> past this point. You could tie the layout and revision number changes
>>> to KVM_CAP_HYPERV_ENLIGHTENED_VMCS if you like, but kvm must be able
>>> to continue to service VMs using the previous layout and revision
>>> number in perpetuity.
>>>
>> I see what you mean. In case we need to keep migration of nested
>> workloads working between KVMs of different versions we can't (ever)
>> touch vmcs12.
>
> Actually we can, for two reasons.
>
> First, the active VMCS is stored in host RAM (not in guest RAM).  This
> means there are clear points where to do the translation, namely vmptrld
> and the (not yet upstream) ioctl to set VMX state.
>
> Therefore you only need to keep an (offset, type) map from old to new
> layout map; at those two points if you detect an old VMCS12_REVISION you
> copy the fields one by one instead of doing a memcpy.  The next vmclear
> or vmptrld or get-VMX-state ioctl will automatically update to the new
> VMCS12_REVISION.  Of course, this is a one-way street unless you also
> add support for writing old VMCS12_REVISIONs.
>
> But, second, VMX state migration is not upstream yet, so nested
> hypervisors are currently not migratable: the active VMCS12 state will
> not be migrated at all!  So in upstream KVM we wouldn't even need to
> upgrade the VMCS12_REVISION to make changes to vmcs12.
>
> That said...
>
>> The way to go in this case, I think, is to create a completely separate
>> enlightened_vmcs12 struct and use it when appropriate. We can't possibly
>> support migrating workloads which use enlightened VMCS to an old KVM
>> which doesn't support it.
>
> ... this is probably a good idea as well.
>

One other thing I was thinking about is the shared definition of
enlightened vmcs which we'll use for both KVM-on-Hyper-V and Hyper-V on
KVM and for that purpose I'd like it to be placed outside of struct
vmcs12. We can, of course, embed it at the beginning of vmcs12.

Thinking long term (and having in mind that Microsoft will be updating
enlightened VMCS on its own schedule) -- what would be the preferred way
to go? It seems that personally I'm leaning towards untangling and
keeping it separate from vmcs12 but I can't really find a convincing
argument...

-- 
  Vitaly


Re: [PATCH RFC 0/7] KVM: nVMX: enlightened VMCS initial implementation

2017-12-21 Thread Vitaly Kuznetsov
Vitaly Kuznetsov <vkuzn...@redhat.com> writes:

> Paolo Bonzini <pbonz...@redhat.com> writes:
>
>> On 18/12/2017 18:17, Vitaly Kuznetsov wrote:
>>> The original author of these patches no longer works at Red Hat; I
>>> agreed to take this over and send upstream. Here is his original
>>> description:
>>> 
>>> "Makes KVM implement the enlightened VMCS feature per Hyper-V TLFS 5.0b.
>>> I've measured about %5 improvement in cost of a nested VM exit (Hyper-V
>>> enabled Windows Server 2016 nested in KVM)."
>>
>> Can you try reproducing this and see how much a simple CPUID loop costs in:
>>
>> * Hyper-V on Hyper-V (with enlightened VMCS, as a proxy for a full
>> implementation including the clean fields mask)
>>
>> * Hyper-V on KVM, with and without enlightened VMCS
>>
>> The latest kvm/queue branch already cut a lot of the cost of a nested VM
>> exit (from ~22000 to ~14000 clock cycles for KVM on KVM), so we could
>> also see if Hyper-V needs shadowing of more fields.
>
> I tested this series before sending out and was able to reproduce said
> 5% improvement with the feature (but didn't keep record of clock
> cycles). I'll try doing tests you mentioned on the same hardware and
> come back with the result. Hopefully I'll manage that before holidays.

I'm back with (somewhat frustrating) results (E5-2603):

1) Windows on Hyper-V (no nesting): 1350 cycles

2) Windows on Hyper-V on Hyper-V: 8600

3) Windows on KVM (no nesting): 1150  cycles

4) Windows on Hyper-V on KVM (no enlightened VMCS): 18200

5) Windows on Hyper-V on KVM (enlightened VMCS): 17100

-- 
  Vitaly


Re: [PATCH RFC 0/7] KVM: nVMX: enlightened VMCS initial implementation

2017-12-19 Thread Vitaly Kuznetsov
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 18/12/2017 18:17, Vitaly Kuznetsov wrote:
>> The original author of these patches no longer works at Red Hat; I
>> agreed to take this over and send upstream. Here is his original
>> description:
>> 
>> "Makes KVM implement the enlightened VMCS feature per Hyper-V TLFS 5.0b.
>> I've measured about %5 improvement in cost of a nested VM exit (Hyper-V
>> enabled Windows Server 2016 nested in KVM)."
>
> Can you try reproducing this and see how much a simple CPUID loop costs in:
>
> * Hyper-V on Hyper-V (with enlightened VMCS, as a proxy for a full
> implementation including the clean fields mask)
>
> * Hyper-V on KVM, with and without enlightened VMCS
>
> The latest kvm/queue branch already cut a lot of the cost of a nested VM
> exit (from ~22000 to ~14000 clock cycles for KVM on KVM), so we could
> also see if Hyper-V needs shadowing of more fields.

I tested this series before sending out and was able to reproduce said
5% improvement with the feature (but didn't keep record of clock
cycles). I'll try doing tests you mentioned on the same hardware and
come back with the result. Hopefully I'll manage that before holidays.

Thanks,

-- 
  Vitaly


Re: [PATCH RFC 2/7] KVM: nVMX: modify vmcs12 fields to match Hyper-V enlightened VMCS

2017-12-19 Thread Vitaly Kuznetsov
Jim Mattson  writes:

> At this point in time, I don't think you can just blithely change the
> virtual VMCS layout and revision number. Existing VMs using the old
> layout and revision number must continue to work on versions of kvm
> past this point. You could tie the layout and revision number changes
> to KVM_CAP_HYPERV_ENLIGHTENED_VMCS if you like, but kvm must be able
> to continue to service VMs using the previous layout and revision
> number in perpetuity.
>

I see what you mean. In case we need to keep migration of nested
workloads working between KVMs of different versions we can't (ever)
touch vmcs12.

The way to go in this case, I think, is to create a completely separate
enlightened_vmcs12 struct and use it when appropriate. We can't possibly
support migrating workloads which use enlightened VMCS to an old KVM
which doesn't support it.

P.S. "If there are changes in this struct, VMCS12_REVISION must be
changed." comment needs to be replaced with "Don't even think about
changing this" :-)

-- 
  Vitaly


[PATCH RFC 5/7] KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability

2017-12-18 Thread Vitaly Kuznetsov
From: Ladi Prosek <lpro...@redhat.com>

Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.

Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.

Signed-off-by: Ladi Prosek <lpro...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm.c  |  9 
 arch/x86/kvm/vmx.c  | 51 +
 arch/x86/kvm/x86.c  | 15 
 include/uapi/linux/kvm.h|  1 +
 5 files changed, 79 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 516798431328..79c188ae7837 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1079,6 +1079,9 @@ struct kvm_x86_ops {
int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
int (*pre_leave_smm)(struct kvm_vcpu *vcpu, u64 smbase);
int (*enable_smi_window)(struct kvm_vcpu *vcpu);
+
+   int (*enable_enlightened_vmcs)(struct kvm_vcpu *vcpu,
+  uint16_t *vmcs_version);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index eb714f1cdf7e..6dc28d53bb89 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5505,6 +5505,13 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static int enable_enlightened_vmcs(struct kvm_vcpu *vcpu,
+  uint16_t *vmcs_version)
+{
+   /* Intel-only feature */
+   return -ENODEV;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
@@ -5620,6 +5627,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.pre_enter_smm = svm_pre_enter_smm,
.pre_leave_smm = svm_pre_leave_smm,
.enable_smi_window = enable_smi_window,
+
+   .enable_enlightened_vmcs = enable_enlightened_vmcs,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f3215b6a0531..320bb6670413 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -464,6 +464,8 @@ struct __packed vmcs12 {
  */
 #define VMCS12_SIZE 0x1000
 
+#define ENLIGHTENED_VMCS_VERSION (1 | (1u << 8))
+
 /* Used to remember the last vmcs02 used for some recently used vmcs12s */
 struct vmcs02_list {
struct list_head list;
@@ -495,6 +497,13 @@ struct nested_vmx {
 */
bool sync_shadow_vmcs;
 
+   /*
+* Enlightened VMCS has been enabled. It does not mean that L1 has to
+* use it. However, VMX features available to L1 will be limited based
+* on what the enlightened VMCS supports.
+*/
+   bool enlightened_vmcs_enabled;
+
/* vmcs02_list cache of VMCSs recently used to run L2 guests */
struct list_head vmcs02_pool;
int vmcs02_num;
@@ -12129,6 +12138,46 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static int enable_enlightened_vmcs(struct kvm_vcpu *vcpu,
+  uint16_t *vmcs_version)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* We don't support disabling the feature for simplicity. */
+   if (vmx->nested.enlightened_vmcs_enabled)
+   return 0;
+   vmx->nested.enlightened_vmcs_enabled = true;
+   *vmcs_version = ENLIGHTENED_VMCS_VERSION;
+
+   /*
+* Enlightened VMCS doesn't have the POSTED_INTR_DESC_ADDR,
+* POSTED_INTR_NV, VMX_PREEMPTION_TIMER_VALUE,
+* GUEST_IA32_PERF_GLOBAL_CTRL, and HOST_IA32_PERF_GLOBAL_CTRL
+* fields.
+*/
+   vmx->nested.nested_vmx_pinbased_ctls_high &=
+   ~(PIN_BASED_POSTED_INTR |
+ PIN_BASED_VMX_PREEMPTION_TIMER);
+   vmx->nested.nested_vmx_entry_ctls_high &=
+   ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+   vmx->nested.nested_vmx_exit_ctls_high &=
+   ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+
+   /*
+* Enlightened VMCS doesn't have the APIC_ACCESS_ADDR,
+* EOI_EXIT_BITMAP*, GUEST_INTR_STATUS, VM_FUNCTION_CONTROL,
+* EPTP_LIST_ADDRESS, PML_ADDRESS, and GUEST_PML_INDEX fields.
+*/
+   vmx->nested.nested_vmx_secondary_ctls_high &=
+   ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+ SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
+ SECONDARY_EXEC_ENABLE_VMFUNC |
+ SECONDARY_EXEC_ENABLE_PML);
+   vmx->nested.nested_vmx_vmfunc_controls &=
+   ~VMX_VMFUNC_EPTP_SWITCHING;
+   retu

[PATCH RFC 6/7] KVM: nVMX: add enlightened VMCS state

2017-12-18 Thread Vitaly Kuznetsov
From: Ladi Prosek <lpro...@redhat.com>

Adds two bool fields and implements copy_enlightened_to_vmcs12() and
copy_vmcs12_to_enlightened().

Unlike shadow VMCS, enlightened VMCS is para-virtual and active only if
the nested guest explicitly enables it. The pattern repeating itself a
few times throughout this patch:

  if (vmx->nested.enlightened_vmcs_active) {
  /* enlightened! */
  } else if (enable_shadow_vmcs) {
  /* fall-back */
  }

reflects this. If the nested guest elects to not use enlightened VMCS,
the regular HW-assisted shadow VMCS feature is used, if enabled.
enlightened_vmcs_active is never going to be true if
enlightened_vmcs_enabled is not set.

Signed-off-by: Ladi Prosek <lpro...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/kvm/vmx.c | 60 ++
 1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 320bb6670413..00b4a362351d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -503,6 +503,16 @@ struct nested_vmx {
 * on what the enlightened VMCS supports.
 */
bool enlightened_vmcs_enabled;
+   /*
+* Indicates that the nested hypervisor performed the last vmentry with
+* a Hyper-V enlightened VMCS.
+*/
+   bool enlightened_vmcs_active;
+
+   /*
+* Indicates that the enlightened VMCS must be synced with vmcs12
+*/
+   bool sync_enlightened_vmcs;
 
/* vmcs02_list cache of VMCSs recently used to run L2 guests */
struct list_head vmcs02_pool;
@@ -991,6 +1001,7 @@ static void vmx_get_segment(struct kvm_vcpu *vcpu,
 static bool guest_state_valid(struct kvm_vcpu *vcpu);
 static u32 vmx_segment_access_rights(struct kvm_segment *var);
 static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
+static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx);
 static bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
 static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
 static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
@@ -7455,7 +7466,10 @@ static inline void nested_release_vmcs12(struct vcpu_vmx *vmx)
if (vmx->nested.current_vmptr == -1ull)
return;
 
-   if (enable_shadow_vmcs) {
+   if (vmx->nested.enlightened_vmcs_active) {
+   copy_enlightened_to_vmcs12(vmx);
+   vmx->nested.sync_enlightened_vmcs = false;
+   } else if (enable_shadow_vmcs) {
/* copy to memory all shadowed fields in case
   they were modified */
copy_shadow_to_vmcs12(vmx);
@@ -7642,6 +7656,20 @@ static inline int vmcs12_write_any(struct kvm_vcpu *vcpu,
 
 }
 
+static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
+{
+   kvm_vcpu_read_guest_page(&vmx->vcpu,
+vmx->nested.current_vmptr >> PAGE_SHIFT,
+vmx->nested.cached_vmcs12, 0, VMCS12_SIZE);
+}
+
+static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
+{
+   kvm_vcpu_write_guest_page(&vmx->vcpu,
+ vmx->nested.current_vmptr >> PAGE_SHIFT,
+ vmx->nested.cached_vmcs12, 0, VMCS12_SIZE);
+}
+
 static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
 {
int i;
@@ -7841,7 +7869,9 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
 static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr)
 {
vmx->nested.current_vmptr = vmptr;
-   if (enable_shadow_vmcs) {
+   if (vmx->nested.enlightened_vmcs_active) {
+   vmx->nested.sync_enlightened_vmcs = true;
+   } else if (enable_shadow_vmcs) {
vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
  SECONDARY_EXEC_SHADOW_VMCS);
vmcs_write64(VMCS_LINK_POINTER,
@@ -9396,7 +9426,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmcs_write32(PLE_WINDOW, vmx->ple_window);
}
 
-   if (vmx->nested.sync_shadow_vmcs) {
+   if (vmx->nested.sync_enlightened_vmcs) {
+   copy_vmcs12_to_enlightened(vmx);
+   vmx->nested.sync_enlightened_vmcs = false;
+   } else if (vmx->nested.sync_shadow_vmcs) {
copy_vmcs12_to_shadow(vmx);
vmx->nested.sync_shadow_vmcs = false;
}
@@ -11017,7 +11050,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 
vmcs12 = get_vmcs12(vcpu);
 
-   if (enable_shadow_vmcs)
+   if (vmx->nested.enlightened_vmcs_active)
+   copy_enlightened_to_vmcs12(vmx);
+   else if (enable_shadow_vmcs)
copy_shadow_to_vmcs12(vmx);
 
/*
@@ -11634,8 +11669,12 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 */
kvm_make_request(KVM_REQ_APIC_PAGE

[PATCH RFC 7/7] KVM: nVMX: implement enlightened VMPTRLD

2017-12-18 Thread Vitaly Kuznetsov
From: Ladi Prosek <lpro...@redhat.com>

Per Hyper-V TLFS 5.0b:

"The L1 hypervisor may choose to use enlightened VMCSs by writing 1 to
the corresponding field in the VP assist page (see section 7.8.7).
Another field in the VP assist page controls the currently active
enlightened VMCS. Each enlightened VMCS is exactly one page (4 KB) in
size and must be initially zeroed. No VMPTRLD instruction must be
executed to make an enlightened VMCS active or current.

After the L1 hypervisor performs a VM entry with an enlightened VMCS,
the VMCS is considered active on the processor. An enlightened VMCS
can only be active on a single processor at the same time. The L1
hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state. Any VMREAD
or VMWRITE instructions while an enlightened VMCS is active is
unsupported and can result in unexpected behavior."

Note that we choose to not modify our VMREAD, VMWRITE, and VMPTRLD
handlers. They will not cause any explicit failure but may not have
the intended effect.

Signed-off-by: Ladi Prosek <lpro...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/kvm/vmx.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 00b4a362351d..f7f6f7d18ade 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -20,6 +20,7 @@
 #include "mmu.h"
 #include "cpuid.h"
 #include "lapic.h"
+#include "hyperv.h"
 
 #include 
 #include 
@@ -7935,6 +7936,30 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
 }
 
+static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct hv_vp_assist_page assist_page;
+
+   if (!vmx->nested.enlightened_vmcs_enabled)
+   return 1;
+
+   vmx->nested.enlightened_vmcs_active =
+   kvm_hv_get_assist_page(vcpu, &assist_page) &&
+   assist_page.enlighten_vmentry;
+
+   if (vmx->nested.enlightened_vmcs_active &&
+   assist_page.current_nested_vmcs != vmx->nested.current_vmptr) {
+   /*
+* This is an equivalent of the nested hypervisor executing
+* the vmptrld instruction.
+*/
+   set_current_vmptr(vmx, assist_page.current_nested_vmcs);
+   copy_enlightened_to_vmcs12(vmx);
+   }
+   return 1;
+}
+
 /* Emulate the VMPTRST instruction */
 static int handle_vmptrst(struct kvm_vcpu *vcpu)
 {
@@ -11045,6 +11070,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
if (!nested_vmx_check_permission(vcpu))
return 1;
 
+   if (!nested_vmx_handle_enlightened_vmptrld(vcpu))
+   return 1;
+
if (!nested_vmx_check_vmcs12(vcpu))
goto out;
 
-- 
2.14.3

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 3/7] KVM: nVMX: add I/O exit ECX, ESI, EDI, EIP vmcs12 fields

2017-12-18 Thread Vitaly Kuznetsov
From: Ladi Prosek <lpro...@redhat.com>

These non-synthetic VMCS fields were not supported by KVM thus far. The
layout is according to Hyper-V TLFS 5.0b, the physical encoding according
to the Intel SDM.

Signed-off-by: Ladi Prosek <lpro...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 arch/x86/include/asm/vmx.h | 4 ++++
 arch/x86/kvm/vmx.c | 9 ++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8b6780751132..92a10aa839e6 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -298,6 +298,10 @@ enum vmcs_field {
CR3_TARGET_VALUE2   = 0x600c,
CR3_TARGET_VALUE3   = 0x600e,
EXIT_QUALIFICATION  = 0x6400,
+   EXIT_IO_INSTR_ECX   = 0x6402,
+   EXIT_IO_INSTR_ESI   = 0x6404,
+   EXIT_IO_INSTR_EDI   = 0x6406,
+   EXIT_IO_INSTR_EIP   = 0x6408,
GUEST_LINEAR_ADDRESS= 0x640a,
GUEST_CR0   = 0x6800,
GUEST_CR3   = 0x6802,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cd5f29a57880..f3215b6a0531 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -391,7 +391,10 @@ struct __packed vmcs12 {
u32 vmx_instruction_info;
 
natural_width exit_qualification;
-   natural_width padding64_3[4];
+   natural_width exit_io_instr_ecx;
+   natural_width exit_io_instr_esi;
+   natural_width exit_io_instr_edi;
+   natural_width exit_io_instr_eip;
 
natural_width guest_linear_address;
natural_width guest_rsp;
@@ -913,6 +916,10 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD(CR3_TARGET_VALUE2, cr3_target_value2),
FIELD(CR3_TARGET_VALUE3, cr3_target_value3),
FIELD(EXIT_QUALIFICATION, exit_qualification),
+   FIELD(EXIT_IO_INSTR_ECX, exit_io_instr_ecx),
+   FIELD(EXIT_IO_INSTR_ESI, exit_io_instr_esi),
+   FIELD(EXIT_IO_INSTR_EDI, exit_io_instr_edi),
+   FIELD(EXIT_IO_INSTR_EIP, exit_io_instr_eip),
FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address),
FIELD(GUEST_CR0, guest_cr0),
FIELD(GUEST_CR3, guest_cr3),
-- 
2.14.3



[PATCH RFC 2/7] KVM: nVMX: modify vmcs12 fields to match Hyper-V enlightened VMCS

2017-12-18 Thread Vitaly Kuznetsov
From: Ladi Prosek <lpro...@redhat.com>

Reorders existing fields and adds fields specific to Hyper-V. The layout
now matches Hyper-V TLFS 5.0b 16.11.2 Enlightened VMCS.

Fields used by KVM but missing from Hyper-V are placed in the second half
of the VMCS page to minimize the chances they will clash with future
enlightened VMCS versions.

Signed-off-by: Ladi Prosek <lpro...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
[Vitaly]: Update VMCS12_REVISION to some new arbitrary number.
---
 arch/x86/kvm/vmx.c | 321 +++--
 1 file changed, 187 insertions(+), 134 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8eba631c4dbd..cd5f29a57880 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -239,159 +239,212 @@ struct __packed vmcs12 {
u32 revision_id;
u32 abort;
 
+   union {
+   u64 hv_vmcs[255];
+   struct {
+   u16 host_es_selector;
+   u16 host_cs_selector;
+   u16 host_ss_selector;
+   u16 host_ds_selector;
+   u16 host_fs_selector;
+   u16 host_gs_selector;
+   u16 host_tr_selector;
+
+   u64 host_ia32_pat;
+   u64 host_ia32_efer;
+
+   /*
+* To allow migration of L1 (complete with its L2
+* guests) between machines of different natural widths
+* (32 or 64 bit), we cannot have unsigned long fields
+* with no explicit size. We use u64 (aliased
+* natural_width) instead. Luckily, x86 is
+* little-endian.
+*/
+   natural_width host_cr0;
+   natural_width host_cr3;
+   natural_width host_cr4;
+
+   natural_width host_ia32_sysenter_esp;
+   natural_width host_ia32_sysenter_eip;
+   natural_width host_rip;
+   u32 host_ia32_sysenter_cs;
+
+   u32 pin_based_vm_exec_control;
+   u32 vm_exit_controls;
+   u32 secondary_vm_exec_control;
+
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+
+   u16 guest_es_selector;
+   u16 guest_cs_selector;
+   u16 guest_ss_selector;
+   u16 guest_ds_selector;
+   u16 guest_fs_selector;
+   u16 guest_gs_selector;
+   u16 guest_ldtr_selector;
+   u16 guest_tr_selector;
+
+   u32 guest_es_limit;
+   u32 guest_cs_limit;
+   u32 guest_ss_limit;
+   u32 guest_ds_limit;
+   u32 guest_fs_limit;
+   u32 guest_gs_limit;
+   u32 guest_ldtr_limit;
+   u32 guest_tr_limit;
+   u32 guest_gdtr_limit;
+   u32 guest_idtr_limit;
+
+   u32 guest_es_ar_bytes;
+   u32 guest_cs_ar_bytes;
+   u32 guest_ss_ar_bytes;
+   u32 guest_ds_ar_bytes;
+   u32 guest_fs_ar_bytes;
+   u32 guest_gs_ar_bytes;
+   u32 guest_ldtr_ar_bytes;
+   u32 guest_tr_ar_bytes;
+
+   natural_width guest_es_base;
+   natural_width guest_cs_base;
+   natural_width guest_ss_base;
+   natural_width guest_ds_base;
+   natural_width guest_fs_base;
+   natural_width guest_gs_base;
+   natural_width guest_ldtr_base;
+   natural_width guest_tr_base;
+   natural_width guest_gdtr_base;
+   natural_width guest_idtr_base;
+
+   u64 padding64_1[3];
+
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+
+   natural_width cr3_target_value0;
+   natural_width cr3_target_value1;
+   natural_width cr3_target_value2;
+   natural_width cr3_target_value3;
+
+   u32 page_fault_error_code_mask;
+   u32 page_fault_error_code_match;
+
+   u32 cr3_target_count;
+   u32 vm_exit_msr_store_count;
+   u32 vm_exit_msr_load_count;
+   u32 vm_en
