On Mon, Jul 23, 2018 at 7:11 PM, Andy Lutomirski <l...@kernel.org> wrote:
> On 07/23/2018 05:55 AM, Fenghua Yu wrote:
>>
>> Users want to query whether the user wait instructions (umonitor,
>> umwait, and tpause) are supported and then use them. The vDSO functions
>> provide a fast interface for checking that support and for calling the
>> instructions.
>>
>> waitpkg_supported and its alias __vdso_waitpkg_supported check whether
>> the user wait instructions (a.k.a. the wait package feature) are
>> supported.
>>
>> umonitor and its alias __vdso_umonitor provide user APIs for calling
>> umonitor instruction.
>>
>> umwait and its alias __vdso_umwait provide user APIs for calling
>> umwait instruction.
>>
>> tpause and its alias __vdso_tpause provide user APIs for calling
>> tpause instruction.
>>
>> nsec_to_tsc and its alias __vdso_nsec_to_tsc convert nanoseconds
>> to TSC counts if the TSC frequency is known. They fail if the TSC
>> frequency is unknown.
>>
>> The instructions may be exposed as intrinsic functions in a future
>> GCC. The vDSO interfaces, however, are available to users without
>> intrinsic support in GCC, and waitpkg_supported and nsec_to_tsc
>> cannot be implemented as GCC intrinsics.
>>
>> Signed-off-by: Fenghua Yu <fenghua...@intel.com>
>> ---
>>   arch/x86/entry/vdso/Makefile           |   2 +-
>>   arch/x86/entry/vdso/vdso.lds.S         |  10 ++
>>   arch/x86/entry/vdso/vma.c              |   9 ++
>>   arch/x86/entry/vdso/vuserwait.c        | 233 +++++++++++++++++++++++++++++++++
>>   arch/x86/include/asm/vdso_funcs_data.h |   3 +
>>   5 files changed, 256 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/x86/entry/vdso/vuserwait.c
>>
>> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
>> index af4fcae5de83..fb0062b09b3c 100644
>> --- a/arch/x86/entry/vdso/Makefile
>> +++ b/arch/x86/entry/vdso/Makefile
>> @@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32)               := y
>>   VDSO32-$(CONFIG_IA32_EMULATION)       := y
>>     # files to link into the vdso
>> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
>> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o vuserwait.o
>>     # files to link into kernel
>>   obj-y                         += vma.o
>> diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
>> index 097cdcda43a5..0942710608bf 100644
>> --- a/arch/x86/entry/vdso/vdso.lds.S
>> +++ b/arch/x86/entry/vdso/vdso.lds.S
>> @@ -35,6 +35,16 @@ VERSION {
>>                 __vdso_movdir64b_supported;
>>                 movdir64b;
>>                 __vdso_movdir64b;
>> +               waitpkg_supported;
>> +               __vdso_waitpkg_supported;
>> +               umonitor;
>> +               __vdso_umonitor;
>> +               umwait;
>> +               __vdso_umwait;
>> +               tpause;
>> +               __vdso_tpause;
>> +               nsec_to_tsc;
>> +               __vdso_nsec_to_tsc;
>>         local: *;
>>         };
>>   }
>> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
>> index edbe5e63e5c2..006dfb5e5003 100644
>> --- a/arch/x86/entry/vdso/vma.c
>> +++ b/arch/x86/entry/vdso/vma.c
>> @@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu)
>>     static void __init init_vdso_funcs_data(void)
>>   {
>> +       struct system_counterval_t sys_counterval;
>> +
>>         if (static_cpu_has(X86_FEATURE_MOVDIRI))
>>                 vdso_funcs_data.movdiri_supported = true;
>>         if (static_cpu_has(X86_FEATURE_MOVDIR64B))
>>                 vdso_funcs_data.movdir64b_supported = true;
>> +       if (static_cpu_has(X86_FEATURE_WAITPKG))
>> +               vdso_funcs_data.waitpkg_supported = true;
>> +       if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
>> +               vdso_funcs_data.tsc_known_freq = true;
>> +               sys_counterval = convert_art_ns_to_tsc(1);
>> +               vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles;
>> +       }
>
>
> You're losing a ton of precision here.  You might even be losing *all* of
> the precision and malfunctioning rather badly.
>
> The correct way to do this is:
>
> tsc_counts = ns * mul >> shift;
>
> and the vclock code illustrates it.  convert_art_ns_to_tsc() is a bad
> example because it uses an expensive division operation for no good reason
> except that no one bothered to optimize it.
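
To make the precision point concrete, here is a rough userspace sketch of
the mul/shift form next to the whole-cycles-per-nanosecond form from the
patch. Every name in it (NS_TO_TSC_SHIFT, ns_to_tsc_mult, ns_to_tsc,
ns_to_tsc_truncated) is made up for illustration, the 32-bit shift is an
arbitrary choice, and it relies on the GCC/Clang unsigned __int128
extension. It is not the vclock code, only the same idea.

```c
#include <stdint.h>

/* Hypothetical fixed-point shift; chosen for illustration only. */
#define NS_TO_TSC_SHIFT 32

/* Precompute mult = (cycles per ns) * 2^shift from a TSC frequency in kHz. */
static uint64_t ns_to_tsc_mult(uint64_t tsc_khz)
{
	/* cycles per ns = tsc_khz / 1e6; round to nearest after scaling. */
	return ((tsc_khz << NS_TO_TSC_SHIFT) + 500000) / 1000000;
}

/* tsc_counts = ns * mul >> shift, with a 128-bit intermediate product. */
static uint64_t ns_to_tsc(uint64_t ns, uint64_t mult)
{
	return (uint64_t)(((unsigned __int128)ns * mult) >> NS_TO_TSC_SHIFT);
}

/* The patch's shape: an integer number of cycles per nanosecond. */
static uint64_t ns_to_tsc_truncated(uint64_t ns, uint64_t tsc_khz)
{
	uint64_t tsc_per_nsec = tsc_khz / 1000000; /* 2.5 GHz -> 2, 800 MHz -> 0 */

	return ns * tsc_per_nsec;
}
```

With a 2.5 GHz TSC, 1000 ns comes out as 2500 counts in the mul/shift form
and 2000 in the truncated form. With an 800 MHz TSC the truncated form
returns 0 for every input, which is the "losing all of the precision" case.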
>
>> +notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc)
>> +{
>> +       if (!_vdso_funcs_data->tsc_known_freq)
>> +               return -ENODEV;
>> +
>> +       *tsc = _vdso_funcs_data->tsc_per_nsec * nsec;
>> +
>> +       return 0;
>> +}
>
>
> Please don't expose this one at all.  It would be nice for programs that use
> waitpkg to be migratable using CRIU-like tools, and this export actively
> harms any such effort.  If you omit this function, then the kernel could
> learn to abort an in-progress __vdso_umwait if preempted (rseq-style) and
> CRIU would just work.  It would be a bit of a hack, but it solves a real
> problem.
>
>> +notrace int __vdso_umwait(int state, unsigned long nsec)
>
>
> __vdso_umwait_relative(), please.  Because some day (possibly soon) someone
> will want __vdso_umwait_absolute() and its friend __vdso_read_art_ns() so
> they can do:
>
> u64 start = __vdso_read_art_ns();
> __vdso_umonitor(...);
> ... do something potentially slow or that might fault ...
> __vdso_umwait_absolute(start + timeout);
>
> Also, this patch appears to have a subtle but show-stopping race.  Consider:
>
> 1. Task A does UMONITOR on CPU 1
> 2. Task A is preempted.
> 3. Task B does UMONITOR on CPU 1 at a different address
> 4. Task A resumes
> 5. Task A does UMWAIT
>
> Now task A hangs, at least until the next external wakeup happens.
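
The five steps above can be replayed in a toy model. This assumes the
monitor is a single slot per logical CPU that nothing saves or restores
across a context switch, which is how the scenario reads; it says nothing
about what real hardware does on a context switch. All names (toy_cpu,
toy_umonitor, toy_store_wakes, toy_replay_race) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one armed address per logical CPU (an assumption). */
struct toy_cpu {
	const void *armed_addr; /* NULL means no monitor is armed */
};

/* Arming a monitor overwrites whatever was armed before. */
static void toy_umonitor(struct toy_cpu *cpu, const void *addr)
{
	cpu->armed_addr = addr;
}

/* Would a store to addr wake a waiter on this CPU? */
static bool toy_store_wakes(const struct toy_cpu *cpu, const void *addr)
{
	return cpu->armed_addr != NULL && cpu->armed_addr == addr;
}

/* Replay steps 1-5 and report whether task A's address can still wake it. */
static bool toy_replay_race(void)
{
	static int flag_a, flag_b;
	struct toy_cpu cpu1 = { .armed_addr = NULL };

	toy_umonitor(&cpu1, &flag_a);	/* 1. task A arms its address */
	/* 2. task A is preempted; the model keeps no per-task monitor state */
	toy_umonitor(&cpu1, &flag_b);	/* 3. task B arms a different address */
	/* 4-5. task A resumes and waits; is its address still monitored? */
	return toy_store_wakes(&cpu1, &flag_a);
}
```

In this model toy_replay_race() returns false: by the time task A waits,
a store to its address no longer matches the armed slot, so only something
unrelated (a timeout or an external wakeup) would end the wait.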
>
> It's not entirely clear to me how you're supposed to fix this without some
> abomination that's so bad that it torpedoes the entire feature. Except that
> there is no chicken bit to turn this thing off.  Sigh.

The UMWAIT mechanism also looks like it will work incorrectly under a
VM.  How do you (or, more generally, Intel) plan to handle that?
