From: Andrey Ryabinin <[email protected]> Make possible to read virtualized container's CLOCK_MONOTONIC time via __vdso_gettimeofday(). Record containers start time in per-ve vdso and substruct it from the host's time on clock read.
https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Konstantin Khorenko <[email protected]> khorenko@ notes: 1) effectively we store in vdso area the same ve->start_time value. If a CT has been previously running, say 5 ns, we store in ve->start_time value (now - 5), so later monotonic_abs_to_ve() returns now - (ve->start_time) == now - (now - 5) == 5 2) introduced timespec_sub_ns() function has "inline" attribute - it is fine. The stock timespec_add_ns() has "__always_inline" attribute, but the function is static, so there will be different copies of the function anyway even if the function is used in other files. 3) timespec_sub_ns() is introduced for optimization: if we use timespec_add_ns(ns)+monotonic_time_to_ve(), there will be 2 cycles of __iter_div_u64_rem(). =============================================== The original vz7 commit message (f7188f105626): ve/vdso: virtualized monotonic gettime through vdso We already have infrastructure for virtualized vdso, however we use it only to change LINUX_VERSION_NAME in container. Simply store container's start time - ve->start_timespec in vdso variable - VDSO64_ve_start_timespec, and use it in __vdso_clock_gettime() to calculate container's monotonic time. Make uts_arch_setup_additional_pages()/uts_prep_vdso_pages_locked() to always setup new vdso, since previous policy to setup vdso only if uts_ns->name.releas e wouldn't work for virtualized __vdso_clock_gettime() https://jira.sw.ru/browse/PSBM-66451 Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Dmitry Safonov <[email protected]> +++ x86_64, vclock_gettime: Use standart division instead of __iter_div_u64_rem() timespec_sub_ns() historically uses __iter_div_u64_rem() for division. Probably it's supposed to be faster /* * Iterative div/mod for use when dividend is not expected to be much * bigger than divisor. */ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder) However in our case ve_start_time may make dividend much bigger than divisor. So let's use standard "/" instead of iterative one. With 0 ve_start_time I wasn't able to see measurable difference, however with big ve_start_time the difference is rather significant: # time ./clock_iter_div real 1m30.224s user 1m30.343s sys 0m0.008s # time taskset ./clock_div real 0m2.757s user 0m1.730s sys 0m0.066s 32-bit vdso doesn't like 64-bit division and doesn't link. I think it needs __udivsi3(). So just fallback to __iter_div_u64_rem() on 32-bit. mFixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()") https://jira.sw.ru/browse/PSBM-121856 Signed-off-by: Andrey Ryabinin <[email protected]> +++ vdso, vclock_gettime: fix linking with old linkers On some old linkers vdso fails to build because of dynamic reloction of 've_start_time' symbol: VDSO2C arch/x86/entry/vdso/vdso-image-64.c Error: vdso image contains dynamic relocations I was able to figure out why new linkers doesn't generate relocation while old ones does, but I did find out that visibility("hidden") attribute on 've_start_time' cures the problem. mFixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()") https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin <[email protected]> (cherry picked from commit 0c7977348b6acaae464f64b8830204e0a0ee043c) This patch incompatible with time namespaces because it completely ignores them existance :) When we decide to step on time namespaces and set CONFIG_TIME_NS=y we have to drop this particular patch and rework userspace code to properly setup time namespace for container. Time namespaces natively handle all vdso time-related stuff. https://jira.sw.ru/browse/PSBM-131977 Signed-off-by: Alexander Mikhalitsyn <[email protected]> --- arch/x86/entry/vdso/vdso2c.c | 1 + arch/x86/include/asm/vdso.h | 1 + kernel/ve/ve.c | 14 +++++++++ lib/vdso/gettimeofday.c | 57 +++++++++++++++++++++++++++++++----- 4 files changed, 66 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index c291f7a1fa73..cce63b8e15ec 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -102,6 +102,7 @@ struct vdso_sym required_syms[] = { {"__kernel_rt_sigreturn", true}, {"int80_landing_pad", true}, {"linux_version_code", true}, + {"ve_start_time", true}, }; __attribute__((format(printf, 1, 2))) __attribute__((noreturn)) diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h index 87c5e0df1d9d..7c1fcfa8bc3a 100644 --- a/arch/x86/include/asm/vdso.h +++ b/arch/x86/include/asm/vdso.h @@ -32,6 +32,7 @@ struct vdso_image { RH_KABI_EXTEND(unsigned long extable_len) RH_KABI_EXTEND(const void *extable) long sym_linux_version_code; + long sym_ve_start_time; }; #ifdef CONFIG_X86_64 diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index dd0faf95452e..017435152689 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -646,6 +646,17 @@ void ve_rm_from_release_list(struct cgroup *cgrp) rcu_read_unlock(); } +static void ve_set_vdso_time(struct ve_struct *ve, u64 time) +{ + u64 *vdso_start_time; + + vdso_start_time = ve->vdso_64->data + ve->vdso_64->sym_ve_start_time; + *vdso_start_time = time; + + vdso_start_time = ve->vdso_32->data + ve->vdso_32->sym_ve_start_time; + *vdso_start_time = time; +} + /* under ve->op_sem write-lock */ static int ve_start_container(struct ve_struct *ve) { @@ -680,6 +691,8 @@ static int ve_start_container(struct ve_struct *ve) if (ve->start_time == 0) { ve->start_time = tsk->start_time; ve->real_start_time = tsk->real_start_time; + + ve_set_vdso_time(ve, ve->start_time); } /* The value is wrong, but it is never compared to process * start times */ @@ -1409,6 +1422,7 @@ static ssize_t ve_ts_write(struct kernfs_open_file *of, char *buf, case VE_CF_CLOCK_MONOTONIC: now = ktime_get_ns(); target = &ve->start_time; + ve_set_vdso_time(ve, now - delta_ns); break; case VE_CF_CLOCK_BOOTBASED: now = ktime_get_boot_ns(); diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index c8cfdc6bd2f5..7485508c9cc7 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -6,6 +6,8 @@ #include <vdso/datapage.h> #include <vdso/helpers.h> +u64 ve_start_time __attribute__((visibility("hidden"))); + #ifndef vdso_calc_delta /* * Default implementation which works for all sane clocksources. That @@ -46,6 +48,35 @@ static inline bool vdso_cycles_ok(u64 cycles) } #endif +static inline u64 divu64(u64 dividend, u32 divisor, u64 *remainder) +{ + /* 32-bit wants __udivsi3() and fails to link, so fallback to iter */ +#ifndef BUILD_VDSO32 + u64 res; + + res = dividend/divisor; + *remainder = dividend % divisor; + return res; +#else + return __iter_div_u64_rem(dividend, divisor, remainder); +#endif +} + +static inline void timespec_sub_ns(struct __kernel_timespec *ts, u64 ns) +{ + if ((s64)ns <= 0) { + ts->tv_sec += divu64(-ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + } else { + ts->tv_sec -= divu64(ns, NSEC_PER_SEC, &ns); + if (ns) { + ts->tv_sec--; + ns = NSEC_PER_SEC - ns; + } + ts->tv_nsec = ns; + } +} + #ifdef CONFIG_TIME_NS static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, struct __kernel_timespec *ts) @@ -149,12 +180,17 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, sec = vdso_ts->sec; } while (unlikely(vdso_read_retry(vd, seq))); - /* - * Do this outside the loop: a race inside the loop could result - * in __iter_div_u64_rem() being extremely slow. - */ - ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); - ts->tv_nsec = ns; + if (clk != CLOCK_MONOTONIC) { + /* + * Do this outside the loop: a race inside the loop could result + * in __iter_div_u64_rem() being extremely slow. + */ + ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + } else { + ts->tv_sec = sec; + timespec_sub_ns(ts, ve_start_time - ns); + } return 0; } @@ -201,6 +237,7 @@ static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk, { const struct vdso_timestamp *vdso_ts = &vd->basetime[clk]; u32 seq; + u64 ns; do { /* @@ -216,9 +253,15 @@ static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk, smp_rmb(); ts->tv_sec = vdso_ts->sec; - ts->tv_nsec = vdso_ts->nsec; + if (clk != CLOCK_MONOTONIC_COARSE) + ts->tv_nsec = vdso_ts->nsec; + else + ns = vdso_ts->nsec; } while (unlikely(vdso_read_retry(vd, seq))); + if (clk == CLOCK_MONOTONIC_COARSE) + timespec_sub_ns(ts, ve_start_time - ns); + return 0; } -- 2.28.0 _______________________________________________ Devel mailing list [email protected] https://lists.openvz.org/mailman/listinfo/devel
