On Tue, Jan 26, 2016 at 10:31:19PM +0000, Mark Cave-Ayland wrote:
> On 25/01/16 11:10, David Gibson wrote:
>
> > Um.. so the migration duration is a complete red herring, regardless
> > of the units.
> >
> > Remember, we only ever compute the guest timebase value at the moment
> > the guest requests it - actually maintaining a current timebase value
> > makes sense in hardware, but would be nuts in software.
> >
> > The timebase is a function of real, wall-clock time, and the migration
> > destination has a notion of wall-clock time without reference to the
> > source.
> >
> > So what you need to transmit for migration is enough information to
> > compute the guest timebase from real-time - essentially just an offset
> > between real-time and the timebase.
> >
> > The guest can potentially observe the migration duration as a jump in
> > timebase values, but qemu doesn't need to do any calculations with it.
>
> Thanks for more pointers - I think I'm slowly getting there. My current
> thoughts are that the basic migration algorithm is doing the right thing
> in that it works out the number of host ticks different between source
> and destination.
Sorry, I've taken a while to reply to this.  I realised the tb
migration didn't work the way I thought it did, so I've had to get my
head around what's actually going on.

I had thought that it transferred only meta-information telling the
destination how to calculate the timebase, without actually working
out the timebase value at any particular moment.  In fact, what it
sends is basically the tuple of (timebase, realtime) at the point of
sending the migration stream.  The destination then uses that to work
out how to compute the timebase from realtime there.  I'm not
convinced this is a great approach, but it should basically work.
However, as you've seen, there are also some Just Plain Bugs in the
logic for this.

> I have a slight query with this section of code though:
>
>     migration_duration_tb = muldiv64(migration_duration_ns, freq,
>                                      NANOSECONDS_PER_SECOND);
>
> This is not technically correct on TCG x86 since the timebase is the x86
> TSC which is running somewhere in the GHz range, compared to freq which
> is hard-coded to 16MHz.

Um.. what?  AFAICT that line doesn't have any reference to the TSC
speed - just ns and the (guest) tb.

Also, 16MHz is only for the oldworld Macs - modern ppc cpus have the
TB frequency architected as 512MHz.

> However this doesn't seem to matter because the
> timebase adjustment is limited to a maximum of 1s. Why should this be if
> the timebase is supposed to be free running as you mentioned in a
> previous email?

AFAICT, what it's doing here is assuming that if the migration
duration is >1s (or appears to be >1s) then it's because the host
clocks are out of sync, and so just capping the elapsed tb time at 1s.

That's just wrong, IMO.  1s is a long downtime for a live migration,
but it's not impossible, and it will happen nearly always in the
scenario you've discussed of manually loading the migration stream
from a file.

But more to the point, trying to maintain correctness of the timebase
when the hosts are out of sync is basically futile.
There's no other reference we can use, so all we can achieve is
getting a different wrong value from what we'd get by blindly trusting
the host clock.

We do need to constrain the tb from going backwards, because that will
cause chaos in the guest, but otherwise we should just trust the host
clock and ditch that 1s clamp.  If the hosts are out of sync, then
guest time will jump, but that was always going to happen.

> AFAICT the main problem on TCG x86 is that post-migration the timebase
> calculated by cpu_ppc_get_tb() is incorrect:
>
>     uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
>     {
>         /* TB time in tb periods */
>         return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) +
>                tb_offset;
>     }

So the problem here is that get_ticks_per_sec() (which always returns
1,000,000,000) is not talking about the same ticks as
cpu_get_host_ticks().  That may not have been true when this code was
written.

> For a typical savevm/loadvm pair I see something like this:
>
> savevm:
>
>     tb->guest_timebase = 26281306490558
>     qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
>
> loadvm:
>
>     cpu_get_host_ticks() = 26289847005259
>     tb_off_adj = -8540514701
>     qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
>     cpu_ppc_get_tb() = -15785159386
>
> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
> a negative number for the timebase since the virtual clock is dwarfed by
> the number of TSC ticks calculated for tb_off_adj. This will work on a
> PPC host though since cpu_get_host_ticks() is also derived from the
> timebase.

Yeah, we shouldn't be using cpu_get_host_ticks() at all - or anything
else which depends on a host frequency.  We should only be using qemu
interfaces which work in real time units (nanoseconds, usually).
> Another question I have is cpu_ppc_load_tbl():
>
>     uint64_t cpu_ppc_load_tbl (CPUPPCState *env)
>     {
>         ppc_tb_t *tb_env = env->tb_env;
>         uint64_t tb;
>
>         if (kvm_enabled()) {
>             return env->spr[SPR_TBL];
>         }
>
>         tb = cpu_ppc_get_tb(tb_env, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>                             tb_env->tb_offset);
>         LOG_TB("%s: tb %016" PRIx64 "\n", __func__, tb);
>
>         return tb;
>     }
>
> Compared with cpu_ppc_load_tbu(), it is returning uint64_t rather than
> uint32_t and doesn't appear to mask the bottom 32 bits of the timebase
> value?

That's because in 64-bit mode loading the "TBL" is actually a load of
the whole 64-bit timebase.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson