> -----Original Message----- > From: Alexander Graf [mailto:ag...@suse.de] > Sent: Friday, January 04, 2013 2:11 PM > To: Bhushan Bharat-R65777 > Cc: Marcelo Tosatti; jjhe...@linux.vnet.ibm.com; Christian Borntraeger; > Anthony > Liguori; qemu-devel@nongnu.org qemu-devel > Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix > do_kvm_cpu_synchronize_state data integrity issue > > > On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote: > > > > > > >> -----Original Message----- > >> From: Marcelo Tosatti [mailto:mtosa...@redhat.com] > >> Sent: Friday, January 04, 2013 7:08 AM > >> To: Alexander Graf > >> Cc: jjhe...@linux.vnet.ibm.com; Christian Borntraeger; Anthony > >> Liguori; qemu- de...@nongnu.org qemu-devel; Bhushan Bharat-R65777 > >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix > >> do_kvm_cpu_synchronize_state data integrity issue > >> > >> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote: > >>> > >>> On 03.01.2013, at 19:48, Jason J. Herne wrote: > >>> > >>>> On 01/03/2013 08:56 AM, Alexander Graf wrote: > >>>>>> static void do_kvm_cpu_synchronize_state(void *_args) > >>>>>>> { > >>>>>>> struct kvm_cpu_syncstate_args *args = _args; > >>>>>>> + CPUArchState *env = args->env; > >>>>>>> + int register_level = args->register_level; > >>>>>>> > >>>>> This probably becomes more readable if we explicitly revert back > >>>>> to > >> unsynced state first: > >>>>> > >>>>> /* Write back local modifications at our current level */ if > >>>>> (register_level > env->kvm_vcpu_dirty) { > >>>>> kvm_arch_put_registers(...); > >>>>> env->kvm_vcpu_dirty = 0; > >>>>> } > >>>>> > >>>>> and then do the sync we are requested to do: > >>>>> > >>>>> if (!env->kvm_vcpu_dirty) { > >>>>> ... > >>>>> } > >>>> > >>>> I agree, but only if we add a second conditional to the if 1st > >>>> statement as > >> such: > >>>> > >>>> if (args->env->kvm_vcpu_dirty && register_level > > >>>> env->kvm_vcpu_dirty) > >>>> > >>>> This is to cover the case where the caller is asking for register level > >>>> "1" > >> and we're already dirty at level "2". In this case, nothing should > >> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the > case. > >>> > >>> As before, I'd prefer to make this explicit: > >>> > >>>> > >>>> static void do_kvm_cpu_synchronize_state(void *_args) { > >>>> struct kvm_cpu_syncstate_args *args = _args; > >>>> CPUArchState *env = args->env; > >>>> int register_level = args->register_level; > >>> > >>> if (register_level > env->kvm_vcpu_dirty) { > >>> /* We are more dirty than we need to - all is well */ > >>> return; > >>> } > >>> > >>>> > >>>> /* Write back local modifications at our current level */ > >>>> if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) > >>>> { > >>>> kvm_arch_put_registers(env, env->kvm_vcpu_dirty); > >>>> env->kvm_vcpu_dirty = 0; > >>>> } > >>>> > >>>> if (!args->env->kvm_vcpu_dirty) { > >>>> kvm_arch_get_registers(env, register_level); > >>>> env->kvm_vcpu_dirty = register_level; > >>>> } > >>>> } > >>>> > >>>> Do you agree? Thanks for your time. :) > >>> > >>> Please also check out the discussions I've had with Bharat about his > >>> watchdog > >> patches. There we need a mechanism to synchronize registers only when > >> we actually need to, in order to avoid potential race conditions with > >> a kernel timer. > >>> > >>> That particular case doesn't work well with levels. We can have > >>> multiple > >> different potential race producers in the kernel that we need to > >> avoid individually, so we can't always synchronize all of them when > >> only one of them needs to be synchronized. > >>> > >>> The big question is what we should be doing about this. We basically > >>> have 3 > >> options: > >>> > >>> * implement levels, treat racy registers as "manually > >>> synchronized", as > >> Bharat's latest patch set does > >>> * implement levels, add a bitmap for additional special > >>> synchronization bits > >>> * replace levels by bitmap > >>> > >>> I'm quite frankly not sure which one of the 3 would be the best way > >>> forward. > >>> > >>> > >>> Alex > >> > >> Implement read/write accessors for individual registers, similarly to > >> how its done in the kernel. See kvm_cache_regs.h. > > > > Read/write for individual registers can be heavy for cases where multiple > register needed to be updated. Is not it? > > It depends. We can still have classes of registers to synchronize that we > could > even synchronize using MANY_REG. For writes, things become even faster with > individual accessors since we can easily coalesce all register writes until we > hit the vcpu again into a MANY_REG array and write them in one go. > > > > > For cases where multiple register needed to be synchronized then I would > > like > to elaborate the options as: > > > > Option 1: > > int timer_func(CPUArchState env) > > { > > cpu_synchronize_state(env); > > //update env->timer_registers > > env->updated_regs_type |= TIMER_REGS_TYPE_BIT; > > To scale this one, it's probably also more clever to swap the logic: > > env->kvm_sync_extra |= SYNC_TIMER_REGS; > cpu_synchronize_state(env); /* gets timer regs */ > /* update env->timer_registers */ > /* timer regs will be synchronized later because kvm_sync_extra has > SYNC_TIMER_REGS set */ > > Your case right now is special in that we don't need to get any register > state, > but only write it. However, if we do > > cpu_synchronize_state(env); /* will not get timer regs */ > env->kvm_sync_extra |= SYNC_TIMER_REGS; > > then the next > > cpu_synchronize_state(env); > > would overwrite the timer values again. So we would also need to indicate that > the bits are dirty: > > cpu_synchronize_state(env); /* will not get timer regs */ > env->kvm_sync_extra |= SYNC_TIMER_REGS; > env->kvm_sync_dirty |= SYNC_TIMER_REGS; > > which cpu_synchronize_state() could use to make sure it doesn't read back any > timer regs again.
yes > It would obviously also have to set it in its normal operation > when it executed first ;). We also need get all registers like for debugging and we should have flag like ALL_REGS etc. But I general I agree with the idea. > > > } > > > > Change the kvm_arch_put_registers() to add followings: > > > > int kvm_arch_put_registers(CPUArchState *env, int level) { > > > > If (env->updated_regs_type) { > > struct kvm_sregs sregs; > > If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) { > > // update sregs->timer_registers; > > // may be we can have a bitmap to tell kernel for what > actually updated > > } > > If (env->updated_regs_type & XX_CPU_REGS_TYPE) { > > // update respective registers in sregs > > } > > > > ioctl(env, KVM_GET_SREGS, &sregs); > > } > > } > > > > This mechanism will get all registers state but this is required only once > > per > entry in QEMU. > > I don't understand this bit > > > > > > > Option 2: > > Define read_regs_type() and write_regs_type() for cases where it requires > multiple registers updates: > > > > int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) { > > -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. > > > > Switch(reg_type) { > > Case TIMER_REGS: > > Struct *timer_regs = regs_ptr; > > Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type); > > Break; > > > > Default: > > Printf(error); > > } > > > > Similarly will be the write function: > > int write_regs_type((CPUArchState env, int reg_type) { > > -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. > > > > Switch(reg_type) { > > Case TIMER_REGS: > > Struct timer_regs; > > Timer_regs.reg1 = env->timer_regs->reg1; > > Timer_regs.reg2 = env->timer_regs->reg2; > > Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type); > > Break; > > > > Default: > > Printf(error); > > } > > > > Int timer_func(CPUxxState env) > > { > > Struct timer_regs; > > read_regs_type((env, &timer_regs,TIMER_REGS); > > //update env->timer_registers > > Write_regs_type(env, TIMER_REGS) > > } > > > > This will get one type of register_types and can cause multiple IOCTL per > entry to QEMU. > > This is still too explicit. How about > > static inline void ppc_set_tsr(CPUState *env, target_ulong val) { > env->kvm_sync_extra |= SYNC_TIMER_REGS; > cpu_synchronize_registers(env); > env->spr[SPR_TSR] = val; You also want env->kvm_sync_dirty also, right? > } > > static inline void ppc_set_tcr(CPUState *env, target_ulong val) { > env->kvm_sync_extra |= SYNC_TIMER_REGS; > cpu_synchronize_registers(env); > env->spr[SPR_TCR] = val; Same as above > } > > int timer_func(CPUState *env) { > ppc_set_tsr(env, 0); SYNC_TIMERS_REGS are get only once on first call. > ppc_set_tcr(env, 0); > } > > The main benefit from this method is that we can do a simple search/replace to > get rid of essentially all synchronization pitfalls where we might miss out on > some code path that then doesn't synchronize its register set. > > > ------- > > > > Keep on using GET/SET_ONE_REG when only one registers needed to be updated. > > We can always use ONE_REG for any of the variants. > > The main question I have is whether it makes sense to move from > > /* state subset only touched by the VCPU itself during runtime */ > #define KVM_PUT_RUNTIME_STATE 1 > /* state subset modified during VCPU reset */ > #define KVM_PUT_RESET_STATE 2 > /* full state set, modified during initialization or on vmload */ > #define KVM_PUT_FULL_STATE 3 > > to a bitmap. So we would always only synchronize KVM_PUT_RUNTIME_STATE, but > keep > a bitmap as explained above around for any state that is more elaborate. For > easy conversion, we could even add bits for SYNC_RESET and SYNC_FULL = > (SYNC_RESET | SYNC_FULL_REAL). Yes, that looks better to me . -Bharat > > That should get rid of all the level based logic, giving us a lot more > flexibility on what we want to synchronize. > > > Alex >