On 04.01.2013, at 11:23, Bhushan Bharat-R65777 wrote: > > >> -----Original Message----- >> From: Alexander Graf [mailto:ag...@suse.de] >> Sent: Friday, January 04, 2013 2:11 PM >> To: Bhushan Bharat-R65777 >> Cc: Marcelo Tosatti; jjhe...@linux.vnet.ibm.com; Christian Borntraeger; >> Anthony >> Liguori; qemu-devel@nongnu.org qemu-devel >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix >> do_kvm_cpu_synchronize_state data integrity issue >> >> >> On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote: >> >>> >>> >>>> -----Original Message----- >>>> From: Marcelo Tosatti [mailto:mtosa...@redhat.com] >>>> Sent: Friday, January 04, 2013 7:08 AM >>>> To: Alexander Graf >>>> Cc: jjhe...@linux.vnet.ibm.com; Christian Borntraeger; Anthony >>>> Liguori; qemu- de...@nongnu.org qemu-devel; Bhushan Bharat-R65777 >>>> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix >>>> do_kvm_cpu_synchronize_state data integrity issue >>>> >>>> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote: >>>>> >>>>> On 03.01.2013, at 19:48, Jason J. Herne wrote: >>>>> >>>>>> On 01/03/2013 08:56 AM, Alexander Graf wrote: >>>>>>>> static void do_kvm_cpu_synchronize_state(void *_args) >>>>>>>>> { >>>>>>>>> struct kvm_cpu_syncstate_args *args = _args; >>>>>>>>> + CPUArchState *env = args->env; >>>>>>>>> + int register_level = args->register_level; >>>>>>>>> >>>>>>> This probably becomes more readable if we explicitly revert back >>>>>>> to >>>> unsynced state first: >>>>>>> >>>>>>> /* Write back local modifications at our current level */ if >>>>>>> (register_level > env->kvm_vcpu_dirty) { >>>>>>> kvm_arch_put_registers(...); >>>>>>> env->kvm_vcpu_dirty = 0; >>>>>>> } >>>>>>> >>>>>>> and then do the sync we are requested to do: >>>>>>> >>>>>>> if (!env->kvm_vcpu_dirty) { >>>>>>> ... >>>>>>> } >>>>>> >>>>>> I agree, but only if we add a second conditional to the if 1st >>>>>> statement as >>>> such: >>>>>> >>>>>> if (args->env->kvm_vcpu_dirty && register_level > >>>>>> env->kvm_vcpu_dirty) >>>>>> >>>>>> This is to cover the case where the caller is asking for register level >>>>>> "1" >>>> and we're already dirty at level "2". In this case, nothing should >>>> happen and we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the >> case. >>>>> >>>>> As before, I'd prefer to make this explicit: >>>>> >>>>>> >>>>>> static void do_kvm_cpu_synchronize_state(void *_args) { >>>>>> struct kvm_cpu_syncstate_args *args = _args; >>>>>> CPUArchState *env = args->env; >>>>>> int register_level = args->register_level; >>>>> >>>>> if (register_level > env->kvm_vcpu_dirty) { >>>>> /* We are more dirty than we need to - all is well */ >>>>> return; >>>>> } >>>>> >>>>>> >>>>>> /* Write back local modifications at our current level */ >>>>>> if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) { >>>>>> kvm_arch_put_registers(env, env->kvm_vcpu_dirty); >>>>>> env->kvm_vcpu_dirty = 0; >>>>>> } >>>>>> >>>>>> if (!args->env->kvm_vcpu_dirty) { >>>>>> kvm_arch_get_registers(env, register_level); >>>>>> env->kvm_vcpu_dirty = register_level; >>>>>> } >>>>>> } >>>>>> >>>>>> Do you agree? Thanks for your time. :) >>>>> >>>>> Please also check out the discussions I've had with Bharat about his >>>>> watchdog >>>> patches. There we need a mechanism to synchronize registers only when >>>> we actually need to, in order to avoid potential race conditions with >>>> a kernel timer. >>>>> >>>>> That particular case doesn't work well with levels. We can have >>>>> multiple >>>> different potential race producers in the kernel that we need to >>>> avoid individually, so we can't always synchronize all of them when >>>> only one of them needs to be synchronized. >>>>> >>>>> The big question is what we should be doing about this. We basically >>>>> have 3 >>>> options: >>>>> >>>>> * implement levels, treat racy registers as "manually >>>>> synchronized", as >>>> Bharat's latest patch set does >>>>> * implement levels, add a bitmap for additional special >>>>> synchronization bits >>>>> * replace levels by bitmap >>>>> >>>>> I'm quite frankly not sure which one of the 3 would be the best way >>>>> forward. >>>>> >>>>> >>>>> Alex >>>> >>>> Implement read/write accessors for individual registers, similarly to >>>> how its done in the kernel. See kvm_cache_regs.h. >>> >>> Read/write for individual registers can be heavy for cases where multiple >> register needed to be updated. Is not it? >> >> It depends. We can still have classes of registers to synchronize that we >> could >> even synchronize using MANY_REG. For writes, things become even faster with >> individual accessors since we can easily coalesce all register writes until >> we >> hit the vcpu again into a MANY_REG array and write them in one go. >> >>> >>> For cases where multiple register needed to be synchronized then I would >>> like >> to elaborate the options as: >>> >>> Option 1: >>> int timer_func(CPUArchState env) >>> { >>> cpu_synchronize_state(env); >>> //update env->timer_registers >>> env->updated_regs_type |= TIMER_REGS_TYPE_BIT; >> >> To scale this one, it's probably also more clever to swap the logic: >> >> env->kvm_sync_extra |= SYNC_TIMER_REGS; >> cpu_synchronize_state(env); /* gets timer regs */ >> /* update env->timer_registers */ >> /* timer regs will be synchronized later because kvm_sync_extra has >> SYNC_TIMER_REGS set */ >> >> Your case right now is special in that we don't need to get any register >> state, >> but only write it. However, if we do >> >> cpu_synchronize_state(env); /* will not get timer regs */ >> env->kvm_sync_extra |= SYNC_TIMER_REGS; >> >> then the next >> >> cpu_synchronize_state(env); >> >> would overwrite the timer values again. So we would also need to indicate >> that >> the bits are dirty: >> >> cpu_synchronize_state(env); /* will not get timer regs */ >> env->kvm_sync_extra |= SYNC_TIMER_REGS; >> env->kvm_sync_dirty |= SYNC_TIMER_REGS; >> >> which cpu_synchronize_state() could use to make sure it doesn't read back any >> timer regs again. > > yes > >> It would obviously also have to set it in its normal operation >> when it executed first ;). > > We also need get all registers like for debugging and we should have flag > like ALL_REGS etc.
Well, we could just take the current level of "general purpose registers" as one flag. > > But I general I agree with the idea. > >> >>> } >>> >>> Change the kvm_arch_put_registers() to add followings: >>> >>> int kvm_arch_put_registers(CPUArchState *env, int level) { >>> >>> If (env->updated_regs_type) { >>> struct kvm_sregs sregs; >>> If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) { >>> // update sregs->timer_registers; >>> // may be we can have a bitmap to tell kernel for what >> actually updated >>> } >>> If (env->updated_regs_type & XX_CPU_REGS_TYPE) { >>> // update respective registers in sregs >>> } >>> >>> ioctl(env, KVM_GET_SREGS, &sregs); >>> } >>> } >>> >>> This mechanism will get all registers state but this is required only once >>> per >> entry in QEMU. >> >> I don't understand this bit >> >>> >>> >>> Option 2: >>> Define read_regs_type() and write_regs_type() for cases where it requires >> multiple registers updates: >>> >>> int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) { >>> -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. >>> >>> Switch(reg_type) { >>> Case TIMER_REGS: >>> Struct *timer_regs = regs_ptr; >>> Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type); >>> Break; >>> >>> Default: >>> Printf(error); >>> } >>> >>> Similarly will be the write function: >>> int write_regs_type((CPUArchState env, int reg_type) { >>> -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. >>> >>> Switch(reg_type) { >>> Case TIMER_REGS: >>> Struct timer_regs; >>> Timer_regs.reg1 = env->timer_regs->reg1; >>> Timer_regs.reg2 = env->timer_regs->reg2; >>> Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type); >>> Break; >>> >>> Default: >>> Printf(error); >>> } >>> >>> Int timer_func(CPUxxState env) >>> { >>> Struct timer_regs; >>> read_regs_type((env, &timer_regs,TIMER_REGS); >>> //update env->timer_registers >>> Write_regs_type(env, TIMER_REGS) >>> } >>> >>> This will get one type of register_types and can cause multiple IOCTL per >> entry to QEMU. >> >> This is still too explicit. How about >> >> static inline void ppc_set_tsr(CPUState *env, target_ulong val) { >> env->kvm_sync_extra |= SYNC_TIMER_REGS; >> cpu_synchronize_registers(env); >> env->spr[SPR_TSR] = val; > > You also want env->kvm_sync_dirty also, right? Not quite, since SYNC_TIMER_REGS spans more than only TSR. So we need to make sure we fetch the non-TSR register values before we can declare TIMER_REGS as dirty. > >> } >> >> static inline void ppc_set_tcr(CPUState *env, target_ulong val) { >> env->kvm_sync_extra |= SYNC_TIMER_REGS; >> cpu_synchronize_registers(env); >> env->spr[SPR_TCR] = val; > > Same as above > >> } >> >> int timer_func(CPUState *env) { >> ppc_set_tsr(env, 0); > > SYNC_TIMERS_REGS are get only once on first call. Sure, but that's implicit. A user of these functions doesn't have to care which one comes first. Alex