On 2018/11/12 23:46, Dave Hansen wrote:
> On 11/11/18 9:38 PM, Li, Aubrey wrote:
>
>>> Do we want this, or do we want something more time-based?
>>>
>> This counter is introduced here to solve the race of context switch and
>> VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times.
>> Due to scheduling latency, 3 jiffies could only happen AVX task on-off just
>> 1 time. So IMHO the context switches number is better here.
>
> Imagine we have a HZ=1000 system where AVX_STATE_DECAY_COUNT=3.  That
> means that a task can be marked as a non-AVX-512-user after not using it
> for ~3 ms.  But, with HZ=250, that's ~12ms.
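
To make the arithmetic above concrete, here is a minimal userspace sketch
(not kernel code). It assumes the task is context-switched only by the timer
tick, once per tick, which is the slowest case. decay_window_ms() and
switches_in_window() are hypothetical helper names, not part of the patch.

```c
#include <assert.h>

/*
 * Illustration only. Assumption: exactly one context switch per timer
 * tick, so one switch corresponds to 1000/HZ milliseconds.
 */

/* Wall-clock time, in ms, spanned by 'count' tick-driven context switches. */
unsigned int decay_window_ms(unsigned int count, unsigned int hz)
{
	assert(hz > 0);
	return count * 1000 / hz;
}

/* Number of tick-driven context switches that fit into 'ms' milliseconds. */
unsigned int switches_in_window(unsigned int ms, unsigned int hz)
{
	assert(hz > 0);
	return ms * hz / 1000;
}
```

Under that assumption, decay_window_ms(3, 1000) is 3 and decay_window_ms(3, 250)
is 12, matching the ~3 ms and ~12 ms figures quoted above. The inverse helper
gives how many tick-driven switches fit into a fixed time window.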

From the other side, if we set a 4ms decay, then with HZ=1000 the context
switch count is 4, which means we get 4 chances to maintain the AVX state,
that is, we are able to filter out 4 init state resets. But with HZ=250 the
context switch count is 1, so we get only 1 chance to filter out an init
state reset.

>
> Also, don't forget that we have context switches from the timer
> interrupt, but also from normal old operations that sleep.
>
> Let's say our AVX-512 app was doing:
>
> 	while (foo) {
> 		do_avx_512();
> 		read(pipe, buf, len);
> 		read(pipe, buf, len);
> 		read(pipe, buf, len);
> 	}
>
> And all three pipe reads context-switched the task.  That loop could
> finish in way under 3HZ, but still end up in do_avx_512() each time with
> fpu...avx->state=0.

Yeah, we are trying to predict based on the historical pattern, so you can
always construct a pattern that beats the prediction. But in practice, I
measured tensorflow with AVX512 enabled, linpack with AVX512, and a micro
benchmark, and the current decay of 3 context switches works well enough.

>
> BTW, I don't have a great solution for this.  I was just pointing out
> one of the pitfalls from using context switch counts so strictly.

I really don't think a time-based decay is better than the count in this case.

>>>> +/*
>>>>   * Highest level per task FPU state data structure that
>>>>   * contains the FPU register state plus various FPU
>>>>   * state fields:
>>>> @@ -303,6 +312,14 @@ struct fpu {
>>>>  	unsigned char			initialized;
>>>>
>>>>  	/*
>>>> +	 * @avx_state:
>>>> +	 *
>>>> +	 * This data structure indicates whether this context
>>>> +	 * contains AVX states
>>>> +	 */
>>>
>>> Yeah, that's precisely what fpu->state.xsave.xfeatures does. :)
>>>
I see, will refine in the next version

>
> One other thought about the new 'avx_state':
>
> fxregs_state (which is a part of the XSAVE state) has some padding and
> 'sw_reserved' areas.  You *might* be able to steal some space there.
> Not that this is a huge space eater, but why waste the space if we don't
> have to?
>

IMHO, I'd prefer not to add anything extra to a data structure associated
with a hardware table. Let me try to work out a new version to see if it can
satisfy you.

Thanks,
-Aubrey