On 11/11/18 9:38 PM, Li, Aubrey wrote: > If there is a valid state in the AVX registers, we can say the tasks contains > AVX instructions, can't we?
XRSTOR, for instance, can take XSAVE state out of the init state, but it is not necessarily an AVX instruction. In fact, we had a kernel bug along the way that was accidentally doing this on systems without AVX. >>> +#define AVX_STATE_DECAY_COUNT 3 >> >> How was this number chosen? What does this mean? >> >> It appears that this is saying that after 3 non-AVX-512-using context >> switches, the task is not considered to be using AVX512 any more. That >> seems a bit goofy because the context switch rate is highly dependent on >> HZ, and on how often the task yields. >> >> Do we want this, or do we want something more time-based? >> > This counter is introduced here to solve the race of context switch and > VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times. > Due to scheduling latency, 3 jiffies could only happen AVX task on-off just > 1 time. So IMHO the context switches number is better here. Imagine we have a HZ=1000 system where AVX_STATE_DECAY_COUNT=3. That means that a task can be marked as a non-AVX-512-user after not using it for ~3 ms. But, with HZ=250, that's ~12ms. Also, don't forget that we have context switches from the timer interrupt, but also from normal old operations that sleep. Let's say our AVX-512 app was doing: while (foo) { do_avx_512(); read(pipe, buf, len); read(pipe, buf, len); read(pipe, buf, len); } And all three pipe reads context-switched the task. That loop could finish in way under 3HZ, but still end up in do_avx_512() each time with fpu...avx->state=0. BTW, I don't have a great solution for this. I was just pointing out one of the pitfalls from using context switch counts so strictly. >> I'd really like if this looked like this: >> >> if (!have_avx_state_detect()) { >> avx->state = 0; >> return; >> } >> >> Then, in have_avx_state_detect(), explain what XGETBV1 does. BTW, I >> don't think we *totally* need to duplicate the SDM definitions in kernel >> code for each instruction. It's fine to just say that it set 1 for >> features not in the init state. >> > > Thomas suggested open this inline since this is only usage of xgetbv1. So I'll > use static_cpu_has() here directly. Does it sound good? Yeah, that's OK. But, I'd still limit the comments to what it is doing instead of defining the instruction. >>> + /* >>> + * XINUSE is dynamic to track component state because VZEROUPPER >>> + * happens on every function end and reset the bitmap to the >>> + * initial configuration. >> >> This is confusing to me because VZEROUPPER is not apparently involved >> here. What is this trying to say? >> > VZERROUPPER instruction in the task resets the bitmap. I think this is too much detail for a kernel comment. Let's just say that the task itself can get back to the 'init state'. >>> + * State decay is introduced to solve the race condition between >>> + * context switch and a function end. State is aggressively set >>> + * once it's detected but need to be cleared by decay 3 context >>> + * switches >>> + */ >> >> I'd probably say: >> >> AVX512-using processes frequently set AVX512 back to the init >> state themselves. Thus, this detection mechanism can miss. >> The decay ensures that false-negatives do not immediately make >> a task be considered as not using AVX512. > > Thanks, will refine it in the next version. > And yes, AVX512-using processoes set AVX512 back to the init state themselves > by VZEROUPPER instructions. ... or XRSTOR, or an in-kernel XRSTOR from a signal entry. The point is, I'd probably not actually mention VZEROUPPER. It's certainly the common way to init the state, but let's not the *only* way we have to be concerned with. >>> + if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) { >> >> This is *just* an AVX512 state, right? That isn't reflected in any >> comments or naming. Also, why just this state, and not any of the other >> AVX512-related states? > > Only this state causes significant frequency drop, other states not. OK. This is an extremely crucial bit of information to capture, including the background. >>> +/* >>> * Highest level per task FPU state data structure that >>> * contains the FPU register state plus various FPU >>> * state fields: >>> @@ -303,6 +312,14 @@ struct fpu { >>> unsigned char initialized; >>> >>> /* >>> + * @avx_state: >>> + * >>> + * This data structure indicates whether this context >>> + * contains AVX states >>> + */ >> >> Yeah, that's precisely what fpu->state.xsave.xfeatures does. :) >> I see, will refine in the next version One other thought about the new 'avx_state': fxregs_state (which is a part of the XSAVE state) has some padding and 'sw_reserved' areas. You *might* be able to steal some space there. Not that this is a huge space eater, but why waste the space if we don't have to?