* Peter Zijlstra <[email protected]> wrote:

> >     mask = x86_pmu.lbr_nr - 1;
> > -   tos = intel_pmu_lbr_tos();
> > +   tos = task_ctx->tos;
> >     for (i = 0; i < tos; i++) {
> >             lbr_idx = (tos - i) & mask;
> >             wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> >             if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> >                     wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> >     }
> > +   wrmsrl(x86_pmu.lbr_tos, tos);
> >     task_ctx->lbr_stack_state = LBR_NONE;
> >  }
> 
> Any idea how much more expensive that wrmsr() is compared to the rdmsr() it 
> replaces?
> 
> If it's significant we could think about having this behaviour depend on 
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical Intel 
WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference:

[  170.798574] x86/bench: -------------------------------------------------------------------
[  170.807258] x86/bench: |                 RDTSC-cycles:    hot  (±noise) /   cold  (±noise)
[  170.816115] x86/bench: -------------------------------------------------------------------
[  212.146982] x86/bench: rdtsc                         :     16           /     60
[  213.725998] x86/bench: rdmsr                         :    100           /    148
[  215.469958] x86/bench: wrmsr                         :    456           /    708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So the WRMSR costs roughly 350-560 RDTSC cycles more than the RDMSR it 
replaces (about 356 cache-hot, 560 cache-cold) ...

Thanks,

        Ingo