On 02/22/2018 01:52 PM, Linus Torvalds wrote:
> Side note - and this may be crazy talk - I wonder if it might make
> sense to have a mode where we allow executable read-only kernel pages
> to be marked global too (but only in the kernel mapping).
We did that accidentally, somewhere.  It causes machine checks on K8s,
IIRC, which is fun (52994c256df fixed it).  So, we'd need to make sure
we avoid it there, or just make it global in the user mapping too.

> Of course, maybe the performance advantage from keeping the ITLB
> entries around isn't huge, but this *may* be worth at least asking
> some Intel architects about?

I kinda doubt it's worth the trouble.  Like you said, this probably
doesn't even matter when we have PCID support.

Also, we'll usually map all of this text with 2M pages, minus whatever
hangs over into the last 2M page of text.  My laptop looks like this:

> 0xffffffff81000000-0xffffffff81c00000    12M  ro  PSE  x  pmd
> 0xffffffff81c00000-0xffffffff81c0b000    44K  ro       x  pte

So, even if we've flushed these entries, we can get all of them back
with a single cacheline worth of PMD entries.

Just for fun, I tried a 4-core Skylake system with KPTI and nopcid and
compiled a random kernel 10 times, in three configs: no global pages,
all kernel text global plus the cpu_entry_area, and only the
cpu_entry_area plus entry text.  The delta percentages are from the
baseline.  The deltas are measurable, but the largest bang for our
buck is obviously the entry text:

                         User Time      Kernel Time    Clock Elapsed
Baseline  (33 GLB PTEs)  907.6          81.6           264.7
Entry     (28 GLB PTEs)  910.9 (+0.4%)  84.0 (+2.9%)   265.2 (+0.2%)
No global ( 0 GLB PTEs)  914.2 (+0.7%)  89.2 (+9.3%)   267.8 (+1.2%)

It's a single line of code to go from the "33" to the "28"
configuration, so it's totally doable.  But it means having and
parsing another boot option that confuses people, and then I have to
go write actual documentation, which I detest. :)  My inclination
would be to leave just the "entry" stuff global, as this set does, and
call it done.

I also measured frontend stalls with the toplev.py tool[1].
They show roughly the same thing, but a bit magnified, since I was
only monitoring the kernel, and because in some of these cases, even
if we stop being iTLB-bound, we just bottleneck on something else.

I ran:

	python ~/pmu-tools/toplev.py --kernel --level 3 make -j8

and looked for the relevant ITLB misses in the output:

Baseline:
> FE   Frontend_Bound:  24.33 % Slots   [ 7.68%]
> FE     ITLB_Misses:    5.16 % Clocks  [ 7.73%]

Entry:
> FE   Frontend_Bound:  26.62 % Slots   [ 7.75%]
> FE     ITLB_Misses:   12.50 % Clocks  [ 7.74%]

No global:
> FE   Frontend_Bound:  27.58 % Slots   [ 7.65%]
> FE     ITLB_Misses:   14.74 % Clocks  [ 7.71%]

1. https://github.com/andikleen/pmu-tools
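P.S. For anyone checking the "single cacheline worth of PMD entries"
arithmetic above, here is a quick sketch, assuming x86-64's usual
64-byte cachelines and 8-byte page-table entries:

```python
# How much kernel text does one cacheline of PMD entries map?
# Assumes x86-64: 64-byte cachelines, 8-byte PMD entries, 2M pages.
CACHELINE_BYTES = 64
PMD_ENTRY_BYTES = 8
PMD_PAGE_SIZE = 2 * 1024 * 1024        # each PMD entry maps one 2M page

entries_per_cacheline = CACHELINE_BYTES // PMD_ENTRY_BYTES  # 8 entries
coverage = entries_per_cacheline * PMD_PAGE_SIZE            # 8 * 2M = 16M

kernel_text = 12 * 1024 * 1024         # the 12M of 2M-mapped text above
print(coverage // (1024 * 1024), "MB mapped per cacheline of PMDs")
print("12M of text fits in one cacheline of PMDs:",
      kernel_text <= coverage)
```

So the whole 12M of PSE-mapped text is refilled from a single 64-byte
cacheline of PMD entries, which is why losing those iTLB entries is
relatively cheap.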
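The delta percentages in the kernel-compile table can be reproduced
from the raw times; a quick check, with the numbers copied straight
from the table:

```python
# Recompute the (+x.y%) deltas in the benchmark table from the raw
# User/Kernel/Elapsed times, relative to the 33-GLB-PTE baseline.
baseline  = {"user": 907.6, "kernel": 81.6, "elapsed": 264.7}
entry     = {"user": 910.9, "kernel": 84.0, "elapsed": 265.2}
no_global = {"user": 914.2, "kernel": 89.2, "elapsed": 267.8}

def delta_pct(new, base):
    """Percentage slowdown of 'new' relative to 'base'."""
    return 100.0 * (new - base) / base

for name, run in (("Entry", entry), ("No global", no_global)):
    deltas = {k: round(delta_pct(run[k], baseline[k]), 1) for k in run}
    print(name, deltas)
```

The kernel-time column is where the pain shows up: +2.9% with only the
entry text global, +9.3% with no global pages at all.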
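If anyone wants to pull those toplev numbers out of a run
programmatically, a small parser sketch; the regex is based only on
the sample lines quoted above, not on any guaranteed toplev output
format, and parse_toplev() is a hypothetical helper, not part of
pmu-tools:

```python
import re

# Parse toplev-style lines such as:
#   FE   ITLB_Misses:   14.74 % Clocks  [ 7.71%]
# into {metric: percentage} pairs.
LINE_RE = re.compile(r"FE\s+(\w+):\s+([\d.]+)\s*%")

def parse_toplev(text):
    """Extract FE-level metric percentages from toplev-style output."""
    return {m.group(1): float(m.group(2))
            for m in LINE_RE.finditer(text)}

sample = """FE   Frontend_Bound:  27.58 % Slots   [ 7.65%]
FE     ITLB_Misses:   14.74 % Clocks  [ 7.71%]"""
print(parse_toplev(sample))
```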