On Tue, Mar 07, 2006 at 07:50:52PM +0100, Andi Kleen wrote:
>
> My vmlinux has
>
> ffffffff80278382 <pfn_to_page>:
> ffffffff80278382: 8b 0d 78 ea 41 00 mov 4319864(%rip),%ecx # ffffffff80696e00 <memnode_shift>
> ffffffff80278388: 48 89 f8 mov %rdi,%rax
> ffffffff8027838b: 48 c1 e0 0c shl $0xc,%rax
> ffffffff8027838f: 48 d3 e8 shr %cl,%rax
> ffffffff80278392: 48 0f b6 80 00 5e 69 movzbq 0xffffffff80695e00(%rax),%rax
> ffffffff80278399: 80
> ffffffff8027839a: 48 8b 14 c5 40 93 71 mov 0xffffffff80719340(,%rax,8),%rdx
> ffffffff802783a1: 80
> ffffffff802783a2: 48 2b ba 40 36 00 00 sub 0x3640(%rdx),%rdi
> ffffffff802783a9: 48 6b c7 38 imul $0x38,%rdi,%rax
> ffffffff802783ad: 48 03 82 30 36 00 00 add 0x3630(%rdx),%rax
> ffffffff802783b4: c3 retq
That's easily in the 90+ cycle range, as you've got 3 data-dependent loads
that will hit in the L2 but likely not in the L1, given that the workload is
manipulating lots of data. And that assumes the instruction scheduler gets
things right.
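To make the dependency chain concrete, here is a rough C rendering of what
the quoted pfn_to_page disassembly appears to do. All names here
(memnode_map_sketch, node_data_sketch, struct node_sketch and so on) are made
up for illustration and are not the kernel's definitions; the 56-byte element
size is read off the imul $0x38, and the table sizes and field layout are
assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: 0x38 bytes, per the imul $0x38. */
struct page_sketch { unsigned char pad[56]; };

/* Hypothetical per-node data; only the two fields the code reads. */
struct node_sketch {
    struct page_sketch *mem_map;   /* cf. add 0x3630(%rdx),%rax */
    unsigned long start_pfn;       /* cf. sub 0x3640(%rdx),%rdi */
};

static unsigned int memnode_shift_sketch;        /* cf. memnode_shift    */
static uint8_t memnode_map_sketch[256];          /* address -> node id   */
static struct node_sketch *node_data_sketch[64]; /* node id -> node data */

static struct page_sketch *pfn_to_page_sketch(unsigned long pfn)
{
    /* Load the shift count, then turn the pfn into a map index. */
    unsigned long idx = (pfn << 12) >> memnode_shift_sketch;
    assert(idx < 256);

    /* This load cannot issue until idx is known. */
    unsigned int nid = memnode_map_sketch[idx];

    /* This load cannot issue until nid is known. */
    struct node_sketch *node = node_data_sketch[nid];
    assert(node != 0);

    /* These loads cannot issue until the node pointer is known. */
    return node->mem_map + (pfn - node->start_pfn);
}
```

Each lookup needs the result of the previous one to form its address, so the
loads cannot overlap, and every miss in the L1 is paid in full and in sequence.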
> ffffffff802783b5 <page_to_pfn>:
> ffffffff802783b5: 48 8b 07 mov (%rdi),%rax
> ffffffff802783b8: 48 c1 e8 38 shr $0x38,%rax
> ffffffff802783bc: 48 8b 14 c5 80 97 71 mov 0xffffffff80719780(,%rax,8),%rdx
> ffffffff802783c3: 80
> ffffffff802783c4: 48 b8 b7 6d db b6 6d mov $0x6db6db6db6db6db7,%rax
> ffffffff802783cb: db b6 6d
> ffffffff802783ce: 48 2b ba 20 03 00 00 sub 0x320(%rdx),%rdi
> ffffffff802783d5: 48 c1 ff 03 sar $0x3,%rdi
> ffffffff802783d9: 48 0f af f8 imul %rax,%rdi
> ffffffff802783dd: 48 03 ba 28 03 00 00 add 0x328(%rdx),%rdi
> ffffffff802783e4: 48 89 f8 mov %rdi,%rax
> ffffffff802783e7: c3 retq
>
>
> Both look quite optimized to me. I haven't timed them but it would surprise
> me if P4 needed more than 20 cycles to crunch through each of them.
It's more than that because you've got the data dependencies on the loads.
Yes, imul is 10 cycles, but shift is 1.
> Where is that idiv exactly? I don't see it.
My memory seems to be failing me, I can't find it. Whoops.
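For what it's worth, the quoted page_to_pfn shows why no idiv is needed there:
the pointer difference is always an exact multiple of the element size, so
the division can be done as a shift plus a multiply by a modular inverse. The
sketch below assumes the 56-byte size implied by the imul $0x38 (56 = 8 * 7);
the function name is made up, and the constant is the one from the
disassembly, which is the inverse of 7 modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of 7 modulo 2^64 (7 * INV7 == 1 in 64-bit math). */
#define INV7 0x6db6db6db6db6db7ULL

/*
 * Hypothetical helper: divide a byte offset by 56 without a divide
 * instruction. Only valid when byte_off is an exact multiple of 56.
 */
static int64_t div56_exact(int64_t byte_off)
{
    assert(byte_off % 56 == 0);

    /* Exact divide by 8; corresponds to the sar $0x3. */
    int64_t q8 = byte_off / 8;

    /* Exact divide by 7 via the inverse; corresponds to the imul. */
    return (int64_t)((uint64_t)q8 * INV7);
}
```

The trick only works because the remainder is known to be zero; for a general
division by 56 the compiler would have to use a different multiply-and-shift
sequence.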
> Only in pathological workloads. Normally the working set is so large
> that the probability of two pages are near each other is very small.
It's hardly uncommon for struct pages to straddle cachelines or for pages to
move between CPUs with networking. Remember that we're using pages for the
data buffers in networking, so you'll have pages get freed on the wrong CPU
quite often.
Please name a benchmark that demonstrates the performance decrease you're
concerned about. I've shown you one that improves, and I think struct pages
not straddling cachelines can only be a good thing.
I know these things look like piddly little worthless optimizations, but
they add up big time. Mea culpa for not having a 10Gbit NIC to show more
"real world" applications.
-ben