On Tue, Mar 07, 2006 at 07:50:52PM +0100, Andi Kleen wrote:
> 
> My vmlinux has
> 
> ffffffff80278382 <pfn_to_page>:
> ffffffff80278382:       8b 0d 78 ea 41 00       mov    4319864(%rip),%ecx        # ffffffff80696e00 <memnode_shift>
> ffffffff80278388:       48 89 f8                mov    %rdi,%rax
> ffffffff8027838b:       48 c1 e0 0c             shl    $0xc,%rax
> ffffffff8027838f:       48 d3 e8                shr    %cl,%rax
> ffffffff80278392:       48 0f b6 80 00 5e 69    movzbq 0xffffffff80695e00(%rax),%rax
> ffffffff80278399:       80 
> ffffffff8027839a:       48 8b 14 c5 40 93 71    mov    0xffffffff80719340(,%rax,8),%rdx
> ffffffff802783a1:       80 
> ffffffff802783a2:       48 2b ba 40 36 00 00    sub    0x3640(%rdx),%rdi
> ffffffff802783a9:       48 6b c7 38             imul   $0x38,%rdi,%rax
> ffffffff802783ad:       48 03 82 30 36 00 00    add    0x3630(%rdx),%rax
> ffffffff802783b4:       c3                      retq   

That's easily in the 90+ cycle range, as you've got 3 data-dependent loads 
which will hit in the L2, but likely not in the L1, given that the workload is 
manipulating lots of data.  And that's assuming the instruction scheduler gets 
things right.
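
For reference, this is roughly what that code corresponds to in C on a 
2.6-era x86_64 NUMA (DISCONTIGMEM) build.  The names approximate the kernel 
sources of the time, so treat it as a sketch rather than the exact 
implementation:

#define PAGE_SHIFT 12			/* matches the shl $0xc above */

/* 56 bytes here, matching the imul $0x38 */
struct page { unsigned long flags; unsigned long _pad[6]; };

typedef struct pglist_data {
	struct page *node_mem_map;	/* base of this node's struct page array */
	unsigned long node_start_pfn;	/* first pfn belonging to this node */
	/* ... */
} pg_data_t;

extern int memnode_shift;		/* the global read via %rip above */
extern unsigned char memnodemap[];	/* physical-address chunk -> node id */
extern pg_data_t *node_data[];		/* per-node pg_data_t pointers */

static struct page *pfn_to_page_sketch(unsigned long pfn)
{
	/* load 1: one byte of memnodemap[] gives the node id */
	int nid = memnodemap[(pfn << PAGE_SHIFT) >> memnode_shift];
	/* load 2 (depends on load 1): that node's pg_data_t */
	pg_data_t *pgdat = node_data[nid];
	/* load 3 (depends on load 2): node_start_pfn / node_mem_map, then
	 * pointer arithmetic scaled by sizeof(struct page) */
	return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
}

Each load can't issue until the previous one has completed, which is where 
the latency stacks up.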

> ffffffff802783b5 <page_to_pfn>:
> ffffffff802783b5:       48 8b 07                mov    (%rdi),%rax
> ffffffff802783b8:       48 c1 e8 38             shr    $0x38,%rax
> ffffffff802783bc:       48 8b 14 c5 80 97 71    mov    0xffffffff80719780(,%rax,8),%rdx
> ffffffff802783c3:       80 
> ffffffff802783c4:       48 b8 b7 6d db b6 6d    mov    $0x6db6db6db6db6db7,%rax
> ffffffff802783cb:       db b6 6d 
> ffffffff802783ce:       48 2b ba 20 03 00 00    sub    0x320(%rdx),%rdi
> ffffffff802783d5:       48 c1 ff 03             sar    $0x3,%rdi
> ffffffff802783d9:       48 0f af f8             imul   %rax,%rdi
> ffffffff802783dd:       48 03 ba 28 03 00 00    add    0x328(%rdx),%rdi
> ffffffff802783e4:       48 89 f8                mov    %rdi,%rax
> ffffffff802783e7:       c3                      retq   
> 
> 
> Both look quite optimized to me. I haven't timed them but it would surprise 
> me 
> if P4 needed more than 20 cycles to crunch through each of them.

It's more than that because you've got the data dependencies on the loads.  
Yes, imul is 10 cycles, but a shift is 1.

> Where is that idiv exactly? I don't see it.

My memory seems to be failing me, I can't find it.  Whoops.
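
My guess is gcc strength-reduced it away: 0x6db6db6db6db6db7 is the 
multiplicative inverse of 7 mod 2^64, so the pointer-difference divide by 
sizeof(struct page) == 56 becomes the sar $0x3 / imul pair in the dump above 
and no idiv is ever emitted.  Roughly, in C, reusing the declarations from 
the pfn_to_page sketch above (again, names are approximate):

#define NODES_PGSHIFT 56		/* node id in the top byte of
					 * page->flags, hence the shr $0x38 */

static unsigned long page_to_pfn_sketch(struct page *page)
{
	/* load 1: page->flags, node id in the top bits */
	unsigned long nid = page->flags >> NODES_PGSHIFT;
	/* load 2 (depends on load 1): that node's pg_data_t */
	pg_data_t *pgdat = node_data[nid];
	/* load 3 (depends on load 2): node_mem_map / node_start_pfn; the
	 * pointer subtraction is the sar + imul-by-inverse sequence above */
	return (page - pgdat->node_mem_map) + pgdat->node_start_pfn;
}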

> Only in pathological workloads. Normally the working set is so large 
> that the probability of two pages are near each other is very small.

It's hardly uncommon for struct pages to cross cachelines or for pages to 
move between CPUs with networking.  Remember that we're using pages for the 
data buffers in networking, so pages will quite often get freed on the wrong 
CPU.
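
To put a number on the cacheline point: with the 56-byte struct page implied 
by the imul $0x38 above and 64-byte lines, 6 out of every 8 mem_map entries 
straddle a line boundary, so dirtying one page's count can drag a 
neighbour's struct page back and forth between CPUs.  A trivial userspace 
illustration (the sizes are assumptions, not taken from your config):

#include <stdio.h>

int main(void)
{
	const unsigned long linesz = 64;	/* assumed cacheline size */
	const unsigned long pagesz = 56;	/* sizeof(struct page), per the imul $0x38 */

	for (unsigned long i = 0; i < 8; i++) {
		unsigned long first = i * pagesz, last = first + pagesz - 1;
		printf("struct page %lu: bytes %3lu-%3lu  %s\n", i, first, last,
		       first / linesz == last / linesz ?
		       "fits in one line" : "straddles a line boundary");
	}
	return 0;
}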

Please name some benchmarks that demonstrate your concern about decreased 
performance.  I've shown you one that improves, and I think struct pages not 
overlapping cachelines is only a good thing.

I know these things look like piddly little worthless optimizations, but 
they add up big time.  Mea culpa for not having a 10Gbit NIC to show more 
"real world" applications.

                -ben