Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

2022-07-23 Thread 'Keith Randall' via golang-nuts
Yes, I think this is the extra LEAQ that appears in the loop. Ideally it 
would be lifted out of the loop. I think that is 
https://github.com/golang/go/issues/15808


Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

2022-07-22 Thread Taj Khattra
I get similar results with 1.18 (inline slower than noinline),
but different results with 1.16, 1.17, and 1.19rc2 (inline faster than
noinline).

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 5600X 6-Core Processor

 1.16.15
BenchmarkNoInline-12    125717362     9.607 ns/op
BenchmarkInline-12      150066394     8.721 ns/op

BenchmarkNoInline-12    125476344     9.710 ns/op
BenchmarkInline-12      133781608     8.851 ns/op

 1.17.10
BenchmarkNoInline-12            1    10.14 ns/op
BenchmarkInline-12      135818722     8.646 ns/op

BenchmarkNoInline-12    123817206    10.61 ns/op
BenchmarkInline-12      137691572     8.754 ns/op

 1.18.4
BenchmarkNoInline-12    121646458    10.13 ns/op
BenchmarkInline-12       81420973    14.65 ns/op

BenchmarkNoInline-12    123927972    10.05 ns/op
BenchmarkInline-12       81371038    14.64 ns/op

 1.19rc2
BenchmarkNoInline-12    120799062     9.864 ns/op
BenchmarkInline-12      147306990     8.579 ns/op

BenchmarkNoInline-12    120426837    10.17 ns/op
BenchmarkInline-12      129029052     8.621 ns/op
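
To put a number on the gap, here is a small sketch (not part of the original message) that computes how much slower or faster the inlined benchmark is relative to the non-inlined one, using the first run of each Go version above. The type and function names are made up for illustration.

```go
package main

import "fmt"

// run holds one pair of ns/op figures copied from the results above.
// (Hypothetical type, for illustration only.)
type run struct {
	version  string
	noInline float64 // ns/op reported for BenchmarkNoInline
	inline   float64 // ns/op reported for BenchmarkInline
}

// slowdownPct reports how much slower (positive) or faster (negative)
// the inlined benchmark is relative to the non-inlined one, in percent.
func slowdownPct(r run) float64 {
	return (r.inline - r.noInline) / r.noInline * 100
}

func main() {
	// First run of each version from the table above.
	runs := []run{
		{"1.16.15", 9.607, 8.721},
		{"1.17.10", 10.14, 8.646},
		{"1.18.4", 10.13, 14.65},
		{"1.19rc2", 9.864, 8.579},
	}
	for _, r := range runs {
		fmt.Printf("go%-8s inline vs noinline: %+.1f%%\n", r.version, slowdownPct(r))
	}
}
```

Only 1.18.4 should come out positive (inline slower); the other three versions come out negative.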


Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

2022-07-22 Thread Kevin Chowski
Sorry for the double-post. I just realized that the version I posted before
was the manually-inlined version that I made as part of testing. For
completeness, here's the non-manually-inlined version, which seems to have
the same performance characteristics (and probably exactly the same machine
code, though I didn't double-check): https://go.dev/play/p/h1K38Bq7Otv


Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

2022-07-22 Thread Kevin Chowski
Datapoint: same with windows/amd64 on Intel (running 1.19beta1):

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4     77425848    14.34 ns/op
BenchmarkInline-4       59108932    20.58 ns/op
PASS
ok  common/sandbox  2.645s

Looking at the disassembly, I noticed that in the Inline case there was a
7-byte `lea 0xXX(%rip),%rbx` in the tight inner loop due to some
really proactive constant propagation (I hypothesize). If you manually
defeat the propagation by storing the string in a global and manually
copying it onto the stack, the inlined version becomes faster than NoInline
again: https://go.dev/play/p/VRgJP2y7joS

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4                 81436539    14.08 ns/op
BenchmarkInline-4                   59255162    21.32 ns/op
BenchmarkInlineDefeatConstProp-4    97524828    12.57 ns/op
PASS
ok  common/sandbox  5.111s
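
For readers who cannot open the playground links, here is a rough sketch of the three variants being compared. It is not the code from the thread: the function names, the `needle` global, and the string comparison are all assumptions chosen to illustrate the idea of copying a global into a local so the inlined call no longer sees a literal. Whether the compiler actually emits the extra `lea` for this exact code depends on the Go version.

```go
package main

import "fmt"

// needle is a package-level variable, not a constant, so its value is only
// known at run time. (Hypothetical name; the original benchmark is not shown.)
var needle = "hello"

// match has no directive, so the compiler is free to inline it.
func match(s, want string) bool { return s == want }

// matchNoInline is the same function with inlining disabled.
//
//go:noinline
func matchNoInline(s, want string) bool { return s == want }

// countInline passes a string literal, which the compiler may propagate
// into the inlined comparison inside the loop.
func countInline(items []string) int {
	n := 0
	for _, s := range items {
		if match(s, "hello") {
			n++
		}
	}
	return n
}

// countNoInline makes a real call on every iteration.
func countNoInline(items []string) int {
	n := 0
	for _, s := range items {
		if matchNoInline(s, "hello") {
			n++
		}
	}
	return n
}

// countDefeatConstProp copies the global into a local before the loop, so the
// comparison uses a run-time value instead of a literal.
func countDefeatConstProp(items []string) int {
	want := needle
	n := 0
	for _, s := range items {
		if match(s, want) {
			n++
		}
	}
	return n
}

func main() {
	items := []string{"hello", "world", "hello"}
	fmt.Println(countInline(items), countNoInline(items), countDefeatConstProp(items))
}
```

All three functions compute the same result; only the code the compiler generates for the loop body differs.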


Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

2022-07-22 Thread 'Michael Pratt' via golang-nuts
I can reproduce similar behavior on linux-amd64:

$ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=1x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkInline-12      1       16.78 ns/op
PASS

 Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=1x':

          1,691.95 msec task-clock:u          #    1.004 CPUs utilized
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               352      page-faults:u         #  208.044 /sec
     6,732,752,072      cycles:u              #    3.979 GHz
    22,405,823,428      instructions:u        #    3.33  insn per cycle
     6,501,294,164      branches:u            #    3.842 G/sec
           149,596      branch-misses:u       #    0.00% of all branches

       1.684677260 seconds time elapsed

       1.692474000 seconds user
       0.00402 seconds sys



$ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=1x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkNoInline-12    1       10.79 ns/op
PASS

 Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=1x':

          1,091.71 msec task-clock:u          #    1.005 CPUs utilized
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               363      page-faults:u         #  332.505 /sec
     4,490,159,750      cycles:u              #    4.113 GHz
    20,205,764,499      instructions:u        #    4.50  insn per cycle
     6,701,281,015      branches:u            #    6.138 G/sec
           586,073      branch-misses:u       #    0.01% of all branches

       1.086302272 seconds time elapsed

       1.08771 seconds user
       0.008027000 seconds sys

The non-inlined version actually executes fewer instructions to run the same
benchmark, which surprises me because, naively looking at the disassembly,
the inlined version seems much more compact.
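
As a quick cross-check of the two perf runs, the sketch below recomputes instructions per cycle and the relative differences from the raw counters above. The struct and helper names are made up for illustration; only the counter values come from the perf output.

```go
package main

import "fmt"

// counters holds the two perf counters needed for instructions per cycle.
// (Hypothetical type, for illustration only.)
type counters struct {
	name         string
	instructions float64
	cycles       float64
}

// ipc returns instructions retired per cycle.
func ipc(c counters) float64 { return c.instructions / c.cycles }

// pctMore returns how much larger a is than b, in percent.
func pctMore(a, b float64) float64 { return (a - b) / b * 100 }

func main() {
	// Values copied from the instructions:u and cycles:u lines above.
	inline := counters{"BenchmarkInline", 22405823428, 6732752072}
	noInline := counters{"BenchmarkNoInline", 20205764499, 4490159750}

	for _, c := range []counters{inline, noInline} {
		fmt.Printf("%-18s %.2f insn per cycle\n", c.name, ipc(c))
	}
	fmt.Printf("inline: %.1f%% more instructions, %.1f%% more cycles\n",
		pctMore(inline.instructions, noInline.instructions),
		pctMore(inline.cycles, noInline.cycles))
}
```

By these counters the inlined run executes about 11% more instructions but takes about 50% more cycles, so most of the gap comes from lower instructions per cycle rather than from the extra instructions alone.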


On Fri, Jul 22, 2022 at 5:52 AM eric...@arm.com  wrote:

> For this piece of code, the two test functions are the same, but one is
> inlined and the other is not. However, the inlined version is about 25%
> slower than the non-inlined version on an Apple M1 chip. Why is that?
>
> The code is here https://go.dev/play/p/0NkLMtTZtv4
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CALoThU8pAFzz_CGEQ1c4J_tEiLdyeu6kLkkYNjGZKkaLeTgYhw%40mail.gmail.com.