On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> wrote:
>
> > I obtained perf stat results for following benchmark runs:
> >
> > -O2:
> >
> >     7856832.692380      task-clock (msec)    #    1.000 CPUs utilized
> >               3758      context-switches     #    0.000 K/sec
> >                 40      cpu-migrations       #    0.000 K/sec
> >              40847      page-faults          #    0.005 K/sec
> >      7856782413676      cycles               #    1.000 GHz
> >      6034510093417      instructions         #    0.77  insn per cycle
> >       363937274287      branches             #   46.321 M/sec
> >        48557110132      branch-misses        #   13.34% of all branches
>
> (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> enough for this kind of code)
>
> > -O2 with orthonl inlined:
> >
> >     8319643.114380      task-clock (msec)    #    1.000 CPUs utilized
> >               4285      context-switches     #    0.001 K/sec
> >                 28      cpu-migrations       #    0.000 K/sec
> >              40843      page-faults          #    0.005 K/sec
> >      8319591038295      cycles               #    1.000 GHz
> >      6276338800377      instructions         #    0.75  insn per cycle
> >       467400726106      branches             #   56.180 M/sec
> >        45986364011      branch-misses        #    9.84% of all branches
>
> So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> that extra instructions are appearing in this loop nest, but not in the innermost
> loop. As a reminder for others, the innermost loop has only 3 iterations.
>
> > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> >
> >     8207331.088040      task-clock (msec)    #    1.000 CPUs utilized
> >               2266      context-switches     #    0.000 K/sec
> >                 32      cpu-migrations       #    0.000 K/sec
> >              40846      page-faults          #    0.005 K/sec
> >      8207292032467      cycles               #    1.000 GHz
> >      6035724436440      instructions         #    0.74  insn per cycle
> >       364415440156      branches             #   44.401 M/sec
> >        53138327276      branch-misses        #   14.58% of all branches
>
> This seems to match baseline in terms of instruction count, but without PRE
> the loop nest may be carrying some dependencies over memory. I would simply
> check the assembly for the entire 6-level loop nest in question, I hope it's
> not very complicated (though Fortran array addressing...).
>
> > -O2 with orthonl inlined and hoisting disabled:
> >
> >     7797265.206850      task-clock (msec)    #    1.000 CPUs utilized
> >               3139      context-switches     #    0.000 K/sec
> >                 20      cpu-migrations       #    0.000 K/sec
> >              40846      page-faults          #    0.005 K/sec
> >      7797221351467      cycles               #    1.000 GHz
> >      6187348757324      instructions         #    0.79  insn per cycle
> >       461840800061      branches             #   59.231 M/sec
> >        26920311761      branch-misses        #    5.83% of all branches
>
> There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> I don't think the former fully covers the latter (there's also a 90e9 reduction
> in insn count).
>
> Given that the inner loop iterates only 3 times, my main suggestion is to
> consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> each iterating only 3 times).
>
> > Perf profiles for
> > -O2 -fno-code-hoisting and inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> >      3196866│1f04:   ldur   d1, [x1, #-248]
> > 216348301800│        add    w0, w0, #0x1
> >       985098│        add    x2, x2, #0x18
> > 216215999206│        add    x1, x1, #0x48
> > 215630376504│        fmul   d1, d5, d1
> > 863829148015│        fmul   d1, d1, d6
> > 864228353526│        fmul   d0, d1, d0
> > 864568163014│        fmadd  d2, d0, d16, d2
> >             │        cmp    w0, #0x4
> > 216125427594│      ↓ b.eq   1f34
> >     15010377│        ldur   d0, [x2, #-8]
> > 143753737468│      ↑ b      1f04
> >
> > -O2 with inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> > 359871503840│1ef8:   ldur   d15, [x1, #-248]
> > 144055883055│        add    w0, w0, #0x1
> >  72262104254│        add    x2, x2, #0x18
> > 143991169721│        add    x1, x1, #0x48
> > 288648917780│        fmul   d15, d17, d15
> > 864665644756│        fmul   d15, d15, d18
> > 863868426387│        fmul   d14, d15, d14
> > 865228159813│        fmadd  d16, d14, d31, d16
> >       245967│        cmp    w0, #0x4
> > 215396760545│      ↓ b.eq   1f28
> >    704732365│        ldur   d14, [x2, #-8]
> > 143775979620│      ↑ b      1ef8
>
> This indicates that the loop only covers about 46-48% of overall time.
>
> High count on the initial ldur instruction could be explained if the loop
> is not entered by "fallthru" from the preceding block, or if its backedge
> is mispredicted. Sampling mispredictions should be possible with perf record,
> and you may be able to check if loop entry is fallthrough by inspecting
> assembly.
>
> It may also be possible to check if code alignment matters, by compiling with
> -falign-loops=32.

Hi,
Thanks a lot for the detailed feedback, and I am sorry for late response.

The hoisting region is:

    if(mattyp.eq.1) then
      4 loops
    elseif(mattyp.eq.2) then
      {
        orthonl inlined into basic block;
        loads w[0] .. w[8]
      }
    else
      6 loops  // load anisox

followed by basic block:

          senergy=
         &  (s11*w(1,1)+s12*(w(1,2)+w(2,1))
         &  +s13*(w(1,3)+w(3,1))+s22*w(2,2)
         &  +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
          s(ii1,jj1)=s(ii1,jj1)+senergy
          s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
          s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy

Hoisting moves the loads of w[0] .. w[8] from the orthonl and senergy
blocks up into block 181, which is:

    if (mattyp.eq.2) goto <bb 182> else goto <bb 193>

and they are then hoisted further to block 173:

    if (mattyp.eq.1) goto <bb 392> else goto <bb 181>

From block 181, we have two paths towards the senergy block (bb 194):

    bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
    AND
    bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194

The second path has a length of around 18 blocks.
(bb 194 post-dominates bb 181 and bb 173).

Disabling only load hoisting within blocks 173 and 181 (simply avoid
inserting pre_expr if pre_expr->kind == REFERENCE) avoids hoisting of
the 'w' array and brings back most of the performance. Does that confirm
that it is the hoisting of the 'w' array (w[0] ... w[8]) which is causing
the slowdown?

I obtained perf profiles for full hoisting and for hoisting of the 'w'
array disabled in the 6 loops, and the most drastic difference was for
the ldur instruction:

With full hoisting:
359871503840│1ef8:   ldur   d15, [x1, #-248]

Without full hoisting:
     3441224│1edc:   ldur   d1, [x1, #-248]

(The loop entry seems to be fall thru in both cases. I have attached
profiles for both cases).

IIUC, the instruction seems to be loading the first element from the
anisox array, which makes me wonder if the issue is a data-cache miss
in the slower version. I ran perf script on perf data for
L1-dcache-load-misses with period = 1 million, and it reported two cache
misses on the ldur instruction in the full hoisting case, while it
reported zero for the disabled load hoisting case.
So I wonder if the slowdown happens because hoisting of the 'w' array
possibly results in eviction of anisox, thus causing a cache miss inside
the inner loop and making the load slower?

Hoisting also seems to reduce the overall number of cache misses, though.
With hoisting of the 'w' array disabled there were a total of 463 cache
misses, while with full hoisting there were 357 cache misses (with
period = 1 million). Does that happen because hoisting probably reduces
cache misses along the orthonl path
(bb 173 -> bb 181 -> bb 182 -> bb 194)?

Thanks,
Prathamesh
>
> Alexander
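
The counter deltas and the "46-48%" inner-loop share discussed in the quoted text can be rechecked with a short script. This is only a sketch: the dictionary keys and helper names (`RUNS`, `INNER_LOOP`, `delta`, `inner_loop_share`) are made up for illustration, the numbers are copied from the perf stat and perf annotate excerpts in this thread, and it assumes the per-instruction counts in the annotate output are cycle counts that can be divided by the perf stat cycle totals.

```python
# Counter totals copied from the perf stat output quoted in this thread.
RUNS = {
    "O2": dict(cycles=7856782413676, instructions=6034510093417,
               branches=363937274287, branch_misses=48557110132),
    "O2_inline": dict(cycles=8319591038295, instructions=6276338800377,
                      branches=467400726106, branch_misses=45986364011),
    "O2_inline_no_pre": dict(cycles=8207292032467, instructions=6035724436440,
                             branches=364415440156, branch_misses=53138327276),
    "O2_inline_no_hoist": dict(cycles=7797221351467, instructions=6187348757324,
                               branches=461840800061, branch_misses=26920311761),
}

# Per-instruction counts of the innermost loop, copied from the two quoted
# annotate excerpts. A line without a count is entered as 0.
# Assumption: these counts are in the same unit as the "cycles" totals above.
INNER_LOOP = {
    "O2_inline_no_hoist": [3196866, 216348301800, 985098, 216215999206,
                           215630376504, 863829148015, 864228353526,
                           864568163014, 0, 216125427594, 15010377,
                           143753737468],
    "O2_inline": [359871503840, 144055883055, 72262104254, 143991169721,
                  288648917780, 864665644756, 863868426387, 865228159813,
                  245967, 215396760545, 704732365, 143775979620],
}


def delta(base, other):
    """Per-counter difference RUNS[other] - RUNS[base]."""
    return {name: RUNS[other][name] - RUNS[base][name] for name in RUNS[base]}


def inner_loop_share(run):
    """Fraction of the run's total cycles attributed to the innermost loop."""
    return sum(INNER_LOOP[run]) / RUNS[run]["cycles"]


for base, other in [("O2", "O2_inline"), ("O2_inline", "O2_inline_no_hoist")]:
    print(f"{base} -> {other}")
    for name, value in delta(base, other).items():
        print(f"  {name:14s} {value / 1e9:+9.1f}e9")

for run in INNER_LOOP:
    print(f"{run}: inner loop share {inner_loop_share(run):.1%}")
```

Keeping the raw totals in one table makes it easy to add further runs (for example the build with only 'w' hoisting disabled) once perf stat numbers for them are available.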

Attached profile, full hoisting (loop nest containing the 1ef8 inner loop):

   884982389│1e40:   ldr    x0, [sp, #448]
            │        fmov   d19, d6
   871517886│        ldr    x1, [sp, #808]
            │        add    x16, sp, #0x720
   904652642│        ldr    x13, [sp, #784]
            │        sub    x15, x26, #0x1
   892180199│        mov    x24, x27
            │        add    x28, x27, #0xf8
   881362543│        add    x22, x1, x0, lsl #3
            │        mov    x12, #0x9       // #9
   906876972│        mov    x23, #0x1       // #1
  5342906864│1e6c:   fmov   d17, d1
  2622786801│        mov    x14, #0x1778    // #6008
            │        mov    x20, x28
  2680397945│        add    x19, sp, x14
            │        mov    x18, x24
  2629152729│        mov    x21, x30
            │        ldr    d16, [x22]
  4571598336│        mov    x17, #0x1e      // #30
 15904018941│1e8c:   mov    x11, x19
  8106237022│        mov    x10, x20
            │        mov    x14, x21
  7958740225│        mov    x9, x18
            │        mov    x8, #0x1b       // #27
 41353477432│1ea0:   ldr    d14, [x9]
  1220553185│        fmov   d18, d22
 22852558475│        fmov   d20, d19
  1199867833│        mov    x3, x11
 22706386191│        mov    x7, x16
  1177543527│        mov    x6, x10
 22767111709│        fmul   d14, d17, d14
  1195454897│        mov    x5, #0x1        // #1
 94868835951│        fmadd  d16, d14, d31, d16
 48021203056│1ec4:   ldur   d15, [x6, #-248]
 30707657072│        sub    x4, x3, #0x140
 41301831015│        fmov   d14, d19
 32467499777│        mov    x2, x13
 39498561992│        mov    x1, x3
 32503985332│        mov    w0, #0x1        // #1
 39636367978│        fmul   d15, d17, d15
 56642417403│        ldr    d21, [x4, x12, lsl #3]
215900325343│        fmul   d21, d17, d21
 49939836468│        fmul   d15, d15, d20
238451679574│        fmul   d20, d21, d18
 49692127013│        fmadd  d15, d15, d31, d16
287649913912│        fmadd  d16, d20, d31, d15
359871503840│1ef8:   ldur   d15, [x1, #-248]
144055883055│        add    w0, w0, #0x1
 72262104254│        add    x2, x2, #0x18
143991169721│        add    x1, x1, #0x48
288648917780│        fmul   d15, d17, d15
864665644756│        fmul   d15, d15, d18
863868426387│        fmul   d14, d15, d14
865228159813│        fmadd  d16, d14, d31, d16
      245967│        cmp    w0, #0x4
215396760545│      ↓ b.eq   1f28
   704732365│        ldur   d14, [x2, #-8]
143775979620│      ↑ b      1ef8
  2623253706│1f28:   add    x5, x5, #0x1
 71700007726│        add    x6, x6, #0x48
   291326727│        add    x3, x3, #0x8
 41539387956│        cmp    x5, #0x4
   291327452│      ↓ b.eq   1f4c
152721910227│        ldr    d18, [x7, x15, lsl #3]
  8561615599│        add    x7, x7, #0x18
 96142935717│        ldur   d20, [x7, #-24]
  8495464096│      ↑ b      1ec4
201164546300│1f4c:   add    x8, x8, #0x1b
 22086088222│        add    x9, x9, #0xd8
  1882100212│        add    x14, x14, #0x18
 22119311849│        add    x10, x10, #0xd8
  1892034271│        add    x11, x11, #0xd8
 13413581701│        cmp    x8, #0x6c
  1191551884│      ↓ b.eq   1f70
 26310755425│        ldur   d17, [x14, #-8]
  1210506566│      ↑ b      1ea0
 71960439728│1f70:   add    x17, x17, #0x3
            │        add    x18, x18, #0x18
  8069920125│        add    x20, x20, #0x18
            │        add    x19, x19, #0x18
  4645045210│        cmp    x17, #0x27
            │      ↓ b.eq   1f90
 10962695888│        ldr    d17, [x21], #8
            │      ↑ b      1e8c
 23927242012│1f90:   add    x23, x23, #0x1
            │        str    d16, [x22]
  2672842806│        add    x16, x16, #0x8
            │        add    x12, x12, #0x9
  2653094829│        sub    x15, x15, #0x1
            │        add    x24, x24, #0x48
  2692030697│        add    x22, x22, #0x1e0
            │        cmp    x23, #0x4
  1721216607│      ↓ b.eq   1fbc
   448331273│        ldr    d19, [x13], #8
  1778236919│      ↑ b      1e6c
  7971009272│1fbc:   ldr    x0, [sp, #448]
   911313572│        add    x26, x26, #0x1
            │        add    x27, x27, #0x8
   902215785│        add    x0, x0, #0x1
            │        str    x0, [sp, #448]
   478032817│        cmp    x26, #0x4
            │      ↓ b.eq   1fe8
  1475545769│        add    x0, sp, #0x708
            │        add    x0, x0, x26, lsl #3
  1806982272│        ldur   d22, [x0, #-8]
            │      ↑ b      1e40

Attached profile, hoisting of 'w' array disabled (loop nest containing the 1edc inner loop):

   589937229│1e30:   mov    x15, #0x1760    // #5984
   904297989│        add    x0, sp, x15
   870649879│        add    x22, x0, x22
            │        fmov   d7, d24
   891274869│        ldr    x0, [sp, #448]
            │        add    x14, sp, #0x710
   909978719│        ldr    x12, [sp, #728]
            │        sub    x13, x27, #0x1
   882715766│        add    x18, x28, #0x8
            │        sub    x19, x0, x28
   885884552│        mov    x9, #0x9        // #9
            │        mov    x20, #0x1       // #1
  6279074827│1e60:   mov    x17, x22
            │        mov    x16, x30
  2666213304│        ldr    d2, [x19]
            │        mov    x15, #0x3       // #3
 18990367400│1e70:   mov    x8, x17
            │        mov    x11, x16
  8057495884│        mov    x10, #0x1b      // #27
 14947123246│1e7c:   sub    x0, x8, #0x140
 22985623052│        ldur   d5, [x11, #-8]
  1060364445│        fmov   d6, d8
 23956420799│        fmov   d3, d7
            │        add    x3, x18, x8
 24065319873│        mov    x7, x14
            │        ldr    d0, [x0, x9, lsl #3]
 24187025828│        mov    x6, x8
            │        mov    x5, #0x1        // #1
 48132474841│        fmul   d0, d5, d0
 96001335773│        fmadd  d2, d0, d16, d2
 61067761742│1ea8:   ldur   d4, [x6, #-248]
 14089308947│        sub    x4, x3, #0x140
 58091146403│        fmov   d0, d7
 14028168886│        mov    x2, x12
 57897209384│        mov    x1, x3
 13994185270│        mov    w0, #0x1        // #1
 67891460180│        fmul   d4, d5, d4
 28006688701│        ldr    d1, [x4, x9, lsl #3]
215655048826│        fmul   d1, d5, d1
 57701202743│        fmul   d3, d4, d3
230116393416│        fmul   d1, d1, d6
 57977229144│        fmadd  d2, d3, d16, d2
301775181164│        fmadd  d2, d1, d16, d2
     3441224│1edc:   ldur   d1, [x1, #-248]
216111094536│        add    w0, w0, #0x1
     1473566│        add    x2, x2, #0x18
215873683406│        add    x1, x1, #0x48
216166335905│        fmul   d1, d5, d1
864007322335│        fmul   d1, d1, d6
863815029515│        fmul   d0, d1, d0
864900327399│        fmadd  d2, d0, d16, d2
            │        cmp    w0, #0x4
216329679631│      ↓ b.eq   1f0c
    22872044│        ldur   d0, [x2, #-8]
143941131893│      ↑ b      1edc
   277804663│1f0c:   add    x5, x5, #0x1
 72179847520│        add    x6, x6, #0x48
            │        add    x3, x3, #0x8
 65738463940│        cmp    x5, #0x4
            │      ↓ b.eq   1f30
123097375558│        ldr    d6, [x7, x13, lsl #3]
            │        add    x7, x7, #0x18
 96061189670│        ldur   d3, [x7, #-24]
            │      ↑ b      1ea8
 42647845407│1f30:   add    x10, x10, #0x1b
            │        add    x11, x11, #0x18
 24141022972│        add    x8, x8, #0xd8
            │        cmp    x10, #0x6c
 14573046432│      ↑ b.ne   1e7c
 72139544087│        add    x15, x15, #0x3
  8028370830│        add    x16, x16, #0x8
            │        add    x17, x17, #0x18
  4860057143│        cmp    x15, #0xc
            │      ↑ b.ne   1e70
 23912996709│        add    x20, x20, #0x1
            │        str    d2, [x19]
  2670529487│        add    x14, x14, #0x8
            │        add    x9, x9, #0x9
  2659625346│        sub    x13, x13, #0x1
            │        add    x19, x19, #0x1e0
  1606030574│        cmp    x20, #0x4
            │      ↓ b.eq   1f80
  3096553445│        ldr    d7, [x12], #8
            │      ↑ b      1e60
  7964390214│1f80:   add    x27, x27, #0x1
            │        sub    x28, x28, #0x8
   529029469│        cmp    x27, #0x4
            │      ↓ b.eq   2028
  1176126379│        lsl    x22, x27, #3
            │        add    x0, sp, #0x6f8
   593893747│        add    x0, x0, x22
  1798781807│        ldur   d8, [x0, #-8]
   580872685│      ↑ b      1e30
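
To see which instruction of the innermost loop differs most between the two attached profiles, the annotate text can be parsed and compared position by position. The sketch below is illustrative only: `parse`, `LINE`, and the two string constants are hypothetical names, the excerpts are the 1ef8 and 1edc inner loops copied from the attachments, and pairing lines by position assumes both loops contain the same instruction sequence (they differ only in register allocation here).

```python
import re

# Innermost loop with full hoisting (label 1ef8), copied from the attachment.
FULL_HOISTING = """\
359871503840│1ef8:   ldur   d15, [x1, #-248]
144055883055│        add    w0, w0, #0x1
 72262104254│        add    x2, x2, #0x18
143991169721│        add    x1, x1, #0x48
288648917780│        fmul   d15, d17, d15
864665644756│        fmul   d15, d15, d18
863868426387│        fmul   d14, d15, d14
865228159813│        fmadd  d16, d14, d31, d16
      245967│        cmp    w0, #0x4
215396760545│      ↓ b.eq   1f28
   704732365│        ldur   d14, [x2, #-8]
143775979620│      ↑ b      1ef8
"""

# Innermost loop with hoisting of 'w' disabled (label 1edc).
NO_W_HOISTING = """\
     3441224│1edc:   ldur   d1, [x1, #-248]
216111094536│        add    w0, w0, #0x1
     1473566│        add    x2, x2, #0x18
215873683406│        add    x1, x1, #0x48
216166335905│        fmul   d1, d5, d1
864007322335│        fmul   d1, d1, d6
863815029515│        fmul   d0, d1, d0
864900327399│        fmadd  d2, d0, d16, d2
            │        cmp    w0, #0x4
216329679631│      ↓ b.eq   1f0c
    22872044│        ldur   d0, [x2, #-8]
143941131893│      ↑ b      1edc
"""

# count (may be empty) │ optional "addr:" label, optional branch arrow, insn
LINE = re.compile(r"^\s*(\d*)\s*│\s*(?:[0-9a-f]+:)?\s*[↓↑]?\s*(\S.*)$")


def parse(listing):
    """Return a list of (count, instruction) pairs; a missing count is 0."""
    rows = []
    for line in listing.splitlines():
        match = LINE.match(line)
        if match:
            count = int(match.group(1)) if match.group(1) else 0
            rows.append((count, match.group(2).strip()))
    return rows


full = parse(FULL_HOISTING)
no_w = parse(NO_W_HOISTING)

# Pair instructions by position and keep the mnemonic of the first listing.
diffs = [(f_count - n_count, f_insn.split()[0])
         for (f_count, f_insn), (n_count, _) in zip(full, no_w)]
worst = max(diffs, key=lambda d: abs(d[0]))

for diff, mnemonic in diffs:
    print(f"{mnemonic:6s} {diff:+16d}")
print("largest difference:", worst[1], worst[0])
```

The same `parse` helper can be pointed at the whole loop nest to check where the remaining samples outside the innermost loop go.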