https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328
Diego Russo <Diego.Russo at arm dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |Diego.Russo at arm dot com
--- Comment #12 from Diego Russo <Diego.Russo at arm dot com> ---
Hello,
I was able to test Richard's patch and I'm glad to confirm that it brings the
benefit expected.
I built gcc with the patch and with it I compiled
https://github.com/Fidget-Spinner/cpython/tree/tail-call-gcc-2 branch that
implements the tail-calling interpreter.
I've also compiled a modified version of that branch which doesn't use the
preserve_none attribute.
We noticed improvements in the code generation. This is the version without
preserve_none
000000000060600c <_TAIL_CALL_BINARY_OP_ADD_INT>:
60600c: f85f0026 ldur x6, [x1, #-16]
606010: aa0103e5 mov x5, x1
606014: 900018e1 adrp x1, 922000 <PyList_Type+0x140>
606018: 91114021 add x1, x1, #0x450
60601c: aa0003e9 mov x9, x0
606020: f94004c7 ldr x7, [x6, #8]
606024: f9001c03 str x3, [x0, #56]
606028: eb0100ff cmp x7, x1
60602c: 540000a1 b.ne 606040
<_TAIL_CALL_BINARY_OP_ADD_INT+0x34> // b.any
606030: f85f80a8 ldur x8, [x5, #-8]
606034: f9400500 ldr x0, [x8, #8]
606038: eb07001f cmp x0, x7
60603c: 54000080 b.eq 60604c
<_TAIL_CALL_BINARY_OP_ADD_INT+0x40> // b.none
606040: aa0503e1 mov x1, x5
606044: aa0903e0 mov x0, x9
606048: 17fffd9a b 6056b0 <_TAIL_CALL_BINARY_OP>
60604c: a9bb7bfd stp x29, x30, [sp, #-80]!
606050: aa0803e1 mov x1, x8
606054: aa0603e0 mov x0, x6
606058: 910003fd mov x29, sp
60605c: a90153f3 stp x19, x20, [sp, #16]
606060: 91003073 add x19, x3, #0xc
606064: aa0203f4 mov x20, x2
606068: a90223e6 stp x6, x8, [sp, #32]
60606c: a90327e3 stp x3, x9, [sp, #48]
606070: f90023e5 str x5, [sp, #64]
606074: 97fb2ca4 bl 4d1304 <_PyLong_Add>
606078: a94223e6 ldp x6, x8, [sp, #32]
60607c: aa0003e4 mov x4, x0
606080: f94023e5 ldr x5, [sp, #64]
606084: a94327e3 ldp x3, x9, [sp, #48]
606088: b9400100 ldr w0, [x8]
60608c: 37f80340 tbnz w0, #31, 6060f4
<_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
606090: 51000400 sub w0, w0, #0x1
606094: b9000100 str w0, [x8]
606098: 350002e0 cbnz w0, 6060f4
<_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
60609c: 90001a20 adrp x0, 94a000 <stat_methods+0x78>
6060a0: 91176000 add x0, x0, #0x5d8
6060a4: f9544807 ldr x7, [x0, #10384]
6060a8: b4000167 cbz x7, 6060d4
<_TAIL_CALL_BINARY_OP_ADD_INT+0xc8>
6060ac: f9544c02 ldr x2, [x0, #10392]
6060b0: a9021be8 stp x8, x6, [sp, #32]
6060b4: aa0803e0 mov x0, x8
6060b8: 52800021 mov w1, #0x1 // #1
6060bc: f9001be4 str x4, [sp, #48]
6060c0: f90027e3 str x3, [sp, #72]
6060c4: d63f00e0 blr x7
6060c8: a9421be8 ldp x8, x6, [sp, #32]
6060cc: a94327e4 ldp x4, x9, [sp, #48]
6060d0: a9440fe5 ldp x5, x3, [sp, #64]
6060d4: aa0803e0 mov x0, x8
6060d8: a90213e6 stp x6, x4, [sp, #32]
6060dc: a90317e9 stp x9, x5, [sp, #48]
6060e0: f90023e3 str x3, [sp, #64]
6060e4: 97fb2c71 bl 4d12a8 <_PyLong_ExactDealloc>
6060e8: a94213e6 ldp x6, x4, [sp, #32]
6060ec: a94317e9 ldp x9, x5, [sp, #48]
6060f0: f94023e3 ldr x3, [sp, #64]
6060f4: b94000c0 ldr w0, [x6]
6060f8: 37f80300 tbnz w0, #31, 606158
<_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
6060fc: 51000400 sub w0, w0, #0x1
606100: b90000c0 str w0, [x6]
606104: 350002a0 cbnz w0, 606158
<_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
606108: 90001a20 adrp x0, 94a000 <stat_methods+0x78>
60610c: 91176000 add x0, x0, #0x5d8
606110: f9544807 ldr x7, [x0, #10384]
606114: b4000167 cbz x7, 606140
<_TAIL_CALL_BINARY_OP_ADD_INT+0x134>
606118: f9544c02 ldr x2, [x0, #10392]
60611c: a90213e6 stp x6, x4, [sp, #32]
606120: aa0603e0 mov x0, x6
606124: a90317e9 stp x9, x5, [sp, #48]
606128: 52800021 mov w1, #0x1 // #1
60612c: f90023e3 str x3, [sp, #64]
606130: d63f00e0 blr x7
606134: a94213e6 ldp x6, x4, [sp, #32]
606138: a94317e9 ldp x9, x5, [sp, #48]
60613c: f94023e3 ldr x3, [sp, #64]
606140: aa0603e0 mov x0, x6
606144: a90227e4 stp x4, x9, [sp, #32]
606148: a9030fe5 stp x5, x3, [sp, #48]
60614c: 97fb2c57 bl 4d12a8 <_PyLong_ExactDealloc>
606150: a94227e4 ldp x4, x9, [sp, #32]
606154: a9430fe5 ldp x5, x3, [sp, #48]
606158: b4000204 cbz x4, 606198
<_TAIL_CALL_BINARY_OP_ADD_INT+0x18c>
60615c: f81f00a4 stur x4, [x5, #-16]
606160: d0000c60 adrp x0, 794000
<builtin___import____doc__+0x80>
606164: 79401864 ldrh w4, [x3, #12]
606168: 910c0000 add x0, x0, #0x300
60616c: aa1403e2 mov x2, x20
606170: aa1303e3 mov x3, x19
606174: 12001c81 and w1, w4, #0xff
606178: a94153f3 ldp x19, x20, [sp, #16]
60617c: 53087c84 lsr w4, w4, #8
606180: f861d806 ldr x6, [x0, w1, sxtw #3]
606184: d10020a1 sub x1, x5, #0x8
606188: a8c57bfd ldp x29, x30, [sp], #80
60618c: aa0903e0 mov x0, x9
606190: aa0603f0 mov x16, x6
606194: d61f0200 br x16
606198: aa1303e3 mov x3, x19
60619c: aa1403e2 mov x2, x20
6061a0: a94153f3 ldp x19, x20, [sp, #16]
6061a4: d10040a1 sub x1, x5, #0x10
6061a8: a8c57bfd ldp x29, x30, [sp], #80
6061ac: aa0903e0 mov x0, x9
6061b0: 17ffe575 b 5ff784 <_TAIL_CALL_error.isra.0>
6061b4: d503201f nop
6061b8: d503201f nop
6061bc: d503201f nop
we can see the callee-save registers x19, x20 spilled to the stack. Similar
thing happens with the caller-save registers (x3, x4, x5, x9)
...
606144: a90227e4 stp x4, x9, [sp, #32]
606148: a9030fe5 stp x5, x3, [sp, #48]
60614c: 97fb2c57 bl 4d12a8 <_PyLong_ExactDealloc>
606150: a94227e4 ldp x4, x9, [sp, #32]
606154: a9430fe5 ldp x5, x3, [sp, #48]
...
This is the preserves_none output:
0000000000601be0 <_TAIL_CALL_BINARY_OP_ADD_INT>:
601be0: f85f0035 ldur x21, [x1, #-16]
601be4: aa0303f4 mov x20, x3
601be8: f9001c03 str x3, [x0, #56]
601bec: aa0103f3 mov x19, x1
601bf0: b0001901 adrp x1, 922000 <PyList_Type+0x140>
601bf4: 91114021 add x1, x1, #0x450
601bf8: f94006a3 ldr x3, [x21, #8]
601bfc: aa0003f7 mov x23, x0
601c00: eb01007f cmp x3, x1
601c04: 540000a1 b.ne 601c18
<_TAIL_CALL_BINARY_OP_ADD_INT+0x38> // b.any
601c08: f85f8276 ldur x22, [x19, #-8]
601c0c: f94006c0 ldr x0, [x22, #8]
601c10: eb03001f cmp x0, x3
601c14: 540000a0 b.eq 601c28
<_TAIL_CALL_BINARY_OP_ADD_INT+0x48> // b.none
601c18: aa1403e3 mov x3, x20
601c1c: aa1303e1 mov x1, x19
601c20: aa1703e0 mov x0, x23
601c24: 17fffe0a b 60144c <_TAIL_CALL_BINARY_OP>
601c28: a9bf7bfd stp x29, x30, [sp, #-16]!
601c2c: 2a0403f9 mov w25, w4
601c30: aa0203f8 mov x24, x2
601c34: 910003fd mov x29, sp
601c38: aa1603e1 mov x1, x22
601c3c: aa1503e0 mov x0, x21
601c40: 97fb3db1 bl 4d1304 <_PyLong_Add>
601c44: aa0003fa mov x26, x0
601c48: b94002c0 ldr w0, [x22]
601c4c: 9100329b add x27, x20, #0xc
601c50: 37f801c0 tbnz w0, #31, 601c88
<_TAIL_CALL_BINARY_OP_ADD_INT+0xa8>
601c54: 51000400 sub w0, w0, #0x1
601c58: b90002c0 str w0, [x22]
601c5c: 35000160 cbnz w0, 601c88
<_TAIL_CALL_BINARY_OP_ADD_INT+0xa8>
601c60: b0001a40 adrp x0, 94a000 <stat_methods+0x78>
601c64: 91176000 add x0, x0, #0x5d8
601c68: f9544803 ldr x3, [x0, #10384]
601c6c: b40000a3 cbz x3, 601c80
<_TAIL_CALL_BINARY_OP_ADD_INT+0xa0>
601c70: f9544c02 ldr x2, [x0, #10392]
601c74: 52800021 mov w1, #0x1 // #1
601c78: aa1603e0 mov x0, x22
601c7c: d63f0060 blr x3
601c80: aa1603e0 mov x0, x22
601c84: 97fb3d89 bl 4d12a8 <_PyLong_ExactDealloc>
601c88: b94002a0 ldr w0, [x21]
601c8c: 37f801c0 tbnz w0, #31, 601cc4
<_TAIL_CALL_BINARY_OP_ADD_INT+0xe4>
601c90: 51000400 sub w0, w0, #0x1
601c94: b90002a0 str w0, [x21]
601c98: 35000160 cbnz w0, 601cc4
<_TAIL_CALL_BINARY_OP_ADD_INT+0xe4>
601c9c: b0001a40 adrp x0, 94a000 <stat_methods+0x78>
601ca0: 91176000 add x0, x0, #0x5d8
601ca4: f9544803 ldr x3, [x0, #10384]
601ca8: b40000a3 cbz x3, 601cbc
<_TAIL_CALL_BINARY_OP_ADD_INT+0xdc>
601cac: f9544c02 ldr x2, [x0, #10392]
601cb0: 52800021 mov w1, #0x1 // #1
601cb4: aa1503e0 mov x0, x21
601cb8: d63f0060 blr x3
601cbc: aa1503e0 mov x0, x21
601cc0: 97fb3d7a bl 4d12a8 <_PyLong_ExactDealloc>
601cc4: b40001fa cbz x26, 601d00
<_TAIL_CALL_BINARY_OP_ADD_INT+0x120>
601cc8: 79401a84 ldrh w4, [x20, #12]
601ccc: b0000c80 adrp x0, 792000 <builtin_all__doc__+0x20>
601cd0: 91110000 add x0, x0, #0x440
601cd4: aa1b03e3 mov x3, x27
601cd8: 12001c81 and w1, w4, #0xff
601cdc: aa1803e2 mov x2, x24
601ce0: a8c17bfd ldp x29, x30, [sp], #16
601ce4: f81f027a stur x26, [x19, #-16]
601ce8: f861d805 ldr x5, [x0, w1, sxtw #3]
601cec: 53087c84 lsr w4, w4, #8
601cf0: d1002261 sub x1, x19, #0x8
601cf4: aa1703e0 mov x0, x23
601cf8: aa0503f0 mov x16, x5
601cfc: d61f0200 br x16
601d00: a8c17bfd ldp x29, x30, [sp], #16
601d04: 2a1903e4 mov w4, w25
601d08: aa1b03e3 mov x3, x27
601d0c: aa1803e2 mov x2, x24
601d10: d1004261 sub x1, x19, #0x10
601d14: aa1703e0 mov x0, x23
601d18: 17fff672 b 5ff6e0 <_TAIL_CALL_error>
601d1c: d503201f nop
In the preserves_none output, it can move things like x3-5 and v9 above into
call-preserved registers (x21+) without penalty. So the stores before the call
become moves (which can be handled using renaming) and the code after the call
can use those registers directly (so the loads disappear entirely).
Thanks Richard for double checking the output and for the explanation.
I've also run pyperformance benchmark suite against the two versions of
CPython. The preserve_none implementation brings a 4% performance improvement
(as geometric mean).
Almost all benchmarks show performance improvements, up to 16%.
Please, schedule this feature to be fully implemented in GCC as the CPython
project will need it (the PR for tail-calling interpreter has been merged
already - just for clang for now).
Thanks