Hi,

It has come to my attention that we generate suboptimal code for the
`__call_stub_fp_' variant of the MIPS16 call stubs.  These stubs are
used for outgoing calls from MIPS16 code to standard MIPS code where the
callee returns a floating-point result; such calls may also pass
floating-point arguments.

The problem can be illustrated with this small program:
$ cat foo.c
double bar (double d);

int main (int argc, char **argv)
{
  return bar (argc);
}
$ mips-linux-gnu-gcc -O2 -mips16 -c foo.c
$ mips-linux-gnu-objdump -dr foo.o

foo.o:     file format elf32-tradbigmips

Disassembly of section .mips16.call.fp.bar:

00000000 <__call_stub_fp_bar>:
   0:   44856000        mtc1    a1,$f12
   4:   44846800        mtc1    a0,$f13
   8:   03e09021        move    s2,ra
   c:   0c000000        jal     0 <__call_stub_fp_bar>
                        c: R_MIPS_26    bar
  10:   00000000        nop
  14:   44030000        mfc1    v1,$f0
  18:   02400008        jr      s2
  1c:   44020800        mfc1    v0,$f1

Disassembly of section .text.startup:

00000000 <main>:
   0:   f100 64c4       save    32,ra,s2
   4:   1800 0000       jal     0 <main>
                        4: R_MIPS16_26  __mips16_floatsidf
   8:   6500            nop
   a:   67a3            move    a1,v1
   c:   1800 0000       jal     0 <main>
                        c: R_MIPS16_26  bar
  10:   6782            move    a0,v0
  12:   67a3            move    a1,v1
  14:   1800 0000       jal     0 <main>
                        14: R_MIPS16_26 __mips16_fix_truncdfsi
  18:   6782            move    a0,v0
  1a:   f100 6444       restore 32,ra,s2
  1e:   e8a0            jrc     ra

-- as you can see in `__call_stub_fp_bar' above, the JAL delay slot
remains empty.  The move to $s2 cannot be scheduled there because it
reads $ra, which the JAL instruction itself writes.

However, there is no need to save $ra last.  Moreover, since the MIPS IV
ISA there are no coprocessor move delay slots, so the last MTC1
instruction can be scheduled there instead.
So here is a change to avoid this empty delay slot. With this change
in place we get this code instead:
$ mips-linux-gnu-objdump -dr foo.o

foo.o:     file format elf32-tradbigmips

Disassembly of section .mips16.call.fp.bar:

00000000 <__call_stub_fp_bar>:
   0:   03e09021        move    s2,ra
   4:   44856000        mtc1    a1,$f12
   8:   0c000000        jal     0 <__call_stub_fp_bar>
                        8: R_MIPS_26    bar
   c:   44846800        mtc1    a0,$f13
  10:   44030000        mfc1    v1,$f0
  14:   02400008        jr      s2
  18:   44020800        mfc1    v0,$f1

Disassembly of section .text.startup:

00000000 <main>:
   0:   f100 64c4       save    32,ra,s2
   4:   1800 0000       jal     0 <main>
                        4: R_MIPS16_26  __mips16_floatsidf
   8:   6500            nop
   a:   67a3            move    a1,v1
   c:   1800 0000       jal     0 <main>
                        c: R_MIPS16_26  bar
  10:   6782            move    a0,v0
  12:   67a3            move    a1,v1
  14:   1800 0000       jal     0 <main>
                        14: R_MIPS16_26 __mips16_fix_truncdfsi
  18:   6782            move    a0,v0
  1a:   f100 6444       restore 32,ra,s2
  1e:   e8a0            jrc     ra

-- as you can see, the last MTC1 instruction has now been placed in the
JAL delay slot.

For ISAs that do have coprocessor move delay slots there is no gain,
but no loss either: both delay slots remain empty because of the data
dependencies those coprocessor move delay slots impose:
$ mips-linux-gnu-gcc -O2 -mips1 -mips16 -c foo.c
$ mips-linux-gnu-objdump -dr foo.o

foo.o:     file format elf32-tradbigmips

Disassembly of section .mips16.call.fp.bar:

00000000 <__call_stub_fp_bar>:
   0:   03e09021        move    s2,ra
   4:   44856000        mtc1    a1,$f12
   8:   44846800        mtc1    a0,$f13
   c:   0c000000        jal     0 <__call_stub_fp_bar>
                        c: R_MIPS_26    bar
  10:   00000000        nop
  14:   44030000        mfc1    v1,$f0
  18:   44020800        mfc1    v0,$f1
  1c:   02400008        jr      s2
  20:   00000000        nop

Disassembly of section .text.startup:

00000000 <main>:
   0:   63fc            addiu   sp,-32
   2:   6772            move    v1,s2
   4:   6207            sw      ra,28(sp)
   6:   1800 0000       jal     0 <main>
                        6: R_MIPS16_26  __mips16_floatsidf
   a:   d306            sw      v1,24(sp)
   c:   67a3            move    a1,v1
   e:   1800 0000       jal     0 <main>
                        e: R_MIPS16_26  bar
  12:   6782            move    a0,v0
  14:   67a3            move    a1,v1
  16:   1800 0000       jal     0 <main>
                        16: R_MIPS16_26 __mips16_fix_truncdfsi
  1a:   6782            move    a0,v0
  1c:   9606            lw      a2,24(sp)
  1e:   9707            lw      a3,28(sp)
  20:   6556            move    s2,a2
  22:   ef00            jr      a3
  24:   6304            addiu   sp,32
  26:   6500            nop
-- though I think this consideration is actually academic as no