On Sat, Mar 22, 2014 at 03:25:42PM +0100, David Kuehling wrote:
> >>>>> "Bernd" == Bernd Paysan <bernd.pay...@gmx.de> writes:
>
> > On Saturday, 22 March 2014, 07:24:55, David Kuehling wrote:
> >> I'm using a recent gforth revision from git (6ec9915f6277de) and
> >> noticed that running gforth --dynamic produces pretty extreme
> >> performance degradation [..]
>
> > How does this affect other microbenchmarks, e.g. onebench.fs?  And:
> > SEE-CODE <word> shows the dynamically generated code; could you
> > provide that for the microbenchmark above?
>
> Ahh, SEE-CODE does a nice job.  The disassembly for the full
> code-sequences of my recursive micro-benchmark for gforth-fast with and
> w/o --dynamic is listed below.  Looks like there is a problem with the
> CALL code sequence generated for calls into colon-definitions:
>
> gforth-fast --dynamic
> : test1 ;
> : test2 test1 ;
> see-code test2
>
> $2BB725B0 call
> $2BB725B4 <test1>
> ( $2BFC9FA8 ) 3 16 0 addu,
> ( $2BFC9FAC ) 16 0 16 lw,
> ( $2BFC9FB0 ) 2 18 0 addu,
> ( $2BFC9FB4 ) 3 3 4 addiu,
> ( $2BFC9FB8 ) 18 18 -4 addiu,
> ( $2BFC9FBC ) 16 16 4 addiu,
> ( $2BFC9FC0 ) 3 -4 2 sw,
> ( $2BFC9FC4 ) 2 -4 16 lw,
> ( $2BFC9FC8 ) $7C03E83B , ( illegal inst )
> ( $2BFC9FCC ) 4 -32680 28 lw,
> ( $2BFC9FD0 ) 30 3 0 addu,
> ( $2BFC9FD4 ) 4 4 30 addu,
> ( $2BFC9FD8 ) 3 2 0 addu,
> ( $2BFC9FDC ) 4 256 29 sw,
> ( $2BFC9FE0 ) 3 jr,
> ( $2BFC9FE4 ) 1 1 0 or,
> $2BB725B8 ;s ok
>
> Compare this against the disassembly of CALL:
> see call:
>
> Code call
> ( $403C34 ) 3 16 0 addu,
> ( $403C38 ) 16 0 16 lw,
> ( $403C3C ) 2 18 0 addu,
> ( $403C40 ) 3 3 4 addiu,
> ( $403C44 ) 18 18 -4 addiu,
> ( $403C48 ) 16 16 4 addiu,
> ( $403C4C ) 3 -4 2 sw,
> ( $403C50 ) 2 -4 16 lw,
> ( $403C54 ) 3 2 0 addu,
> ( $403C58 ) 3 jr,
> ( $403C5C ) 1 1 0 or,
> end-code
>
> Instead of NEXT the code in test2 holds some nonsense, starting with
> invalid instruction $7C03E83B .

Looks to me like the dispatch code that is appended to the code for
the call does some additional stuff (in this particular case we could
actually take the whole CALL instead of cutting the dispatch part off
and appending some other dispatch code, but some gcc versions replace
the NEXT at the end of the word with a direct jump to dispatch code,
and then that does not work).

You can look at where the parts are with

gforth-fast --debug

The output contains something like:

Compiled with gcc-4.3.2
goto * 0x804b539 0x804fec9 len=12
...
call 0-0 11 0x804b6f0 0x80501f0 0x804b6f0 len= 26 rest= 3 send=1

This means that the fragment appended every time there is something
other than a fall-through is 12 bytes long, whereas the normal code is
3 bytes long (IIRC what the numbers mean).  You can find the appended
fragment at 0x804b539 (or 0x804fec9), and the code for CALL at
0x804b6f0, 0x80501f0, or 0x804b6f0.  And for the machine where I did
this, the code generated by the dynamic call is also longer than the
CALL code itself.

Maybe we need another round of playing around with gcc to find out how
to make it produce a short copyable "goto *".  Or we might change the
copying code to copy the whole thing if the NEXT part is relocatable.

> Don't know why that instruction doesn't
> SIGILL, but maybe it's a non-standard/undocumented instruction on
> Loongson2f.  The binutils also don't know anything about that opcode:
>
> echo -e "\x3b\xe8\x03\x7c" > /tmp/inst
> objdump -D -EL -b binary -m mips:loongson_2f /tmp/inst
> [..]
> 0: 7c03e83b 0x7c03e83b

Since this code was originally generated with gas, the binutils do
know about this.  To use gdb (i.e., binutils) for the disassembly, use

' disasm-gdb is discode

before doing the SEE-CODE or SEE.

- anton
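
As a cross-check on the word that objdump could not name, here is a minimal sketch that splits $7C03E83B into the standard MIPS R-type fields. It is not part of gforth or binutils. The helper name decode_rtype is made up for illustration, and the little-endian byte order is an assumption taken from the -EL flag in the quoted objdump call.

```python
import struct

# Bytes exactly as written by the echo command in the quoted message.
raw = b"\x3b\xe8\x03\x7c"

# Assumption: little-endian, matching the -EL flag passed to objdump.
(word,) = struct.unpack("<I", raw)


def decode_rtype(w):
    """Split a 32-bit MIPS word into the standard R-type fields.

    Hypothetical helper for illustration only.
    """
    return {
        "opcode": (w >> 26) & 0x3F,  # bits 31..26
        "rs": (w >> 21) & 0x1F,      # bits 25..21
        "rt": (w >> 16) & 0x1F,      # bits 20..16
        "rd": (w >> 11) & 0x1F,      # bits 15..11
        "shamt": (w >> 6) & 0x1F,    # bits 10..6
        "funct": w & 0x3F,           # bits 5..0
    }


fields = decode_rtype(word)
print(f"word   = {word:#010x}")
for name, value in fields.items():
    print(f"{name:6} = {value:#04x}")
```

The fields come out as opcode $1F, rt 3, rd 29 and funct $3B, with rs and shamt zero. In the MIPS32 Release 2 tables, opcode $1F is SPECIAL3 and funct $3B is RDHWR, so this reads as "rdhwr $3, $29", the usual way to fetch the thread pointer. If the kernel emulates that instruction on a CPU lacking it, that would fit both the missing SIGILL and the slowdown, but nothing in this thread confirms it for Loongson2f.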