felix.winkelm...@bevuta.com wrote on 2021-11-04:
> > 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> > 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> >
> > [...]
> >
> > It would be great to get some help or explanation with this issue.
>
> Hi!
>
> I have similar timings and the difference in the number of minor GC indicates
> that the c99-fma variant allocates more stack space and thus causes more
> minor GCs.
>
> Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
> intermediate result and thus generates relatively decent code:
>
> /* scm-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_187(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> double f0;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_187,c,av);}
> a=C_alloc(4);
> f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
> t5=t1;{
> C_word *av2=av;
> av2[0]=t5;
> av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
> ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}
>
> The other version allocates a bytevector to hold the result:
>
> /* c99-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(6);
> t5=C_a_i_bytevector(&a,1,C_fix(4));
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21(t5,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}
>
> I thought that the allocation of 4 words for the bytevector (which is more
> than needed on a 64 bit machine) makes the difference, but it turns out to be
> negligible.
> Changing it to 2 and also adjusting the values for C_calculate_demand and
> C_alloc doesn't seem to change a lot, but you may want to try that -
> just modify the C code and compile it with the same options as the .scm file.
>
> On my laptop fma is a library call, so currently my guess is simply that
> the scm-fma code is tighter and avoids 3 additional function calls (one to
> the stub, one to C_a_i_bytevector and one to fma). The increased number of
> GCs may also be caused by the bytevector above, which is used as a
> placeholder for the flonum result, which wastes one word.
>
> There is room for improvement for the compiler, though: the C_fix(4) is overly
> conservative (4 words are correct on 32-bit, taking care of flonum alignment,
> but unnecessary on 64 bits). Also, the bytevector thing is a bit of a hack - we
> could actually just pass "a" to stub21 directly. You may want to try this out:
>
> /* c99-fma in k183 in k180 in k177 in k174 (modified) */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(4);
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21((C_word)a,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}
>
> This reduces minor GCs on my machine to roughly the same. If your
> compiler inlines stub21 and fma, then you should see comparable performance.
> Also, default optimization-levels for C are -Os (pass -v to csc to see what is
> passed to the C compiler), so using -O2 instead should make a difference.
>
> felix

Dear Felix,

thank you for your explanations.
I tested your modified source, and indeed the number of GCs is significantly
reduced, but the timing difference remains:

original code:
7.656s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.849s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

modified code:
7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB

Both were compiled with gcc at the -O3 optimization level.

I am fine with these results, given the internals you laid out as background.
Would it be conceivable to include such fma functionality directly in
chicken.flonum, e.g. as fp+*, or are the included modules typically left
unaltered?

Thank you,
Christian