On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurel...@aurel32.net> wrote:
> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurel...@aurel32.net> wrote:
>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
>> >> Here's a RFC series for eliminating AREG0.
>> >>
>> >> Blue Swirl (11):
>> >>   Move user emulator stuff from cpu-exec.c to user-exec.c
>> >>   Delete unused tb_invalidate_page_range
>> >>
>> >> The above should be OK to commit.
>> >>
>> >>   cpu_loop_exit: avoid using AREG0
>> >>   Delegate setup of TCG temporaries to targets
>> >>
>> >> These two are not, unless the overall plan is OK.
>> >>
>> >>   TCG: fix negative frame offset calculations
>> >>   TCG/x86: use stack for TCG temps
>> >>   TCG/Sparc64: use stack for TCG temps
>> >>
>> >> But these three should be OK. I've tested lightly x86_64 and Sparc64
>> >> hosts.
>> >>
>> >>   Add CONFIG_TARGET_NEEDS_AREG0
>> >>   Don't compile legacy qemu_ld/st functions if target doesn't need them
>> >>
>> >> Should be OK, though the latter patch only touches x86.
>> >>
>> >>   Add new qemu_ld and qemu_st functions
>> >>   sparc: use new qemu_ld and qemu_st functions
>> >>
>> >> The last two compile but QEMU segfaults. I just made a naive
>> >> conversion for getting comments.
>> >>
>> >
>> > What is the goal behing removing TCG_AREG0? If it is speed improvement,
>> > can you please provide some benchmarks?
>>
>> There was some discussion earlier about why this (or parts of the
>> conversion) may be a speed improvement:
>>
>> http://article.gmane.org/gmane.comp.emulators.qemu/101826
>> http://article.gmane.org/gmane.comp.emulators.qemu/102156
>
> Ok, looks like I have missed that.
>
>> There are no benchmarks yet. In fact, it may be difficult to make
>> those without performing the removal completely.
>>
>> For example, patch 3/11 makes cpu_loop_exit take CPUState as a
>> parameter instead of using global env, which would be available and
>> the register is reserved anyway. This would only decrease performance
>> at this stage, unless a complete conversion is done. I suspect the
>> same would happen when moving all helpers from op_helper.c to
>> helper.c. But after the whole conversion, this would be a neutral (no
>> extra registers used) or even beneficial change (the code is free to
>> use one more register).
>>
>> > The env register is used very often (basically for every load/store, but
>> > also a lot of helpers), so it makes sense to reserve a register for it.
>> >
>> > For what I understand from your patch series, you prefer to pass this
>> > register explicitly to TCG functions. This basically means this TCG
>> > global will be loaded to host register as soon as it is used, but also
>> > regularly, as globals are saved back to their canonical location before
>> > an helper or a load/store.
>> >
>> > So it seems that this patch series will just allowing the "env register"
>> > to change over time, though it will not spare one more register for the
>> > TCG code, and it will emit longer TCG code to regularly reload the env
>> > global into a host register.
>>
>> But there will be one more register available in some cases. In other
>
> Inside the TCG code, it will basically happens very rarely, given
> load/store are really the most used instructions, and they need to load
> the env register.

Not exactly. From a sample run with -d op_opt:

$ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
1673966 movi_i32
 653931 ld_i32
 607432 mov_i32
 428684 st_i32
 326878 movi_i64
 308626 add_i32
 283186 call
 256817 exit_tb
 207232 nopn
 189388 goto_tb
 122398 and_i32
 117997 shr_i32
  89107 qemu_ld32
  82926 set_label
  82713 brcond_i32
  67169 qemu_st32
  55109 or_i32
  46536 ext32u_i64
  44288 xor_i32
  38103 sub_i32
  26361 shl_i32
  23218 shl_i64
  23218 qemu_st64
  23218 or_i64
  20474 shr_i64
  20445 qemu_ld64
  11161 qemu_ld8u
  10409 qemu_st8
   5013 qemu_ld16u
   3795 qemu_st16
   2776 qemu_ld8s
   1915 sar_i32
   1414 qemu_ld16s
    839 not_i32
    579 setcond_i32
    213 br
     42 ext32s_i64
     30 mul_i64

But most other ops probably don't need any additional registers. It
could still be that with the extra register, some values could be kept
there instead of being flushed to storage.

>> cases, the number of registers used does not change. Moving the
>> registers around is what worries me too.
>>
>> But there are other effects too, the helpers are now compiled so that
>> the global env register is not used. Especially on hosts with low
>> number of registers this is not optimal.
>
> Most helpers are very small functions, so I am not sure more registers
> will help.

$ nm --print-size --defined-only --size-sort --reverse-sort obj-amd64/sparc-softmmu/op_helper.o |head -20
0000000000003a70 0000000000000c75 T helper_st_asi
00000000000046f0 000000000000086d T helper_ld_asi
0000000000002550 000000000000041b T __stw_mmu
00000000000012c0 000000000000040c T __stq_mmu
0000000000001d10 00000000000003c5 T __stl_mmu
0000000000001a90 0000000000000277 T __ldq_mmu
00000000000022f0 000000000000025f T __ldl_mmu
0000000000002b50 000000000000024f T __ldw_mmu
0000000000001870 0000000000000219 t slow_ldq_mmu
00000000000020e0 0000000000000208 t slow_ldl_mmu
00000000000033f0 0000000000000202 T helper_stqf
0000000000002970 00000000000001de t slow_ldw_mmu
0000000000005120 00000000000001dd T helper_fcmpeq
0000000000003720 00000000000001c7 T helper_ldqf
0000000000001120 000000000000019e t slow_stb_mmu
00000000000016d0 000000000000019c T __stb_mmu
0000000000002da0 000000000000018a t slow_ldb_mmu
0000000000005410 0000000000000172 T helper_fcmped
0000000000002f30 000000000000016d T __ldb_mmu
0000000000000a70 0000000000000164 T do_unassigned_access

These are not so small, but they also don't look like frequently used
ones, judging by the names. It would be more interesting to see the
sizes of heavily used helpers.

>> > In any case at then end benchmarks is what are need to decided, TCG has
>> > always shown that performance improvements doesn't match the improvement
>> > analysis.
>>
>> If this turns out to be a bad idea, it means that the reverse
>> conversion will be beneficial and we should convert helper.c code to
>> op_helper.c and take advantage of the global env in a register. This
>> was actually my standpoint when the discussion started, but I'm
>> interested to see if this approach would work better.
>
> Does it mean that you plan to do the code changes in git without really
> benchmarking, and revert all the changes later if it was a bad idea?

I don't think that that would be a good idea.
There are actually several possible plans that should be considered,
based on the performance:
- don't do anything
- full conversion
- avoid AREG0 use outside of generated code: move helpers from
  op_helper.c to helper.c
- avoid AREG0 in cpu-exec.c etc.
- maximize AREG0 use: move helpers from helper.c etc. to op_helper.c

It's also possible to refactor qemu_ld/st independently of the above
so that they still use AREG0.

The performance-neutral changes (patches 1, 2, 4 to 7) may have merits
of their own, so they may be applied later if there are no objections.
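
As a rough check on the op counts quoted above, here is a minimal Python sketch (not part of the patch series; the variable names are made up for illustration) that totals the histogram and works out what share of the logged ops are guest loads/stores (qemu_ld*/qemu_st*). The numbers are copied from the listing; note that they count ops in the translated code as logged by -d op_opt, not how often each op is actually executed.

```python
# Op counts copied from the "uniq -c" listing earlier in this mail.
OP_COUNTS = """
1673966 movi_i32   653931 ld_i32     607432 mov_i32    428684 st_i32
 326878 movi_i64   308626 add_i32    283186 call       256817 exit_tb
 207232 nopn       189388 goto_tb    122398 and_i32    117997 shr_i32
  89107 qemu_ld32   82926 set_label   82713 brcond_i32  67169 qemu_st32
  55109 or_i32      46536 ext32u_i64  44288 xor_i32     38103 sub_i32
  26361 shl_i32     23218 shl_i64     23218 qemu_st64   23218 or_i64
  20474 shr_i64     20445 qemu_ld64   11161 qemu_ld8u   10409 qemu_st8
   5013 qemu_ld16u   3795 qemu_st16    2776 qemu_ld8s    1915 sar_i32
   1414 qemu_ld16s    839 not_i32       579 setcond_i32   213 br
     42 ext32s_i64     30 mul_i64
"""

# The text alternates "count name", so pair up even and odd tokens.
tokens = OP_COUNTS.split()
counts = {name: int(n) for n, name in zip(tokens[0::2], tokens[1::2])}

total = sum(counts.values())

# Guest memory accesses: every op whose name starts with qemu_ld or qemu_st.
guest_mem = sum(
    n for name, n in counts.items() if name.startswith(("qemu_ld", "qemu_st"))
)

share = guest_mem / total
print(f"qemu_ld/st ops: {guest_mem} of {total} ({share:.1%})")
```

With these figures the qemu_ld/qemu_st ops come to about 4% of all logged ops, which is the basis for the "Not exactly" above.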