Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

Blue Swirl Sun, 15 May 2011 00:16:38 -0700

On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurel...@aurel32.net> wrote:
> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurel...@aurel32.net> 
>> wrote:
>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
>> >> Here's a RFC series for eliminating AREG0.
>> >>
>> >> Blue Swirl (11):
>> >>   Move user emulator stuff from cpu-exec.c to user-exec.c
>> >>   Delete unused tb_invalidate_page_range
>> >>
>> >> The above should be OK to commit.
>> >>
>> >>   cpu_loop_exit: avoid using AREG0
>> >>   Delegate setup of TCG temporaries to targets
>> >>
>> >> These two are not, unless the overall plan is OK.
>> >>
>> >>   TCG: fix negative frame offset calculations
>> >>   TCG/x86: use stack for TCG temps
>> >>   TCG/Sparc64: use stack for TCG temps
>> >>
>> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 
>> >> hosts.
>> >>
>> >>   Add CONFIG_TARGET_NEEDS_AREG0
>> >>   Don't compile legacy qemu_ld/st functions if target doesn't need them
>> >>
>> >> Should be OK, though the latter patch only touches x86.
>> >>
>> >>   Add new qemu_ld and qemu_st functions
>> >>   sparc: use new qemu_ld and qemu_st functions
>> >>
>> >> The last two compile but QEMU segfaults. I just made a naive
>> >> conversion for getting comments.
>> >>
>> >
>> > What is the goal behing removing TCG_AREG0? If it is speed improvement,
>> > can you please provide some benchmarks?
>>
>> There was some discussion earlier about why this (or parts of the
>> conversion) may be a speed improvement:
>>
>> http://article.gmane.org/gmane.comp.emulators.qemu/101826
>> http://article.gmane.org/gmane.comp.emulators.qemu/102156
>
> Ok, looks like I have missed that.
>
>> There are no benchmarks yet. In fact, it may be difficult to make
>> those without performing the removal completely.
>>
>> For example, patch 3/11 makes cpu_loop_exit take CPUState as a
>> parameter instead of using global env, which would be available and
>> the register is reserved anyway. This would only decrease performance
>> at this stage, unless a complete conversion is done. I suspect the
>> same would happen when moving all helpers from op_helper.c to
>> helper.c. But after the whole conversion, this would be a neutral (no
>> extra registers used) or even beneficial change (the code is free to
>> use one more register).
>>
>> > The env register is used very often (basically for every load/store, but
>> > also a lot of helpers), so it makes sense to reserve a register for it.
>> >
>> > For what I understand from your patch series, you prefer to pass this
>> > register explicitly to TCG functions. This basically means this TCG
>> > global will be loaded to host register as soon as it is used, but also
>> > regularly, as globals are saved back to their canonical location before
>> > an helper or a load/store.
>> >
>> > So it seems that this patch series will just allowing the "env register"
>> > to change over time, though it will not spare one more register for the
>> > TCG code, and it will emit longer TCG code to regularly reload the env
>> > global into a host register.
>>
>> But there will be one more register available in some cases. In other
>
> Inside the TCG code, it will basically happens very rarely, given
> load/store are really the most used instructions, and they need to load
> the env register.


Not exactly, from a sample run with -d op_opt:
$ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
/tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
1673966 movi_i32
 653931 ld_i32
 607432 mov_i32
 428684 st_i32
 326878 movi_i64
 308626 add_i32
 283186 call
 256817 exit_tb
 207232 nopn
 189388 goto_tb
 122398 and_i32
 117997 shr_i32
  89107 qemu_ld32
  82926 set_label
  82713 brcond_i32
  67169 qemu_st32
  55109 or_i32
  46536 ext32u_i64
  44288 xor_i32
  38103 sub_i32
  26361 shl_i32
  23218 shl_i64
  23218 qemu_st64
  23218 or_i64
  20474 shr_i64
  20445 qemu_ld64
  11161 qemu_ld8u
  10409 qemu_st8
   5013 qemu_ld16u
   3795 qemu_st16
   2776 qemu_ld8s
   1915 sar_i32
   1414 qemu_ld16s
    839 not_i32
    579 setcond_i32
    213 br
     42 ext32s_i64
     30 mul_i64

But most other ops probably don't need any additional registers. It
could still be that with the extra register, some values could be kept
there instead of flushing to storage.

>> cases, the number of registers used does not change. Moving the
>> registers around is what worries me too.
>>
>> But there are other effects too, the helpers are now compiled so that
>> the global env register is not used. Especially on hosts with low
>> number of registers this is not optimal.
>
> Most helpers are very small functions, so I am not sure more registers
> will help.

$ nm --print-size --defined-only --size-sort --reverse-sort
obj-amd64/sparc-softmmu/op_helper.o |head -20
0000000000003a70 0000000000000c75 T helper_st_asi
00000000000046f0 000000000000086d T helper_ld_asi
0000000000002550 000000000000041b T __stw_mmu
00000000000012c0 000000000000040c T __stq_mmu
0000000000001d10 00000000000003c5 T __stl_mmu
0000000000001a90 0000000000000277 T __ldq_mmu
00000000000022f0 000000000000025f T __ldl_mmu
0000000000002b50 000000000000024f T __ldw_mmu
0000000000001870 0000000000000219 t slow_ldq_mmu
00000000000020e0 0000000000000208 t slow_ldl_mmu
00000000000033f0 0000000000000202 T helper_stqf
0000000000002970 00000000000001de t slow_ldw_mmu
0000000000005120 00000000000001dd T helper_fcmpeq
0000000000003720 00000000000001c7 T helper_ldqf
0000000000001120 000000000000019e t slow_stb_mmu
00000000000016d0 000000000000019c T __stb_mmu
0000000000002da0 000000000000018a t slow_ldb_mmu
0000000000005410 0000000000000172 T helper_fcmped
0000000000002f30 000000000000016d T __ldb_mmu
0000000000000a70 0000000000000164 T do_unassigned_access

These are not so small, but they also don't look like frequently used
ones, judging by the names. It would be more interesting to see the
sizes of heavily used helpers.

>> > In any case at then end benchmarks is what are need to decided, TCG has
>> > always shown that performance improvements doesn't match the improvement
>> > analysis.
>>
>> If this turns out to be a bad idea, it means that the reverse
>> conversion will be beneficial and we should convert helper.c code to
>> op_helper.c and take advantage of the global env in a register. This
>> was actually my standpoint when the discussion started, but I'm
>> interested to see if this approach would work better.
>
> Does it mean that you plan to do the code changes in git without really
> benchmarking, and revert all the changes later if it was a bad idea?

I don't think that that would be a good idea. There are actually
several possible plans that should be considered, based on the
performance:
- don't do anything
- full conversion
- avoid AREG0 use outside of generated code: move helpers from
op_helper.c to helper.c
- avoid AREG0 in cpu-exec.c etc.
- maximize AREG0 use: move helpers from helper.c etc. to op_helper.c

It's also possible to refactor qemu_ld/st independently to the above
so that they still use AREG0.

The performance neutral changes (1, 2, 4 to 7) may have merits of
their own, so they may be applied later if there are no objections.

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

Reply via email to