Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
 wrote:
  On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
  Here's a RFC series for eliminating AREG0.
 
  Blue Swirl (11):
    Move user emulator stuff from cpu-exec.c to user-exec.c
    Delete unused tb_invalidate_page_range
 
  The above should be OK to commit.
 
    cpu_loop_exit: avoid using AREG0
    Delegate setup of TCG temporaries to targets
 
  These two are not, unless the overall plan is OK.
 
    TCG: fix negative frame offset calculations
    TCG/x86: use stack for TCG temps
    TCG/Sparc64: use stack for TCG temps
 
  But these three should be OK. I've tested lightly x86_64 and Sparc64 
  hosts.
 
    Add CONFIG_TARGET_NEEDS_AREG0
    Don't compile legacy qemu_ld/st functions if target doesn't need them
 
  Should be OK, though the latter patch only touches x86.
 
    Add new qemu_ld and qemu_st functions
    sparc: use new qemu_ld and qemu_st functions
 
  The last two compile but QEMU segfaults. I just made a naive
  conversion for getting comments.
 
 
  What is the goal behing removing TCG_AREG0? If it is speed improvement,
  can you please provide some benchmarks?

 There was some discussion earlier about why this (or parts of the
 conversion) may be a speed improvement:

 http://article.gmane.org/gmane.comp.emulators.qemu/101826
 http://article.gmane.org/gmane.comp.emulators.qemu/102156

 Ok, looks like I have missed that.

 There are no benchmarks yet. In fact, it may be difficult to make
 those without performing the removal completely.

 For example, patch 3/11 makes cpu_loop_exit take CPUState as a
 parameter instead of using global env, which would be available and
 the register is reserved anyway. This would only decrease performance
 at this stage, unless a complete conversion is done. I suspect the
 same would happen when moving all helpers from op_helper.c to
 helper.c. But after the whole conversion, this would be a neutral (no
 extra registers used) or even beneficial change (the code is free to
 use one more register).

  The env register is used very often (basically for every load/store, but
  also a lot of helpers), so it makes sense to reserve a register for it.
 
  For what I understand from your patch series, you prefer to pass this
  register explicitly to TCG functions. This basically means this TCG
  global will be loaded to host register as soon as it is used, but also
  regularly, as globals are saved back to their canonical location before
  an helper or a load/store.
 
  So it seems that this patch series will just allowing the env register
  to change over time, though it will not spare one more register for the
  TCG code, and it will emit longer TCG code to regularly reload the env
  global into a host register.

 But there will be one more register available in some cases. In other

 Inside the TCG code, it will basically happens very rarely, given
 load/store are really the most used instructions, and they need to load
 the env register.

Not exactly, from a sample run with -d op_opt:
$ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
/tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
1673966 movi_i32
 653931 ld_i32
 607432 mov_i32
 428684 st_i32
 326878 movi_i64
 308626 add_i32
 283186 call
 256817 exit_tb
 207232 nopn
 189388 goto_tb
 122398 and_i32
 117997 shr_i32
  89107 qemu_ld32
  82926 set_label
  82713 brcond_i32
  67169 qemu_st32
  55109 or_i32
  46536 ext32u_i64
  44288 xor_i32
  38103 sub_i32
  26361 shl_i32
  23218 shl_i64
  23218 qemu_st64
  23218 or_i64
  20474 shr_i64
  20445 qemu_ld64
  11161 qemu_ld8u
  10409 qemu_st8
   5013 qemu_ld16u
   3795 qemu_st16
   2776 qemu_ld8s
   1915 sar_i32
   1414 qemu_ld16s
839 not_i32
579 setcond_i32
213 br
 42 ext32s_i64
 30 mul_i64

But most other ops probably don't need any additional registers. It
could still be that with the extra register, some values could be kept
there instead of flushing to storage.

 cases, the number of registers used does not change. Moving the
 registers around is what worries me too.

 But there are other effects too, the helpers are now compiled so that
 the global env register is not used. Especially on hosts with low
 number of registers this is not optimal.

 Most helpers are very small functions, so I am not sure more registers
 will help.

$ nm --print-size --defined-only --size-sort --reverse-sort
obj-amd64/sparc-softmmu/op_helper.o |head -20
3a70 0c75 T helper_st_asi
46f0 086d T helper_ld_asi
2550 041b T __stw_mmu
12c0 040c T __stq_mmu
1d10 03c5 T __stl_mmu
1a90 0277 T __ldq_mmu
22f0 025f T __ldl_mmu
2b50 024f T 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
   Here's a RFC series for eliminating AREG0.
  
   Blue Swirl (11):
     Move user emulator stuff from cpu-exec.c to user-exec.c
     Delete unused tb_invalidate_page_range
  
   The above should be OK to commit.
  
     cpu_loop_exit: avoid using AREG0
     Delegate setup of TCG temporaries to targets
  
   These two are not, unless the overall plan is OK.
  
     TCG: fix negative frame offset calculations
     TCG/x86: use stack for TCG temps
     TCG/Sparc64: use stack for TCG temps
  
   But these three should be OK. I've tested lightly x86_64 and Sparc64 
   hosts.
  
     Add CONFIG_TARGET_NEEDS_AREG0
     Don't compile legacy qemu_ld/st functions if target doesn't need them
  
   Should be OK, though the latter patch only touches x86.
  
     Add new qemu_ld and qemu_st functions
     sparc: use new qemu_ld and qemu_st functions
  
   The last two compile but QEMU segfaults. I just made a naive
   conversion for getting comments.
  
  
   What is the goal behing removing TCG_AREG0? If it is speed improvement,
   can you please provide some benchmarks?
 
  There was some discussion earlier about why this (or parts of the
  conversion) may be a speed improvement:
 
  http://article.gmane.org/gmane.comp.emulators.qemu/101826
  http://article.gmane.org/gmane.comp.emulators.qemu/102156
 
  Ok, looks like I have missed that.
 
  There are no benchmarks yet. In fact, it may be difficult to make
  those without performing the removal completely.
 
  For example, patch 3/11 makes cpu_loop_exit take CPUState as a
  parameter instead of using global env, which would be available and
  the register is reserved anyway. This would only decrease performance
  at this stage, unless a complete conversion is done. I suspect the
  same would happen when moving all helpers from op_helper.c to
  helper.c. But after the whole conversion, this would be a neutral (no
  extra registers used) or even beneficial change (the code is free to
  use one more register).
 
   The env register is used very often (basically for every load/store, but
   also a lot of helpers), so it makes sense to reserve a register for it.
  
   For what I understand from your patch series, you prefer to pass this
   register explicitly to TCG functions. This basically means this TCG
   global will be loaded to host register as soon as it is used, but also
   regularly, as globals are saved back to their canonical location before
   an helper or a load/store.
  
   So it seems that this patch series will just allowing the env register
   to change over time, though it will not spare one more register for the
   TCG code, and it will emit longer TCG code to regularly reload the env
   global into a host register.
 
  But there will be one more register available in some cases. In other
 
  Inside the TCG code, it will basically happens very rarely, given
  load/store are really the most used instructions, and they need to load
  the env register.
 
 Not exactly, from a sample run with -d op_opt:
 $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
 /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
 1673966 movi_i32
  653931 ld_i32

On which target is that? It looks like that some variables could be
mapped directly as a global to avoid so many load/stores.

Anyway, even this simple load instruction, needs a base register, and it
mosts cases this register is TCG_AREG0. I agree that it can be replaced by
another base register, but in anycase, a register has to be used for that.

It means after such an instruction is encountered, one register is
reserved for this global up to the next helper/memory load store, so
there isn't any register spared, but more instructions generated to
explicitly load this register. 

I also just realized all globals registered through tcg_global_mem_new()
needs a base register and currently it is TCG_AREG0, that means even a
mov_i32 might need access to TCG_AREG0. How do you plan to solve that 
part?

  607432 mov_i32
  428684 st_i32
  326878 movi_i64
  308626 add_i32
  283186 call
  256817 exit_tb
  207232 nopn
  189388 goto_tb
  122398 and_i32
  117997 shr_i32
   89107 qemu_ld32
   82926 set_label
   82713 brcond_i32
   67169 qemu_st32
   55109 or_i32
   46536 ext32u_i64
   44288 xor_i32
   38103 sub_i32
   26361 shl_i32
   23218 shl_i64
   23218 qemu_st64
   23218 or_i64
   20474 shr_i64
   20445 qemu_ld64
   11161 qemu_ld8u
   10409 qemu_st8
5013 qemu_ld16u
3795 qemu_st16
2776 qemu_ld8s
1915 sar_i32
1414 qemu_ld16s
 839 not_i32
 579 setcond_i32
 213 br
  42 ext32s_i64
  30 mul_i64
 
 But most other ops probably don't 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Laurent Desnogues
On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
 wrote:
  On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
[...]
  The env register is used very often (basically for every load/store, but
  also a lot of helpers), so it makes sense to reserve a register for it.
 
  For what I understand from your patch series, you prefer to pass this
  register explicitly to TCG functions. This basically means this TCG
  global will be loaded to host register as soon as it is used, but also
  regularly, as globals are saved back to their canonical location before
  an helper or a load/store.
 
  So it seems that this patch series will just allowing the env register
  to change over time, though it will not spare one more register for the
  TCG code, and it will emit longer TCG code to regularly reload the env
  global into a host register.

 But there will be one more register available in some cases. In other

 Inside the TCG code, it will basically happens very rarely, given
 load/store are really the most used instructions, and they need to load
 the env register.

 Not exactly, from a sample run with -d op_opt:
 $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
 /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
 1673966 movi_i32
  653931 ld_i32
  607432 mov_i32
  428684 st_i32
  326878 movi_i64
  308626 add_i32
  283186 call
  256817 exit_tb
  207232 nopn
  189388 goto_tb
  122398 and_i32
  117997 shr_i32
  89107 qemu_ld32
  82926 set_label
  82713 brcond_i32
  67169 qemu_st32
  55109 or_i32
  46536 ext32u_i64
  44288 xor_i32
  38103 sub_i32
  26361 shl_i32
  23218 shl_i64
  23218 qemu_st64
  23218 or_i64
  20474 shr_i64
  20445 qemu_ld64
  11161 qemu_ld8u
  10409 qemu_st8
   5013 qemu_ld16u
   3795 qemu_st16
   2776 qemu_ld8s
   1915 sar_i32
   1414 qemu_ld16s
    839 not_i32
    579 setcond_i32
    213 br
     42 ext32s_i64
     30 mul_i64

Unless I missed something, this doesn't show the usage of
ld/st per TB, which is what Aurélien was looking for if I
understood correctly.  All I can say is that you had at
most 256817 TB's and 234507 qemu_ld/st, so about one per
TB.

Anyway I must be thick, because I fail to see how
generated code could access guest CPU registers without a
pointer to the CPU env :-)

IIUC the SPARC translator uses ld_i32/st_i32 mainly for
accessing the guest CPU registers, which due to register
windows is held in a dedicated global temp.  Is that
correct?  If so this is kind of hiding accesses to the
CPU env;  all other targets read/write registers by using
CPU env (through the use global temps in most cases).

So I think most (if not almost all) TB will need a pointer
to CPU env, which is why I think Aurélien's proposal to
keep a dedicated register that'd be loaded in the prologue
is the only way to not degrade performance of the
generated code (I'd add that this dedicated register
should be the one defined by the ABI as holding the first
parameter value, if that's possible;  I'm afraid this is
not necessarily a good idea).


Laurent



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 11:27:17AM +0200, Laurent Desnogues wrote:
 On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
  On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
  On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
 [...]
   The env register is used very often (basically for every load/store, but
   also a lot of helpers), so it makes sense to reserve a register for it.
  
   For what I understand from your patch series, you prefer to pass this
   register explicitly to TCG functions. This basically means this TCG
   global will be loaded to host register as soon as it is used, but also
   regularly, as globals are saved back to their canonical location before
   an helper or a load/store.
  
   So it seems that this patch series will just allowing the env register
   to change over time, though it will not spare one more register for the
   TCG code, and it will emit longer TCG code to regularly reload the env
   global into a host register.
 
  But there will be one more register available in some cases. In other
 
  Inside the TCG code, it will basically happens very rarely, given
  load/store are really the most used instructions, and they need to load
  the env register.
 
  Not exactly, from a sample run with -d op_opt:
  $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
  /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
  1673966 movi_i32
   653931 ld_i32
   607432 mov_i32
   428684 st_i32
   326878 movi_i64
   308626 add_i32
   283186 call
   256817 exit_tb
   207232 nopn
   189388 goto_tb
   122398 and_i32
   117997 shr_i32
   89107 qemu_ld32
   82926 set_label
   82713 brcond_i32
   67169 qemu_st32
   55109 or_i32
   46536 ext32u_i64
   44288 xor_i32
   38103 sub_i32
   26361 shl_i32
   23218 shl_i64
   23218 qemu_st64
   23218 or_i64
   20474 shr_i64
   20445 qemu_ld64
   11161 qemu_ld8u
   10409 qemu_st8
    5013 qemu_ld16u
    3795 qemu_st16
    2776 qemu_ld8s
    1915 sar_i32
    1414 qemu_ld16s
     839 not_i32
     579 setcond_i32
     213 br
      42 ext32s_i64
      30 mul_i64
 
 Unless I missed something, this doesn't show the usage of
 ld/st per TB, which is what Aurélien was looking for if I
 understood correctly.  All I can say is that you had at
 most 256817 TB's and 234507 qemu_ld/st, so about one per
 TB.
 
 Anyway I must be thick, because I fail to see how
 generated code could access guest CPU registers without a
 pointer to the CPU env :-)
 
 IIUC the SPARC translator uses ld_i32/st_i32 mainly for
 accessing the guest CPU registers, which due to register
 windows is held in a dedicated global temp.  Is that
 correct?  If so this is kind of hiding accesses to the
 CPU env;  all other targets read/write registers by using
 CPU env (through the use global temps in most cases).
 
 So I think most (if not almost all) TB will need a pointer
 to CPU env, which is why I think Aurélien's proposal to
 keep a dedicated register that'd be loaded in the prologue
 is the only way to not degrade performance of the
 generated code (I'd add that this dedicated register
 should be the one defined by the ABI as holding the first
 parameter value, if that's possible;  I'm afraid this is
 not necessarily a good idea).
 

I also thought about making it the one defined by the ABI as holding the
first parameter value, to avoid some register moves, but I am also
afraid this is not necessarily a good idea. It is usually a caller
saved register (because the caller is anyway overriding its value), 
which means that it has to be saved before calling every helper, and it
probably means more register move in that case.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
   Here's a RFC series for eliminating AREG0.
  
   Blue Swirl (11):
     Move user emulator stuff from cpu-exec.c to user-exec.c
     Delete unused tb_invalidate_page_range
  
   The above should be OK to commit.
  
     cpu_loop_exit: avoid using AREG0
     Delegate setup of TCG temporaries to targets
  
   These two are not, unless the overall plan is OK.
  
     TCG: fix negative frame offset calculations
     TCG/x86: use stack for TCG temps
     TCG/Sparc64: use stack for TCG temps
  
   But these three should be OK. I've tested lightly x86_64 and Sparc64 
   hosts.
  
     Add CONFIG_TARGET_NEEDS_AREG0
     Don't compile legacy qemu_ld/st functions if target doesn't need them
  
   Should be OK, though the latter patch only touches x86.
  
     Add new qemu_ld and qemu_st functions
     sparc: use new qemu_ld and qemu_st functions
  
   The last two compile but QEMU segfaults. I just made a naive
   conversion for getting comments.
  
  
   What is the goal behing removing TCG_AREG0? If it is speed improvement,
   can you please provide some benchmarks?
 
  There was some discussion earlier about why this (or parts of the
  conversion) may be a speed improvement:
 
  http://article.gmane.org/gmane.comp.emulators.qemu/101826
  http://article.gmane.org/gmane.comp.emulators.qemu/102156
 
  Ok, looks like I have missed that.
 
  There are no benchmarks yet. In fact, it may be difficult to make
  those without performing the removal completely.
 
  For example, patch 3/11 makes cpu_loop_exit take CPUState as a
  parameter instead of using global env, which would be available and
  the register is reserved anyway. This would only decrease performance
  at this stage, unless a complete conversion is done. I suspect the
  same would happen when moving all helpers from op_helper.c to
  helper.c. But after the whole conversion, this would be a neutral (no
  extra registers used) or even beneficial change (the code is free to
  use one more register).
 
   The env register is used very often (basically for every load/store, but
   also a lot of helpers), so it makes sense to reserve a register for it.
  
   For what I understand from your patch series, you prefer to pass this
   register explicitly to TCG functions. This basically means this TCG
   global will be loaded to host register as soon as it is used, but also
   regularly, as globals are saved back to their canonical location before
   an helper or a load/store.
  
   So it seems that this patch series will just allowing the env register
   to change over time, though it will not spare one more register for the
   TCG code, and it will emit longer TCG code to regularly reload the env
   global into a host register.
 
  But there will be one more register available in some cases. In other
 
  Inside the TCG code, it will basically happens very rarely, given
  load/store are really the most used instructions, and they need to load
  the env register.

 Not exactly, from a sample run with -d op_opt:
 $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
 /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
 1673966 movi_i32
  653931 ld_i32

 On which target is that? It looks like that some variables could be
 mapped directly as a global to avoid so many load/stores.

Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X.

 Anyway, even this simple load instruction, needs a base register, and it
 mosts cases this register is TCG_AREG0. I agree that it can be replaced by
 another base register, but in anycase, a register has to be used for that.

 It means after such an instruction is encountered, one register is
 reserved for this global up to the next helper/memory load store, so
 there isn't any register spared, but more instructions generated to
 explicitly load this register.

 I also just realized all globals registered through tcg_global_mem_new()
 needs a base register and currently it is TCG_AREG0, that means even a
 mov_i32 might need access to TCG_AREG0. How do you plan to solve that
 part?

In most cases the backing storage is CPUState, so cpu_env would be a
logical choice.

  607432 mov_i32
  428684 st_i32
  326878 movi_i64
  308626 add_i32
  283186 call
  256817 exit_tb
  207232 nopn
  189388 goto_tb
  122398 and_i32
  117997 shr_i32
   89107 qemu_ld32
   82926 set_label
   82713 brcond_i32
   67169 qemu_st32
   55109 or_i32
   46536 ext32u_i64
   44288 xor_i32
   38103 sub_i32
   26361 shl_i32
   23218 shl_i64
   23218 qemu_st64
   23218 or_i64
   20474 shr_i64
   20445 qemu_ld64
   11161 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
Here's a RFC series for eliminating AREG0.
   
Blue Swirl (11):
  Move user emulator stuff from cpu-exec.c to user-exec.c
  Delete unused tb_invalidate_page_range
   
The above should be OK to commit.
   
  cpu_loop_exit: avoid using AREG0
  Delegate setup of TCG temporaries to targets
   
These two are not, unless the overall plan is OK.
   
  TCG: fix negative frame offset calculations
  TCG/x86: use stack for TCG temps
  TCG/Sparc64: use stack for TCG temps
   
But these three should be OK. I've tested lightly x86_64 and Sparc64 
hosts.
   
  Add CONFIG_TARGET_NEEDS_AREG0
  Don't compile legacy qemu_ld/st functions if target doesn't need 
them
   
Should be OK, though the latter patch only touches x86.
   
  Add new qemu_ld and qemu_st functions
  sparc: use new qemu_ld and qemu_st functions
   
The last two compile but QEMU segfaults. I just made a naive
conversion for getting comments.
   
   
What is the goal behing removing TCG_AREG0? If it is speed 
improvement,
can you please provide some benchmarks?
  
   There was some discussion earlier about why this (or parts of the
   conversion) may be a speed improvement:
  
   http://article.gmane.org/gmane.comp.emulators.qemu/101826
   http://article.gmane.org/gmane.comp.emulators.qemu/102156
  
   Ok, looks like I have missed that.
  
   There are no benchmarks yet. In fact, it may be difficult to make
   those without performing the removal completely.
  
   For example, patch 3/11 makes cpu_loop_exit take CPUState as a
   parameter instead of using global env, which would be available and
   the register is reserved anyway. This would only decrease performance
   at this stage, unless a complete conversion is done. I suspect the
   same would happen when moving all helpers from op_helper.c to
   helper.c. But after the whole conversion, this would be a neutral (no
   extra registers used) or even beneficial change (the code is free to
   use one more register).
  
The env register is used very often (basically for every load/store, 
but
also a lot of helpers), so it makes sense to reserve a register for 
it.
   
For what I understand from your patch series, you prefer to pass this
register explicitly to TCG functions. This basically means this TCG
global will be loaded to host register as soon as it is used, but also
regularly, as globals are saved back to their canonical location 
before
an helper or a load/store.
   
So it seems that this patch series will just allowing the env 
register
to change over time, though it will not spare one more register for 
the
TCG code, and it will emit longer TCG code to regularly reload the env
global into a host register.
  
   But there will be one more register available in some cases. In other
  
   Inside the TCG code, it will basically happens very rarely, given
   load/store are really the most used instructions, and they need to load
   the env register.
 
  Not exactly, from a sample run with -d op_opt:
  $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
  /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
  1673966 movi_i32
   653931 ld_i32
 
  On which target is that? It looks like that some variables could be
  mapped directly as a global to avoid so many load/stores.
 
 Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X.
 
  Anyway, even this simple load instruction, needs a base register, and it
  mosts cases this register is TCG_AREG0. I agree that it can be replaced by
  another base register, but in anycase, a register has to be used for that.
 
  It means after such an instruction is encountered, one register is
  reserved for this global up to the next helper/memory load store, so
  there isn't any register spared, but more instructions generated to
  explicitly load this register.
 
  I also just realized all globals registered through tcg_global_mem_new()
  needs a base register and currently it is TCG_AREG0, that means even a
  mov_i32 might need access to TCG_AREG0. How do you plan to solve that
  part?
 
 In most cases the backing storage is CPUState, so cpu_env would be a
 logical choice.
 
   607432 mov_i32
   428684 st_i32
   326878 movi_i64
   308626 add_i32
   283186 call
   256817 exit_tb
   207232 nopn
   189388 goto_tb
   122398 and_i32
   117997 shr_i32
    89107 qemu_ld32
    82926 set_label
    82713 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues
laurent.desnog...@gmail.com wrote:
 On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
 wrote:
  On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
 [...]
  The env register is used very often (basically for every load/store, but
  also a lot of helpers), so it makes sense to reserve a register for it.
 
  For what I understand from your patch series, you prefer to pass this
  register explicitly to TCG functions. This basically means this TCG
  global will be loaded to host register as soon as it is used, but also
  regularly, as globals are saved back to their canonical location before
  an helper or a load/store.
 
  So it seems that this patch series will just allowing the env register
  to change over time, though it will not spare one more register for the
  TCG code, and it will emit longer TCG code to regularly reload the env
  global into a host register.

 But there will be one more register available in some cases. In other

 Inside the TCG code, it will basically happens very rarely, given
 load/store are really the most used instructions, and they need to load
 the env register.

 Not exactly, from a sample run with -d op_opt:
 $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
 /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
 1673966 movi_i32
  653931 ld_i32
  607432 mov_i32
  428684 st_i32
  326878 movi_i64
  308626 add_i32
  283186 call
  256817 exit_tb
  207232 nopn
  189388 goto_tb
  122398 and_i32
  117997 shr_i32
  89107 qemu_ld32
  82926 set_label
  82713 brcond_i32
  67169 qemu_st32
  55109 or_i32
  46536 ext32u_i64
  44288 xor_i32
  38103 sub_i32
  26361 shl_i32
  23218 shl_i64
  23218 qemu_st64
  23218 or_i64
  20474 shr_i64
  20445 qemu_ld64
  11161 qemu_ld8u
  10409 qemu_st8
   5013 qemu_ld16u
   3795 qemu_st16
   2776 qemu_ld8s
   1915 sar_i32
   1414 qemu_ld16s
    839 not_i32
    579 setcond_i32
    213 br
     42 ext32s_i64
     30 mul_i64

 Unless I missed something, this doesn't show the usage of
 ld/st per TB, which is what Aurélien was looking for if I
 understood correctly.  All I can say is that you had at
 most 256817 TB's and 234507 qemu_ld/st, so about one per
 TB.

The question was ratio of loads/stores to other instructions. The
statistics are not per TB. There were about 174880 TBs.

 Anyway I must be thick, because I fail to see how
 generated code could access guest CPU registers without a
 pointer to the CPU env :-)

 IIUC the SPARC translator uses ld_i32/st_i32 mainly for
 accessing the guest CPU registers, which due to register
 windows is held in a dedicated global temp.  Is that
 correct?  If so this is kind of hiding accesses to the
 CPU env;  all other targets read/write registers by using
 CPU env (through the use global temps in most cases).

 So I think most (if not almost all) TB will need a pointer
 to CPU env, which is why I think Aurélien's proposal to
 keep a dedicated register that'd be loaded in the prologue
 is the only way to not degrade performance of the
 generated code (I'd add that this dedicated register
 should be the one defined by the ABI as holding the first
 parameter value, if that's possible;  I'm afraid this is
 not necessarily a good idea).

CPU env will be used, but the register could be made available for
other uses too.



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net 
 wrote:
  On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno 
   aurel...@aurel32.net wrote:
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
Here's a RFC series for eliminating AREG0.
   
Blue Swirl (11):
  Move user emulator stuff from cpu-exec.c to user-exec.c
  Delete unused tb_invalidate_page_range
   
The above should be OK to commit.
   
  cpu_loop_exit: avoid using AREG0
  Delegate setup of TCG temporaries to targets
   
These two are not, unless the overall plan is OK.
   
  TCG: fix negative frame offset calculations
  TCG/x86: use stack for TCG temps
  TCG/Sparc64: use stack for TCG temps
   
But these three should be OK. I've tested lightly x86_64 and 
Sparc64 hosts.
   
  Add CONFIG_TARGET_NEEDS_AREG0
  Don't compile legacy qemu_ld/st functions if target doesn't need 
them
   
Should be OK, though the latter patch only touches x86.
   
  Add new qemu_ld and qemu_st functions
  sparc: use new qemu_ld and qemu_st functions
   
The last two compile but QEMU segfaults. I just made a naive
conversion for getting comments.
   
   
What is the goal behing removing TCG_AREG0? If it is speed 
improvement,
can you please provide some benchmarks?
  
   There was some discussion earlier about why this (or parts of the
   conversion) may be a speed improvement:
  
   http://article.gmane.org/gmane.comp.emulators.qemu/101826
   http://article.gmane.org/gmane.comp.emulators.qemu/102156
  
   Ok, looks like I have missed that.
  
   There are no benchmarks yet. In fact, it may be difficult to make
   those without performing the removal completely.
  
   For example, patch 3/11 makes cpu_loop_exit take CPUState as a
   parameter instead of using global env, which would be available and
   the register is reserved anyway. This would only decrease performance
   at this stage, unless a complete conversion is done. I suspect the
   same would happen when moving all helpers from op_helper.c to
   helper.c. But after the whole conversion, this would be a neutral (no
   extra registers used) or even beneficial change (the code is free to
   use one more register).
  
The env register is used very often (basically for every load/store, 
but
also a lot of helpers), so it makes sense to reserve a register for 
it.
   
For what I understand from your patch series, you prefer to pass this
register explicitly to TCG functions. This basically means this TCG
global will be loaded to host register as soon as it is used, but 
also
regularly, as globals are saved back to their canonical location 
before
an helper or a load/store.
   
So it seems that this patch series will just allowing the env 
register
to change over time, though it will not spare one more register for 
the
TCG code, and it will emit longer TCG code to regularly reload the 
env
global into a host register.
  
   But there will be one more register available in some cases. In other
  
   Inside the TCG code, it will basically happens very rarely, given
   load/store are really the most used instructions, and they need to load
   the env register.
 
  Not exactly, from a sample run with -d op_opt:
  $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
  /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
  1673966 movi_i32
   653931 ld_i32
 
  On which target is that? It looks like that some variables could be
  mapped directly as a global to avoid so many load/stores.

 Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X.

  Anyway, even this simple load instruction, needs a base register, and it
  mosts cases this register is TCG_AREG0. I agree that it can be replaced by
  another base register, but in anycase, a register has to be used for that.
 
  It means after such an instruction is encountered, one register is
  reserved for this global up to the next helper/memory load store, so
  there isn't any register spared, but more instructions generated to
  explicitly load this register.
 
  I also just realized all globals registered through tcg_global_mem_new()
  needs a base register and currently it is TCG_AREG0, that means even a
  mov_i32 might need access to TCG_AREG0. How do you plan to solve that
  part?

 In most cases the backing storage is CPUState, so cpu_env would be a
 logical choice.

   607432 mov_i32
   428684 st_i32
   326878 movi_i64
   308626 add_i32
   283186 call
   256817 exit_tb
   207232 nopn
   189388 goto_tb
  

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues
 laurent.desnog...@gmail.com wrote:
  On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
  On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
  On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
  [...]
   The env register is used very often (basically for every load/store, 
   but
   also a lot of helpers), so it makes sense to reserve a register for it.
  
   For what I understand from your patch series, you prefer to pass this
   register explicitly to TCG functions. This basically means this TCG
   global will be loaded to host register as soon as it is used, but also
   regularly, as globals are saved back to their canonical location before
   an helper or a load/store.
  
   So it seems that this patch series will just allowing the env 
   register
   to change over time, though it will not spare one more register for the
   TCG code, and it will emit longer TCG code to regularly reload the env
   global into a host register.
 
  But there will be one more register available in some cases. In other
 
  Inside the TCG code, it will basically happens very rarely, given
  load/store are really the most used instructions, and they need to load
  the env register.
 
  Not exactly, from a sample run with -d op_opt:
  $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
  /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
  1673966 movi_i32
   653931 ld_i32
   607432 mov_i32
   428684 st_i32
   326878 movi_i64
   308626 add_i32
   283186 call
   256817 exit_tb
   207232 nopn
   189388 goto_tb
   122398 and_i32
   117997 shr_i32
   89107 qemu_ld32
   82926 set_label
   82713 brcond_i32
   67169 qemu_st32
   55109 or_i32
   46536 ext32u_i64
   44288 xor_i32
   38103 sub_i32
   26361 shl_i32
   23218 shl_i64
   23218 qemu_st64
   23218 or_i64
   20474 shr_i64
   20445 qemu_ld64
   11161 qemu_ld8u
   10409 qemu_st8
    5013 qemu_ld16u
    3795 qemu_st16
    2776 qemu_ld8s
    1915 sar_i32
    1414 qemu_ld16s
     839 not_i32
     579 setcond_i32
     213 br
      42 ext32s_i64
      30 mul_i64
 
  Unless I missed something, this doesn't show the usage of
  ld/st per TB, which is what Aurélien was looking for if I
  understood correctly.  All I can say is that you had at
  most 256817 TB's and 234507 qemu_ld/st, so about one per
  TB.
 
 The question was ratio of loads/stores to other instructions. The
 statistics are not per TB. There were about 174880 TBs.
 
  Anyway I must be thick, because I fail to see how
  generated code could access guest CPU registers without a
  pointer to the CPU env :-)
 
  IIUC the SPARC translator uses ld_i32/st_i32 mainly for
  accessing the guest CPU registers, which due to register
  windows is held in a dedicated global temp.  Is that
  correct?  If so this is kind of hiding accesses to the
  CPU env;  all other targets read/write registers by using
  CPU env (through the use global temps in most cases).
 
  So I think most (if not almost all) TB will need a pointer
  to CPU env, which is why I think Aurélien's proposal to
  keep a dedicated register that'd be loaded in the prologue
  is the only way to not degrade performance of the
  generated code (I'd add that this dedicated register
  should be the one defined by the ABI as holding the first
  parameter value, if that's possible;  I'm afraid this is
  not necessarily a good idea).
 
 CPU env will be used, but the register could be made available for
 other uses too.
 

What other uses? It is needed for almost every TB, so there is not other
possible use. Let's take an example, a simple TB from mips on x86_64:

| IN: start_kernel
| 0x80510958:  jal0x8051d50c
| 0x8051095c:  nop
| 
| OP after liveness analysis:
|  movi_i32 ra,$0x80510960
|  movi_i32 PC,$0x8051d50c
|  exit_tb $0x0
|  end

| OUT: [size=28]
| 0x40f22960:  mov$0x80510960,%ebp
| 0x40f22965:  mov%ebp,0x7c(%r14)
| 0x40f22969:  mov$0x8051d50c,%ebp
| 0x40f2296e:  mov%ebp,0x80(%r14)
| 0x40f22975:  xor%eax,%eax
| 0x40f22977:  jmpq   0x115042e

x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite 
simple (only 2 movi_i32), the resulting code makes 2 access to env to
save the two registers. Having to reload the env pointer each time to a
register would clearly increase the size of this TB.

Another question, imagine env is not in a register anymore. What kind of
code do you plan to use to load the env pointer into a register? This is
something we need to know to evaluate the cost of not having a fixed 
register for env withing the TCG code.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues
 laurent.desnog...@gmail.com wrote:
  On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
  On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
  On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
  [...]
   The env register is used very often (basically for every load/store, 
   but
   also a lot of helpers), so it makes sense to reserve a register for 
   it.
  
   For what I understand from your patch series, you prefer to pass this
   register explicitly to TCG functions. This basically means this TCG
   global will be loaded to host register as soon as it is used, but also
   regularly, as globals are saved back to their canonical location 
   before
   an helper or a load/store.
  
   So it seems that this patch series will just allowing the env 
   register
   to change over time, though it will not spare one more register for 
   the
   TCG code, and it will emit longer TCG code to regularly reload the env
   global into a host register.
 
  But there will be one more register available in some cases. In other
 
  Inside the TCG code, it will basically happens very rarely, given
  load/store are really the most used instructions, and they need to load
  the env register.
 
  Not exactly, from a sample run with -d op_opt:
  $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
  /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
  1673966 movi_i32
   653931 ld_i32
   607432 mov_i32
   428684 st_i32
   326878 movi_i64
   308626 add_i32
   283186 call
   256817 exit_tb
   207232 nopn
   189388 goto_tb
   122398 and_i32
   117997 shr_i32
   89107 qemu_ld32
   82926 set_label
   82713 brcond_i32
   67169 qemu_st32
   55109 or_i32
   46536 ext32u_i64
   44288 xor_i32
   38103 sub_i32
   26361 shl_i32
   23218 shl_i64
   23218 qemu_st64
   23218 or_i64
   20474 shr_i64
   20445 qemu_ld64
   11161 qemu_ld8u
   10409 qemu_st8
    5013 qemu_ld16u
    3795 qemu_st16
    2776 qemu_ld8s
    1915 sar_i32
    1414 qemu_ld16s
     839 not_i32
     579 setcond_i32
     213 br
      42 ext32s_i64
      30 mul_i64
 
  Unless I missed something, this doesn't show the usage of
  ld/st per TB, which is what Aurélien was looking for if I
  understood correctly.  All I can say is that you had at
  most 256817 TB's and 234507 qemu_ld/st, so about one per
  TB.

 The question was ratio of loads/stores to other instructions. The
 statistics are not per TB. There were about 174880 TBs.

  Anyway I must be thick, because I fail to see how
  generated code could access guest CPU registers without a
  pointer to the CPU env :-)
 
  IIUC the SPARC translator uses ld_i32/st_i32 mainly for
  accessing the guest CPU registers, which due to register
  windows is held in a dedicated global temp.  Is that
  correct?  If so this is kind of hiding accesses to the
  CPU env;  all other targets read/write registers by using
  CPU env (through the use global temps in most cases).
 
  So I think most (if not almost all) TB will need a pointer
  to CPU env, which is why I think Aurélien's proposal to
  keep a dedicated register that'd be loaded in the prologue
  is the only way to not degrade performance of the
  generated code (I'd add that this dedicated register
  should be the one defined by the ABI as holding the first
  parameter value, if that's possible;  I'm afraid this is
  not necessarily a good idea).

 CPU env will be used, but the register could be made available for
 other uses too.


 What other uses? It is needed for almost every TB, so there is not other
 possible use. Let's take an example, a simple TB from mips on x86_64:

 | IN: start_kernel
 | 0x80510958:  jal        0x8051d50c
 | 0x8051095c:  nop
 |
 | OP after liveness analysis:
 |  movi_i32 ra,$0x80510960
 |  movi_i32 PC,$0x8051d50c
 |  exit_tb $0x0
 |  end

 | OUT: [size=28]
 | 0x40f22960:  mov    $0x80510960,%ebp
 | 0x40f22965:  mov    %ebp,0x7c(%r14)
 | 0x40f22969:  mov    $0x8051d50c,%ebp
 | 0x40f2296e:  mov    %ebp,0x80(%r14)
 | 0x40f22975:  xor    %eax,%eax
 | 0x40f22977:  jmpq   0x115042e

 x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
 simple (only 2 movi_i32), the resulting code makes 2 access to env to
 save the two registers. Having to reload the env pointer each time to a
 register would clearly increase the size of this TB.

I don't think TCG would be that simple, instead the pointer would be
loaded only once in this case.

 Another question, imagine env is not in a register anymore. What kind of
 code do you plan to use to load the env pointer into a register? This is
 something we need to know to evaluate the cost of not having 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  I still don't get where are this list of possible changes? Did I miss
  something in another thread?
 
 I'm referring to the patches I sent.

Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
patches 6 and 7 should be done for all hosts or none of them.

  On the TCG generated code, the env structure is used almost for every
  op, so it really makes sense to keep it in a register instead of having to
  reload the address of env regularly from memory. Given it only affects
  TCG generated code, I don't see the point of portability here.
 
 For example, maybe the bugs in Sparc glibc could be avoided by using
 one of %i set of registers (not accessible from helpers) for AREG0
 within generated code instead of %g registers which seem to be
 fragile.

First of all, but it's a different subject, I am not sure there are
sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
example the following code is probably wrong:

/* Note: must be synced with dyngen-exec.h */
#ifdef CONFIG_SOLARIS
#define TCG_AREG0 TCG_REG_G2
#elif defined(__sparc_v9__)
#define TCG_AREG0 TCG_REG_G5
#else
#define TCG_AREG0 TCG_REG_G6
#endif

__sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, 
so the condition is probably wrong there. Secondly the SPARC ABI [1] on
page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
the right to use this registers.

Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code
would prevent you to use a register from the %i set for it.

   On the other hand, I have objections to remove uses of TCG_AREG0 from
   the TCG part. This register is part of the TCG design and thus used very
   often. In any case we have to keep it for accesses to the globals, so we
   can keep it for softmmu load/stores too.
 
  Right, any changes would need to be backed by performance figures.
 
   If we go for the removal of
   TCG_AREG0 from the GCC side, it probably means loading it in the prologue
   instead.
 
  Do you mean cpu-exec.c side with this? I think this is a very similar
  register tradeoff case to helpers.
 
  I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of
  tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the
  env pointer into the TCG_AREG0 register (which is kept internal to TCG
  generated code).
 
 Maybe it would be easy to test and benchmark this change, it doesn't
 affect generated code or helpers.
 

Yes it's something that can be benchmarked. From experience it won't 
make any significant performance change, it's basically two register
moves (one in the caller, one in the prologue), for each TB which is
not chained. In any case, it will introduce a slight overhead that has
to be compensated by allowing the helpers to use an additional 
register.

[1] http://www.sparc.com/standards/psABI3rd.pdf 

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues
  laurent.desnog...@gmail.com wrote:
   On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote:
   On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
   On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno 
   aurel...@aurel32.net wrote:
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
   [...]
The env register is used very often (basically for every 
load/store, but
also a lot of helpers), so it makes sense to reserve a register for 
it.
   
For what I understand from your patch series, you prefer to pass 
this
register explicitly to TCG functions. This basically means this TCG
global will be loaded to host register as soon as it is used, but 
also
regularly, as globals are saved back to their canonical location 
before
an helper or a load/store.
   
So it seems that this patch series will just allowing the env 
register
to change over time, though it will not spare one more register for 
the
TCG code, and it will emit longer TCG code to regularly reload the 
env
global into a host register.
  
   But there will be one more register available in some cases. In other
  
   Inside the TCG code, it will basically happens very rarely, given
   load/store are really the most used instructions, and they need to load
   the env register.
  
   Not exactly, from a sample run with -d op_opt:
   $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
   /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
   1673966 movi_i32
    653931 ld_i32
    607432 mov_i32
    428684 st_i32
    326878 movi_i64
    308626 add_i32
    283186 call
    256817 exit_tb
    207232 nopn
    189388 goto_tb
    122398 and_i32
    117997 shr_i32
    89107 qemu_ld32
    82926 set_label
    82713 brcond_i32
    67169 qemu_st32
    55109 or_i32
    46536 ext32u_i64
    44288 xor_i32
    38103 sub_i32
    26361 shl_i32
    23218 shl_i64
    23218 qemu_st64
    23218 or_i64
    20474 shr_i64
    20445 qemu_ld64
    11161 qemu_ld8u
    10409 qemu_st8
     5013 qemu_ld16u
     3795 qemu_st16
     2776 qemu_ld8s
     1915 sar_i32
     1414 qemu_ld16s
      839 not_i32
      579 setcond_i32
      213 br
       42 ext32s_i64
       30 mul_i64
  
   Unless I missed something, this doesn't show the usage of
   ld/st per TB, which is what Aurélien was looking for if I
   understood correctly.  All I can say is that you had at
   most 256817 TB's and 234507 qemu_ld/st, so about one per
   TB.
 
  The question was ratio of loads/stores to other instructions. The
  statistics are not per TB. There were about 174880 TBs.
 
   Anyway I must be thick, because I fail to see how
   generated code could access guest CPU registers without a
   pointer to the CPU env :-)
  
   IIUC the SPARC translator uses ld_i32/st_i32 mainly for
   accessing the guest CPU registers, which due to register
   windows is held in a dedicated global temp.  Is that
   correct?  If so this is kind of hiding accesses to the
   CPU env;  all other targets read/write registers by using
   CPU env (through the use global temps in most cases).
  
   So I think most (if not almost all) TB will need a pointer
   to CPU env, which is why I think Aurélien's proposal to
   keep a dedicated register that'd be loaded in the prologue
   is the only way to not degrade performance of the
   generated code (I'd add that this dedicated register
   should be the one defined by the ABI as holding the first
   parameter value, if that's possible;  I'm afraid this is
   not necessarily a good idea).
 
  CPU env will be used, but the register could be made available for
  other uses too.
 
 
  What other uses? It is needed for almost every TB, so there is not other
  possible use. Let's take an example, a simple TB from mips on x86_64:
 
  | IN: start_kernel
  | 0x80510958:  jal        0x8051d50c
  | 0x8051095c:  nop
  |
  | OP after liveness analysis:
  |  movi_i32 ra,$0x80510960
  |  movi_i32 PC,$0x8051d50c
  |  exit_tb $0x0
  |  end
 
  | OUT: [size=28]
  | 0x40f22960:  mov    $0x80510960,%ebp
  | 0x40f22965:  mov    %ebp,0x7c(%r14)
  | 0x40f22969:  mov    $0x8051d50c,%ebp
  | 0x40f2296e:  mov    %ebp,0x80(%r14)
  | 0x40f22975:  xor    %eax,%eax
  | 0x40f22977:  jmpq   0x115042e
 
  x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
  simple (only 2 movi_i32), the resulting code makes 2 access to env to
  save the two registers. Having to reload the env pointer each time to a
  register would clearly increase the size of this TB.
 
 I don't think TCG would be that simple, instead the pointer would be
 loaded only once in this 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Laurent Desnogues
On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote:
[...]
 x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
 simple (only 2 movi_i32), the resulting code makes 2 access to env to
 save the two registers. Having to reload the env pointer each time to a
 register would clearly increase the size of this TB.

 I don't think TCG would be that simple, instead the pointer would be
 loaded only once in this case.

Assuming TCG was able to allocate a register for that,
it would be live at most for one TB, so you'd have to
load it at least once per TB, and with block chaining
that wouldn't be efficient as you'd keep on reloading it.


Laurent



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  I still don't get where are this list of possible changes? Did I miss
  something in another thread?

 I'm referring to the patches I sent.

 Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
 patches 6 and 7 should be done for all hosts or none of them.

The changes can be done in steps, but of course removing temp_buf from
CPUState would need all targets to be converted first.

  On the TCG generated code, the env structure is used almost for every
  op, so it really makes sense to keep it in a register instead of having to
  reload the address of env regularly from memory. Given it only affects
  TCG generated code, I don't see the point of portability here.

 For example, maybe the bugs in Sparc glibc could be avoided by using
 one of %i set of registers (not accessible from helpers) for AREG0
 within generated code instead of %g registers which seem to be
 fragile.

 First of all, but it's a different subject, I am not sure there are
 sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
 example the following code is probably wrong:

 /* Note: must be synced with dyngen-exec.h */
 #ifdef CONFIG_SOLARIS
 #define TCG_AREG0 TCG_REG_G2
 #elif defined(__sparc_v9__)
 #define TCG_AREG0 TCG_REG_G5
 #else
 #define TCG_AREG0 TCG_REG_G6
 #endif

 __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
 so the condition is probably wrong there. Secondly the SPARC ABI [1] on
 page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
 the right to use this registers.

Yes, but the situation is not so nice. Please see this post for status
as of 2010:
http://article.gmane.org/gmane.comp.emulators.qemu/63610

This is from Debian glibc 2.11.2-10:
$ file /lib/libc-2.11.2.so
/lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
2.6.18, stripped
$ objdump -d /lib/libc.so.6 |grep %g1|wc -l
69648
$ objdump -d /lib/libc.so.6 |grep %g2|wc -l
37299
$ objdump -d /lib/libc.so.6 |grep %g3|wc -l
20635
$ objdump -d /lib/libc.so.6 |grep %g4|wc -l
11603
$ objdump -d /lib/libc.so.6 |grep %g5|wc -l
448
$ objdump -d /lib/libc.so.6 |grep %g6|wc -l
150
$ objdump -d /lib/libc.so.6 |grep %g7|wc -l
3052

Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,
or %g1 and %g5 for scratch purposes. However, it is the application
registers %g2 to %g4 that are used heaviest. Looking inside the
objdump it's easy to see that the uses are not for example saving or
restoring, but actually using them without saving the previous value
first:
000211e0 __divdi3:
   211e0:   9d e3 bf a0 save  %sp, -96, %sp
   211e4:   90 10 00 18 mov  %i0, %o0
   211e8:   92 10 00 19 mov  %i1, %o1
   211ec:   94 10 00 1a mov  %i2, %o2
   211f0:   96 10 00 1b mov  %i3, %o3
   211f4:   80 a6 20 00 cmp  %i0, 0
   211f8:   06 40 00 10 bl,pn   %icc, 21238 __divdi3+0x58
   211fc:   a0 10 20 00 clr  %l0
   21200:   80 a2 a0 00 cmp  %o2, 0
   21204:   26 40 00 13 bl,a,pn   %icc, 21250 __divdi3+0x70
   21208:   a0 38 00 10 xnor  %g0, %l0, %l0
   2120c:   7f ff fe ed call  20dc0 __ashldi3+0x40
   21210:   98 10 20 00 clr  %o4
   21214:   84 10 00 08 mov  %o0, %g2

...whoops...

   21218:   80 a4 20 00 cmp  %l0, 0
   2121c:   02 40 00 04 be,pn   %icc, 2122c __divdi3+0x4c
   21220:   86 10 00 09 mov  %o1, %g3

...whoops...

   21224:   86 a0 00 09 subcc  %g0, %o1, %g3
   21228:   84 60 00 02 subc  %g0, %g2, %g2
   2122c:   b2 10 00 03 mov  %g3, %i1
   21230:   81 cf e0 08 rett  %i7 + 8
   21234:   90 10 00 02 mov  %g2, %o0
   21238:   92 a0 00 19 subcc  %g0, %i1, %o1
   2123c:   90 60 00 18 subc  %g0, %i0, %o0
   21240:   80 a2 a0 00 cmp  %o2, 0
   21244:   16 4f ff f2 bge  %icc, 2120c __divdi3+0x2c
   21248:   a0 10 3f ff mov  -1, %l0
   2124c:   a0 38 00 10 xnor  %g0, %l0, %l0
   21250:   96 a0 00 0b subcc  %g0, %o3, %o3
   21254:   10 6f ff ee b  %xcc, 2120c __divdi3+0x2c
   21258:   94 60 00 0a subc  %g0, %o2, %o2
   2125c:   01 00 00 00 nop

This is libc from OpenBSD/Sparc64 4.9:
$ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l
   40562
$ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l
   20384
$ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l
   10240
$ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l
6606
$ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l
3811
$ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l
   4
$ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l
  20

Not so great there either.

 Anyway, I don't see why keeping 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 2:37 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues
  laurent.desnog...@gmail.com wrote:
   On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com 
   wrote:
   On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
   On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno 
   aurel...@aurel32.net wrote:
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
   [...]
The env register is used very often (basically for every 
load/store, but
also a lot of helpers), so it makes sense to reserve a register 
for it.
   
For what I understand from your patch series, you prefer to pass 
this
register explicitly to TCG functions. This basically means this TCG
global will be loaded to host register as soon as it is used, but 
also
regularly, as globals are saved back to their canonical location 
before
an helper or a load/store.
   
So it seems that this patch series will just allowing the env 
register
to change over time, though it will not spare one more register 
for the
TCG code, and it will emit longer TCG code to regularly reload the 
env
global into a host register.
  
   But there will be one more register available in some cases. In other
  
   Inside the TCG code, it will basically happens very rarely, given
   load/store are really the most used instructions, and they need to 
   load
   the env register.
  
   Not exactly, from a sample run with -d op_opt:
   $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
   /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
   1673966 movi_i32
    653931 ld_i32
    607432 mov_i32
    428684 st_i32
    326878 movi_i64
    308626 add_i32
    283186 call
    256817 exit_tb
    207232 nopn
    189388 goto_tb
    122398 and_i32
    117997 shr_i32
    89107 qemu_ld32
    82926 set_label
    82713 brcond_i32
    67169 qemu_st32
    55109 or_i32
    46536 ext32u_i64
    44288 xor_i32
    38103 sub_i32
    26361 shl_i32
    23218 shl_i64
    23218 qemu_st64
    23218 or_i64
    20474 shr_i64
    20445 qemu_ld64
    11161 qemu_ld8u
    10409 qemu_st8
     5013 qemu_ld16u
     3795 qemu_st16
     2776 qemu_ld8s
     1915 sar_i32
     1414 qemu_ld16s
      839 not_i32
      579 setcond_i32
      213 br
       42 ext32s_i64
       30 mul_i64
  
   Unless I missed something, this doesn't show the usage of
   ld/st per TB, which is what Aurélien was looking for if I
   understood correctly.  All I can say is that you had at
   most 256817 TB's and 234507 qemu_ld/st, so about one per
   TB.
 
  The question was ratio of loads/stores to other instructions. The
  statistics are not per TB. There were about 174880 TBs.
 
   Anyway I must be thick, because I fail to see how
   generated code could access guest CPU registers without a
   pointer to the CPU env :-)
  
   IIUC the SPARC translator uses ld_i32/st_i32 mainly for
   accessing the guest CPU registers, which due to register
   windows is held in a dedicated global temp.  Is that
   correct?  If so this is kind of hiding accesses to the
   CPU env;  all other targets read/write registers by using
   CPU env (through the use global temps in most cases).
  
   So I think most (if not almost all) TB will need a pointer
   to CPU env, which is why I think Aurélien's proposal to
   keep a dedicated register that'd be loaded in the prologue
   is the only way to not degrade performance of the
   generated code (I'd add that this dedicated register
   should be the one defined by the ABI as holding the first
   parameter value, if that's possible;  I'm afraid this is
   not necessarily a good idea).
 
  CPU env will be used, but the register could be made available for
  other uses too.
 
 
  What other uses? It is needed for almost every TB, so there is not other
  possible use. Let's take an example, a simple TB from mips on x86_64:
 
  | IN: start_kernel
  | 0x80510958:  jal        0x8051d50c
  | 0x8051095c:  nop
  |
  | OP after liveness analysis:
  |  movi_i32 ra,$0x80510960
  |  movi_i32 PC,$0x8051d50c
  |  exit_tb $0x0
  |  end
 
  | OUT: [size=28]
  | 0x40f22960:  mov    $0x80510960,%ebp
  | 0x40f22965:  mov    %ebp,0x7c(%r14)
  | 0x40f22969:  mov    $0x8051d50c,%ebp
  | 0x40f2296e:  mov    %ebp,0x80(%r14)
  | 0x40f22975:  xor    %eax,%eax
  | 0x40f22977:  jmpq   0x115042e
 
  x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
  simple (only 2 movi_i32), the resulting code makes 2 access to env to
  save the two registers. Having to reload the env pointer each time to a
  register would clearly increase the size of this TB.

 I don't 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues
laurent.desnog...@gmail.com wrote:
 On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote:
 [...]
 x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
 simple (only 2 movi_i32), the resulting code makes 2 access to env to
 save the two registers. Having to reload the env pointer each time to a
 register would clearly increase the size of this TB.

 I don't think TCG would be that simple, instead the pointer would be
 loaded only once in this case.

 Assuming TCG was able to allocate a register for that,
 it would be live at most for one TB, so you'd have to
 load it at least once per TB, and with block chaining
 that wouldn't be efficient as you'd keep on reloading it.

Yes, but if there are better uses, the register can be flushed. Now
this is not possible since the register is always unavailable.



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   I still don't get where are this list of possible changes? Did I miss
   something in another thread?
 
  I'm referring to the patches I sent.
 
  Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
  patches 6 and 7 should be done for all hosts or none of them.
 
 The changes can be done in steps, but of course removing temp_buf from
 CPUState would need all targets to be converted first.
 
   On the TCG generated code, the env structure is used almost for every
   op, so it really makes sense to keep it in a register instead of having 
   to
   reload the address of env regularly from memory. Given it only affects
   TCG generated code, I don't see the point of portability here.
 
  For example, maybe the bugs in Sparc glibc could be avoided by using
  one of %i set of registers (not accessible from helpers) for AREG0
  within generated code instead of %g registers which seem to be
  fragile.
 
  First of all, but it's a different subject, I am not sure there are
  sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
  example the following code is probably wrong:
 
  /* Note: must be synced with dyngen-exec.h */
  #ifdef CONFIG_SOLARIS
  #define TCG_AREG0 TCG_REG_G2
  #elif defined(__sparc_v9__)
  #define TCG_AREG0 TCG_REG_G5
  #else
  #define TCG_AREG0 TCG_REG_G6
  #endif
 
  __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
  so the condition is probably wrong there. Secondly the SPARC ABI [1] on
  page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
  the right to use this registers.
 
 Yes, but the situation is not so nice. Please see this post for status
 as of 2010:
 http://article.gmane.org/gmane.comp.emulators.qemu/63610
 
 This is from Debian glibc 2.11.2-10:
 $ file /lib/libc-2.11.2.so
 /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
 2.6.18, stripped
 $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
 69648
 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
 37299
 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
 20635
 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
 11603
 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
 448
 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
 150
 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
 3052
 
 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,

From the calling convention point of view, sparc32 and sparc32plus are
the same ABI, so %g5 is also reserved for system use.

 or %g1 and %g5 for scratch purposes. However, it is the application
 registers %g2 to %g4 that are used heaviest. Looking inside the
 objdump it's easy to see that the uses are not for example saving or
 restoring, but actually using them without saving the previous value
 first:

Well, we have to define system and application. System is defined as
library in Chapter 6, and I don't see the libc there, and is probably
considered as part of the application.

 000211e0 __divdi3:
211e0:   9d e3 bf a0 save  %sp, -96, %sp
211e4:   90 10 00 18 mov  %i0, %o0
211e8:   92 10 00 19 mov  %i1, %o1
211ec:   94 10 00 1a mov  %i2, %o2
211f0:   96 10 00 1b mov  %i3, %o3
211f4:   80 a6 20 00 cmp  %i0, 0
211f8:   06 40 00 10 bl,pn   %icc, 21238 __divdi3+0x58
211fc:   a0 10 20 00 clr  %l0
21200:   80 a2 a0 00 cmp  %o2, 0
21204:   26 40 00 13 bl,a,pn   %icc, 21250 __divdi3+0x70
21208:   a0 38 00 10 xnor  %g0, %l0, %l0
2120c:   7f ff fe ed call  20dc0 __ashldi3+0x40
21210:   98 10 20 00 clr  %o4
21214:   84 10 00 08 mov  %o0, %g2
 
 ...whoops...
 
21218:   80 a4 20 00 cmp  %l0, 0
2121c:   02 40 00 04 be,pn   %icc, 2122c __divdi3+0x4c
21220:   86 10 00 09 mov  %o1, %g3
 
 ...whoops...
 
21224:   86 a0 00 09 subcc  %g0, %o1, %g3
21228:   84 60 00 02 subc  %g0, %g2, %g2
2122c:   b2 10 00 03 mov  %g3, %i1
21230:   81 cf e0 08 rett  %i7 + 8
21234:   90 10 00 02 mov  %g2, %o0
21238:   92 a0 00 19 subcc  %g0, %i1, %o1
2123c:   90 60 00 18 subc  %g0, %i0, %o0
21240:   80 a2 a0 00 cmp  %o2, 0
21244:   16 4f ff f2 bge  %icc, 2120c __divdi3+0x2c
21248:   a0 10 3f ff mov  -1, %l0
2124c:   a0 38 00 10 xnor  %g0, %l0, %l0
21250:   96 a0 00 0b subcc  %g0, %o3, %o3
21254:   10 6f ff ee b  %xcc, 2120c __divdi3+0x2c
21258:   94 60 00 0a subc  %g0, %o2, %o2
2125c:   01 00 00 00 nop
 
 This is libc from OpenBSD/Sparc64 4.9:
 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues
 laurent.desnog...@gmail.com wrote:
  On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote:
  [...]
  x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
  simple (only 2 movi_i32), the resulting code makes 2 access to env to
  save the two registers. Having to reload the env pointer each time to a
  register would clearly increase the size of this TB.
 
  I don't think TCG would be that simple, instead the pointer would be
  loaded only once in this case.
 
  Assuming TCG was able to allocate a register for that,
  it would be live at most for one TB, so you'd have to
  load it at least once per TB, and with block chaining
  that wouldn't be efficient as you'd keep on reloading it.
 
 Yes, but if there are better uses, the register can be flushed. Now
 this is not possible since the register is always unavailable.
 

What are the better uses, that justify to flush a register that is going
to be used three or four host asm later? 

In the current generated code, roughly one every four instruction
reference TCG_AREG0, so this register is really needed very often.

If you think TCG will be faster by having one more register in between
I suggest you to first optimize tcg_reg_alloc(), which simply spill
a random register, even if they are some allocated register that won't
be used until the end of the TB. You should also should check how often
TCG spills a register (in which case it would have benefit from one more
register). It happens less than 2000 times when booting an emulated mips
system on x86_64, while more than 16 TB are generated.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   I still don't get where are this list of possible changes? Did I miss
   something in another thread?
 
  I'm referring to the patches I sent.
 
  Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
  patches 6 and 7 should be done for all hosts or none of them.

 The changes can be done in steps, but of course removing temp_buf from
 CPUState would need all targets to be converted first.

   On the TCG generated code, the env structure is used almost for every
   op, so it really makes sense to keep it in a register instead of having 
   to
   reload the address of env regularly from memory. Given it only affects
   TCG generated code, I don't see the point of portability here.
 
  For example, maybe the bugs in Sparc glibc could be avoided by using
  one of %i set of registers (not accessible from helpers) for AREG0
  within generated code instead of %g registers which seem to be
  fragile.
 
  First of all, but it's a different subject, I am not sure there are
  sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
  example the following code is probably wrong:
 
  /* Note: must be synced with dyngen-exec.h */
  #ifdef CONFIG_SOLARIS
  #define TCG_AREG0 TCG_REG_G2
  #elif defined(__sparc_v9__)
  #define TCG_AREG0 TCG_REG_G5
  #else
  #define TCG_AREG0 TCG_REG_G6
  #endif
 
  __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
  so the condition is probably wrong there. Secondly the SPARC ABI [1] on
  page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
  the right to use this registers.

 Yes, but the situation is not so nice. Please see this post for status
 as of 2010:
 http://article.gmane.org/gmane.comp.emulators.qemu/63610

 This is from Debian glibc 2.11.2-10:
 $ file /lib/libc-2.11.2.so
 /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
 2.6.18, stripped
 $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
 69648
 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
 37299
 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
 20635
 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
 11603
 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
 448
 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
 150
 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
 3052

 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,

 From the calling convention point of view, sparc32 and sparc32plus are
 the same ABI, so %g5 is also reserved for system use.

 or %g1 and %g5 for scratch purposes. However, it is the application
 registers %g2 to %g4 that are used heaviest. Looking inside the
 objdump it's easy to see that the uses are not for example saving or
 restoring, but actually using them without saving the previous value
 first:

 Well, we have to define system and application. System is defined as
 library in Chapter 6, and I don't see the libc there, and is probably
 considered as part of the application.

No, for example unistd.h is described and even X11. GCC also says that
libraries should be compiled without using the registers:

http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options

 000211e0 __divdi3:
    211e0:       9d e3 bf a0     save  %sp, -96, %sp
    211e4:       90 10 00 18     mov  %i0, %o0
    211e8:       92 10 00 19     mov  %i1, %o1
    211ec:       94 10 00 1a     mov  %i2, %o2
    211f0:       96 10 00 1b     mov  %i3, %o3
    211f4:       80 a6 20 00     cmp  %i0, 0
    211f8:       06 40 00 10     bl,pn   %icc, 21238 __divdi3+0x58
    211fc:       a0 10 20 00     clr  %l0
    21200:       80 a2 a0 00     cmp  %o2, 0
    21204:       26 40 00 13     bl,a,pn   %icc, 21250 __divdi3+0x70
    21208:       a0 38 00 10     xnor  %g0, %l0, %l0
    2120c:       7f ff fe ed     call  20dc0 __ashldi3+0x40
    21210:       98 10 20 00     clr  %o4
    21214:       84 10 00 08     mov  %o0, %g2

 ...whoops...

    21218:       80 a4 20 00     cmp  %l0, 0
    2121c:       02 40 00 04     be,pn   %icc, 2122c __divdi3+0x4c
    21220:       86 10 00 09     mov  %o1, %g3

 ...whoops...

    21224:       86 a0 00 09     subcc  %g0, %o1, %g3
    21228:       84 60 00 02     subc  %g0, %g2, %g2
    2122c:       b2 10 00 03     mov  %g3, %i1
    21230:       81 cf e0 08     rett  %i7 + 8
    21234:       90 10 00 02     mov  %g2, %o0
    21238:       92 a0 00 19     subcc  %g0, %i1, %o1
    2123c:       90 60 00 18     subc  %g0, %i0, %o0
    21240:       80 a2 a0 00     cmp  %o2, 0
    21244:       16 4f ff f2     bge  %icc, 2120c __divdi3+0x2c
    21248:       a0 10 3f ff     mov  -1, %l0
    2124c:       a0 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues
 laurent.desnog...@gmail.com wrote:
  On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote:
  [...]
  x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
  simple (only 2 movi_i32), the resulting code makes 2 access to env to
  save the two registers. Having to reload the env pointer each time to a
  register would clearly increase the size of this TB.
 
  I don't think TCG would be that simple, instead the pointer would be
  loaded only once in this case.
 
  Assuming TCG was able to allocate a register for that,
  it would be live at most for one TB, so you'd have to
  load it at least once per TB, and with block chaining
  that wouldn't be efficient as you'd keep on reloading it.

 Yes, but if there are better uses, the register can be flushed. Now
 this is not possible since the register is always unavailable.


 What are the better uses, that justify to flush a register that is going
 to be used three or four host asm later?

It would obviously replace something else determined by TCG.

 In the current generated code, roughly one every four instruction
 reference TCG_AREG0, so this register is really needed very often.

 If you think TCG will be faster by having one more register in between
 I suggest you to first optimize tcg_reg_alloc(), which simply spill
 a random register, even if they are some allocated register that won't
 be used until the end of the TB. You should also should check how often
 TCG spills a register (in which case it would have benefit from one more
 register). It happens less than 2000 times when booting an emulated mips
 system on x86_64, while more than 16 TB are generated.

Right, on a modern CPU with lots of registers, one additional register
won't be helpful, but on i386 the situation should be very different,
there are very few registers.



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
I still don't get where are this list of possible changes? Did I miss
something in another thread?
  
   I'm referring to the patches I sent.
  
   Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
   patches 6 and 7 should be done for all hosts or none of them.
 
  The changes can be done in steps, but of course removing temp_buf from
  CPUState would need all targets to be converted first.
 
On the TCG generated code, the env structure is used almost for every
op, so it really makes sense to keep it in a register instead of 
having to
reload the address of env regularly from memory. Given it only affects
TCG generated code, I don't see the point of portability here.
  
   For example, maybe the bugs in Sparc glibc could be avoided by using
   one of %i set of registers (not accessible from helpers) for AREG0
   within generated code instead of %g registers which seem to be
   fragile.
  
   First of all, but it's a different subject, I am not sure there are
   sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
   example the following code is probably wrong:
  
   /* Note: must be synced with dyngen-exec.h */
   #ifdef CONFIG_SOLARIS
   #define TCG_AREG0 TCG_REG_G2
   #elif defined(__sparc_v9__)
   #define TCG_AREG0 TCG_REG_G5
   #else
   #define TCG_AREG0 TCG_REG_G6
   #endif
  
   __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
   so the condition is probably wrong there. Secondly the SPARC ABI [1] on
   page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
   the right to use this registers.
 
  Yes, but the situation is not so nice. Please see this post for status
  as of 2010:
  http://article.gmane.org/gmane.comp.emulators.qemu/63610
 
  This is from Debian glibc 2.11.2-10:
  $ file /lib/libc-2.11.2.so
  /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
  version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
  2.6.18, stripped
  $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
  69648
  $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
  37299
  $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
  20635
  $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
  11603
  $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
  448
  $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
  150
  $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
  3052
 
  Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,
 
  From the calling convention point of view, sparc32 and sparc32plus are
  the same ABI, so %g5 is also reserved for system use.
 
  or %g1 and %g5 for scratch purposes. However, it is the application
  registers %g2 to %g4 that are used heaviest. Looking inside the
  objdump it's easy to see that the uses are not for example saving or
  restoring, but actually using them without saving the previous value
  first:
 
  Well, we have to define system and application. System is defined as
  library in Chapter 6, and I don't see the libc there, and is probably
  considered as part of the application.
 
 No, for example unistd.h is described and even X11. GCC also says that
 libraries should be compiled without using the registers:
 
 http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options
 
  000211e0 __divdi3:
     211e0:       9d e3 bf a0     save  %sp, -96, %sp
     211e4:       90 10 00 18     mov  %i0, %o0
     211e8:       92 10 00 19     mov  %i1, %o1
     211ec:       94 10 00 1a     mov  %i2, %o2
     211f0:       96 10 00 1b     mov  %i3, %o3
     211f4:       80 a6 20 00     cmp  %i0, 0
     211f8:       06 40 00 10     bl,pn   %icc, 21238 __divdi3+0x58
     211fc:       a0 10 20 00     clr  %l0
     21200:       80 a2 a0 00     cmp  %o2, 0
     21204:       26 40 00 13     bl,a,pn   %icc, 21250 __divdi3+0x70
     21208:       a0 38 00 10     xnor  %g0, %l0, %l0
     2120c:       7f ff fe ed     call  20dc0 __ashldi3+0x40
     21210:       98 10 20 00     clr  %o4
     21214:       84 10 00 08     mov  %o0, %g2
 
  ...whoops...
 
     21218:       80 a4 20 00     cmp  %l0, 0
     2121c:       02 40 00 04     be,pn   %icc, 2122c __divdi3+0x4c
     21220:       86 10 00 09     mov  %o1, %g3
 
  ...whoops...
 
     21224:       86 a0 00 09     subcc  %g0, %o1, %g3
     21228:       84 60 00 02     subc  %g0, %g2, %g2
     2122c:       b2 10 00 03     mov  %g3, %i1
     21230:       81 cf e0 08     rett  %i7 + 8
     21234:       90 10 00 02     mov  %g2, %o0
     21238:       92 a0 00 19     subcc  %g0, %i1, %o1
     2123c:       90 60 00 18     subc  %g0, 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
I still don't get where are this list of possible changes? Did I miss
something in another thread?
  
   I'm referring to the patches I sent.
  
   Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
   patches 6 and 7 should be done for all hosts or none of them.
 
  The changes can be done in steps, but of course removing temp_buf from
  CPUState would need all targets to be converted first.
 
On the TCG generated code, the env structure is used almost for every
op, so it really makes sense to keep it in a register instead of 
having to
reload the address of env regularly from memory. Given it only 
affects
TCG generated code, I don't see the point of portability here.
  
   For example, maybe the bugs in Sparc glibc could be avoided by using
   one of %i set of registers (not accessible from helpers) for AREG0
   within generated code instead of %g registers which seem to be
   fragile.
  
   First of all, but it's a different subject, I am not sure there are
   sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
   example the following code is probably wrong:
  
   /* Note: must be synced with dyngen-exec.h */
   #ifdef CONFIG_SOLARIS
   #define TCG_AREG0 TCG_REG_G2
   #elif defined(__sparc_v9__)
   #define TCG_AREG0 TCG_REG_G5
   #else
   #define TCG_AREG0 TCG_REG_G6
   #endif
  
   __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
   so the condition is probably wrong there. Secondly the SPARC ABI [1] on
   page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has
   the right to use this registers.
 
  Yes, but the situation is not so nice. Please see this post for status
  as of 2010:
  http://article.gmane.org/gmane.comp.emulators.qemu/63610
 
  This is from Debian glibc 2.11.2-10:
  $ file /lib/libc-2.11.2.so
  /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
  version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
  2.6.18, stripped
  $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
  69648
  $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
  37299
  $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
  20635
  $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
  11603
  $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
  448
  $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
  150
  $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
  3052
 
  Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,
 
  From the calling convention point of view, sparc32 and sparc32plus are
  the same ABI, so %g5 is also reserved for system use.
 
  or %g1 and %g5 for scratch purposes. However, it is the application
  registers %g2 to %g4 that are used heaviest. Looking inside the
  objdump it's easy to see that the uses are not for example saving or
  restoring, but actually using them without saving the previous value
  first:
 
  Well, we have to define system and application. System is defined as
  library in Chapter 6, and I don't see the libc there, and is probably
  considered as part of the application.

 No, for example unistd.h is described and even X11. GCC also says that
 libraries should be compiled without using the registers:

 http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options

  000211e0 __divdi3:
     211e0:       9d e3 bf a0     save  %sp, -96, %sp
     211e4:       90 10 00 18     mov  %i0, %o0
     211e8:       92 10 00 19     mov  %i1, %o1
     211ec:       94 10 00 1a     mov  %i2, %o2
     211f0:       96 10 00 1b     mov  %i3, %o3
     211f4:       80 a6 20 00     cmp  %i0, 0
     211f8:       06 40 00 10     bl,pn   %icc, 21238 __divdi3+0x58
     211fc:       a0 10 20 00     clr  %l0
     21200:       80 a2 a0 00     cmp  %o2, 0
     21204:       26 40 00 13     bl,a,pn   %icc, 21250 __divdi3+0x70
     21208:       a0 38 00 10     xnor  %g0, %l0, %l0
     2120c:       7f ff fe ed     call  20dc0 __ashldi3+0x40
     21210:       98 10 20 00     clr  %o4
     21214:       84 10 00 08     mov  %o0, %g2
 
  ...whoops...
 
     21218:       80 a4 20 00     cmp  %l0, 0
     2121c:       02 40 00 04     be,pn   %icc, 2122c __divdi3+0x4c
     21220:       86 10 00 09     mov  %o1, %g3
 
  ...whoops...
 
     21224:       86 a0 00 09     subcc  %g0, %o1, %g3
     21228:       84 60 00 02     subc  %g0, %g2, %g2
     2122c:       b2 10 00 03     mov  %g3, %i1
     21230:       81 cf e0 08     rett  %i7 + 8
     21234:       90 10 00 02     mov  %g2, %o0
     21238:       92 

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues
  laurent.desnog...@gmail.com wrote:
   On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote:
   [...]
   x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
   simple (only 2 movi_i32), the resulting code makes 2 access to env to
   save the two registers. Having to reload the env pointer each time to a
   register would clearly increase the size of this TB.
  
   I don't think TCG would be that simple, instead the pointer would be
   loaded only once in this case.
  
   Assuming TCG was able to allocate a register for that,
   it would be live at most for one TB, so you'd have to
   load it at least once per TB, and with block chaining
   that wouldn't be efficient as you'd keep on reloading it.
 
  Yes, but if there are better uses, the register can be flushed. Now
  this is not possible since the register is always unavailable.
 
 
  What are the better uses, that justify to flush a register that is going
  to be used three or four host asm later?
 
 It would obviously replace something else determined by TCG.

The register will be free only for a few host instructions. Could you 
please give more concrete example about such a usage?

  In the current generated code, roughly one every four instruction
  reference TCG_AREG0, so this register is really needed very often.
 
  If you think TCG will be faster by having one more register in between
  I suggest you to first optimize tcg_reg_alloc(), which simply spill
  a random register, even if they are some allocated register that won't
  be used until the end of the TB. You should also should check how often
  TCG spills a register (in which case it would have benefit from one more
  register). It happens less than 2000 times when booting an emulated mips
  system on x86_64, while more than 16 TB are generated.
 
 Right, on a modern CPU with lots of registers, one additional register
 won't be helpful, but on i386 the situation should be very different,
 there are very few registers.
 

On i386, I indeed get a lot more of spilled registers, that is 34. Still
that number is not that high, it's less than two times per TB. If we
consider that these register spills are pure loss (which is not always
the case, sometime the spilled register is actually never used later, so
it's just an anticipated save), that's 4 load/store per TB.

It means to compensate, the env register should not be loaded more than
4 times in a TB, which looks like quite difficult to achieve given how
often this register is used.

Please also note that spilling globals currently need access to the env
pointer, which might not be loaded, so another register spill is need to
load it. This will make the code a lot more complex than now to avoid a
deadlock (probably by spilling local temps first).

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Aurelien Jarno
On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno 
aurel...@aurel32.net wrote:
 I still don't get where are this list of possible changes? Did I 
 miss
 something in another thread?
   
I'm referring to the patches I sent.
   
Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think
patches 6 and 7 should be done for all hosts or none of them.
  
   The changes can be done in steps, but of course removing temp_buf from
   CPUState would need all targets to be converted first.
  
 On the TCG generated code, the env structure is used almost for 
 every
 op, so it really makes sense to keep it in a register instead of 
 having to
 reload the address of env regularly from memory. Given it only 
 affects
 TCG generated code, I don't see the point of portability here.
   
For example, maybe the bugs in Sparc glibc could be avoided by using
one of %i set of registers (not accessible from helpers) for AREG0
within generated code instead of %g registers which seem to be
fragile.
   
First of all, but it's a different subject, I am not sure there are
sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
example the following code is probably wrong:
   
/* Note: must be synced with dyngen-exec.h */
#ifdef CONFIG_SOLARIS
#define TCG_AREG0 TCG_REG_G2
#elif defined(__sparc_v9__)
#define TCG_AREG0 TCG_REG_G5
#else
#define TCG_AREG0 TCG_REG_G6
#endif
   
__sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+,
so the condition is probably wrong there. Secondly the SPARC ABI [1] 
on
page 23 that %g5 to %g7 are reserved for system. I don't think QEMU 
has
the right to use this registers.
  
   Yes, but the situation is not so nice. Please see this post for status
   as of 2010:
   http://article.gmane.org/gmane.comp.emulators.qemu/63610
  
   This is from Debian glibc 2.11.2-10:
   $ file /lib/libc-2.11.2.so
   /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
   version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
   2.6.18, stripped
   $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
   69648
   $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
   37299
   $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
   20635
   $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
   11603
   $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
   448
   $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
   150
   $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
   3052
  
   Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,
  
   From the calling convention point of view, sparc32 and sparc32plus are
   the same ABI, so %g5 is also reserved for system use.
  
   or %g1 and %g5 for scratch purposes. However, it is the application
   registers %g2 to %g4 that are used heaviest. Looking inside the
   objdump it's easy to see that the uses are not for example saving or
   restoring, but actually using them without saving the previous value
   first:
  
   Well, we have to define system and application. System is defined as
   library in Chapter 6, and I don't see the libc there, and is probably
   considered as part of the application.
 
  No, for example unistd.h is described and even X11. GCC also says that
  libraries should be compiled without using the registers:
 
  http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options
 
   000211e0 __divdi3:
      211e0:       9d e3 bf a0     save  %sp, -96, %sp
      211e4:       90 10 00 18     mov  %i0, %o0
      211e8:       92 10 00 19     mov  %i1, %o1
      211ec:       94 10 00 1a     mov  %i2, %o2
      211f0:       96 10 00 1b     mov  %i3, %o3
      211f4:       80 a6 20 00     cmp  %i0, 0
      211f8:       06 40 00 10     bl,pn   %icc, 21238 __divdi3+0x58
      211fc:       a0 10 20 00     clr  %l0
      21200:       80 a2 a0 00     cmp  %o2, 0
      21204:       26 40 00 13     bl,a,pn   %icc, 21250 __divdi3+0x70
      21208:       a0 38 00 10     xnor  %g0, %l0, %l0
      2120c:       7f ff fe ed     call  20dc0 __ashldi3+0x40
      21210:       98 10 20 00     clr  %o4
      21214:       84 10 00 08     mov  %o0, %g2
  
   ...whoops...
  
      21218:       80 a4 20 00     cmp  %l0, 0
      2121c:       02 40 00 04     be,pn   %icc, 2122c __divdi3+0x4c
      21220:       86 10 00 09     mov  %o1, %g3
  
   ...whoops...
  
      21224:       86 a0 00 09     subcc  %g0, %o1, %g3
      21228:     

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 5:03 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues
  laurent.desnog...@gmail.com wrote:
   On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com 
   wrote:
   [...]
   x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite
   simple (only 2 movi_i32), the resulting code makes 2 access to env to
   save the two registers. Having to reload the env pointer each time to 
   a
   register would clearly increase the size of this TB.
  
   I don't think TCG would be that simple, instead the pointer would be
   loaded only once in this case.
  
   Assuming TCG was able to allocate a register for that,
   it would be live at most for one TB, so you'd have to
   load it at least once per TB, and with block chaining
   that wouldn't be efficient as you'd keep on reloading it.
 
  Yes, but if there are better uses, the register can be flushed. Now
  this is not possible since the register is always unavailable.
 
 
  What are the better uses, that justify to flush a register that is going
  to be used three or four host asm later?

 It would obviously replace something else determined by TCG.

 The register will be free only for a few host instructions. Could you
 please give more concrete example about such a usage?

  In the current generated code, roughly one every four instruction
  reference TCG_AREG0, so this register is really needed very often.
 
  If you think TCG will be faster by having one more register in between
  I suggest you to first optimize tcg_reg_alloc(), which simply spill
  a random register, even if they are some allocated register that won't
  be used until the end of the TB. You should also should check how often
  TCG spills a register (in which case it would have benefit from one more
  register). It happens less than 2000 times when booting an emulated mips
  system on x86_64, while more than 16 TB are generated.

 Right, on a modern CPU with lots of registers, one additional register
 won't be helpful, but on i386 the situation should be very different,
 there are very few registers.


 On i386, I indeed get a lot more of spilled registers, that is 34. Still
 that number is not that high, it's less than two times per TB. If we
 consider that these register spills are pure loss (which is not always
 the case, sometime the spilled register is actually never used later, so
 it's just an anticipated save), that's 4 load/store per TB.

 It means to compensate, the env register should not be loaded more than
 4 times in a TB, which looks like quite difficult to achieve given how
 often this register is used.

 Please also note that spilling globals currently need access to the env
 pointer, which might not be loaded, so another register spill is need to
 load it. This will make the code a lot more complex than now to avoid a
 deadlock (probably by spilling local temps first).

OK, this doesn't look so attractive after all.



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-15 Thread Blue Swirl
On Sun, May 15, 2011 at 5:06 PM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote:
  On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net 
  wrote:
   On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote:
   On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net 
   wrote:
On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote:
On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno 
aurel...@aurel32.net wrote:
 I still don't get where are this list of possible changes? Did I 
 miss
 something in another thread?
   
I'm referring to the patches I sent.
   
Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I 
think
patches 6 and 7 should be done for all hosts or none of them.
  
   The changes can be done in steps, but of course removing temp_buf from
   CPUState would need all targets to be converted first.
  
 On the TCG generated code, the env structure is used almost for 
 every
 op, so it really makes sense to keep it in a register instead of 
 having to
 reload the address of env regularly from memory. Given it only 
 affects
 TCG generated code, I don't see the point of portability here.
   
For example, maybe the bugs in Sparc glibc could be avoided by using
one of %i set of registers (not accessible from helpers) for AREG0
within generated code instead of %g registers which seem to be
fragile.
   
First of all, but it's a different subject, I am not sure there are
sparc glibc bugs, I'd rather says QEMU mis-uses some register. For
example the following code is probably wrong:
   
/* Note: must be synced with dyngen-exec.h */
#ifdef CONFIG_SOLARIS
#define TCG_AREG0 TCG_REG_G2
#elif defined(__sparc_v9__)
#define TCG_AREG0 TCG_REG_G5
#else
#define TCG_AREG0 TCG_REG_G6
#endif
   
__sparc_v9__ can set on the 32-bit ABI, when the compiler targets 
V8+,
so the condition is probably wrong there. Secondly the SPARC ABI [1] 
on
page 23 that %g5 to %g7 are reserved for system. I don't think QEMU 
has
the right to use this registers.
  
   Yes, but the situation is not so nice. Please see this post for status
   as of 2010:
   http://article.gmane.org/gmane.comp.emulators.qemu/63610
  
   This is from Debian glibc 2.11.2-10:
   $ file /lib/libc-2.11.2.so
   /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS,
   version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux
   2.6.18, stripped
   $ objdump -d /lib/libc.so.6 |grep %g1|wc -l
   69648
   $ objdump -d /lib/libc.so.6 |grep %g2|wc -l
   37299
   $ objdump -d /lib/libc.so.6 |grep %g3|wc -l
   20635
   $ objdump -d /lib/libc.so.6 |grep %g4|wc -l
   11603
   $ objdump -d /lib/libc.so.6 |grep %g5|wc -l
   448
   $ objdump -d /lib/libc.so.6 |grep %g6|wc -l
   150
   $ objdump -d /lib/libc.so.6 |grep %g7|wc -l
   3052
  
   Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7,
  
   From the calling convention point of view, sparc32 and sparc32plus are
   the same ABI, so %g5 is also reserved for system use.
  
   or %g1 and %g5 for scratch purposes. However, it is the application
   registers %g2 to %g4 that are used heaviest. Looking inside the
   objdump it's easy to see that the uses are not for example saving or
   restoring, but actually using them without saving the previous value
   first:
  
   Well, we have to define system and application. System is defined as
   library in Chapter 6, and I don't see the libc there, and is probably
   considered as part of the application.
 
  No, for example unistd.h is described and even X11. GCC also says that
  libraries should be compiled without using the registers:
 
  http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options
 
   000211e0 __divdi3:
      211e0:       9d e3 bf a0     save  %sp, -96, %sp
      211e4:       90 10 00 18     mov  %i0, %o0
      211e8:       92 10 00 19     mov  %i1, %o1
      211ec:       94 10 00 1a     mov  %i2, %o2
      211f0:       96 10 00 1b     mov  %i3, %o3
      211f4:       80 a6 20 00     cmp  %i0, 0
      211f8:       06 40 00 10     bl,pn   %icc, 21238 __divdi3+0x58
      211fc:       a0 10 20 00     clr  %l0
      21200:       80 a2 a0 00     cmp  %o2, 0
      21204:       26 40 00 13     bl,a,pn   %icc, 21250 __divdi3+0x70
      21208:       a0 38 00 10     xnor  %g0, %l0, %l0
      2120c:       7f ff fe ed     call  20dc0 __ashldi3+0x40
      21210:       98 10 20 00     clr  %o4
      21214:       84 10 00 08     mov  %o0, %g2
  
   ...whoops...
  
      21218:       80 a4 20 00     cmp  %l0, 0
      2121c:       02 40 00 04     be,pn   %icc, 2122c __divdi3+0x4c
      21220:       86 10 00 09     mov  %o1, %g3
  
   

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-14 Thread Aurelien Jarno
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
 Here's a RFC series for eliminating AREG0.
 
 Blue Swirl (11):
   Move user emulator stuff from cpu-exec.c to user-exec.c
   Delete unused tb_invalidate_page_range
 
 The above should be OK to commit.
 
   cpu_loop_exit: avoid using AREG0
   Delegate setup of TCG temporaries to targets
 
 These two are not, unless the overall plan is OK.
 
   TCG: fix negative frame offset calculations
   TCG/x86: use stack for TCG temps
   TCG/Sparc64: use stack for TCG temps
 
 But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts.
 
   Add CONFIG_TARGET_NEEDS_AREG0
   Don't compile legacy qemu_ld/st functions if target doesn't need them
 
 Should be OK, though the latter patch only touches x86.
 
   Add new qemu_ld and qemu_st functions
   sparc: use new qemu_ld and qemu_st functions
 
 The last two compile but QEMU segfaults. I just made a naive
 conversion for getting comments.
 

What is the goal behing removing TCG_AREG0? If it is speed improvement,
can you please provide some benchmarks?

The env register is used very often (basically for every load/store, but
also a lot of helpers), so it makes sense to reserve a register for it.

For what I understand from your patch series, you prefer to pass this
register explicitly to TCG functions. This basically means this TCG
global will be loaded to host register as soon as it is used, but also
regularly, as globals are saved back to their canonical location before
an helper or a load/store.

So it seems that this patch series will just allowing the env register
to change over time, though it will not spare one more register for the 
TCG code, and it will emit longer TCG code to regularly reload the env
global into a host register.

In any case at then end benchmarks is what are need to decided, TCG has
always shown that performance improvements doesn't match the improvement
analysis.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-14 Thread Blue Swirl
On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote:
 On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
 Here's a RFC series for eliminating AREG0.

 Blue Swirl (11):
   Move user emulator stuff from cpu-exec.c to user-exec.c
   Delete unused tb_invalidate_page_range

 The above should be OK to commit.

   cpu_loop_exit: avoid using AREG0
   Delegate setup of TCG temporaries to targets

 These two are not, unless the overall plan is OK.

   TCG: fix negative frame offset calculations
   TCG/x86: use stack for TCG temps
   TCG/Sparc64: use stack for TCG temps

 But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts.

   Add CONFIG_TARGET_NEEDS_AREG0
   Don't compile legacy qemu_ld/st functions if target doesn't need them

 Should be OK, though the latter patch only touches x86.

   Add new qemu_ld and qemu_st functions
   sparc: use new qemu_ld and qemu_st functions

 The last two compile but QEMU segfaults. I just made a naive
 conversion for getting comments.


 What is the goal behing removing TCG_AREG0? If it is speed improvement,
 can you please provide some benchmarks?

There was some discussion earlier about why this (or parts of the
conversion) may be a speed improvement:

http://article.gmane.org/gmane.comp.emulators.qemu/101826
http://article.gmane.org/gmane.comp.emulators.qemu/102156

There are no benchmarks yet. In fact, it may be difficult to make
those without performing the removal completely.

For example, patch 3/11 makes cpu_loop_exit take CPUState as a
parameter instead of using global env, which would be available and
the register is reserved anyway. This would only decrease performance
at this stage, unless a complete conversion is done. I suspect the
same would happen when moving all helpers from op_helper.c to
helper.c. But after the whole conversion, this would be a neutral (no
extra registers used) or even beneficial change (the code is free to
use one more register).

 The env register is used very often (basically for every load/store, but
 also a lot of helpers), so it makes sense to reserve a register for it.

 For what I understand from your patch series, you prefer to pass this
 register explicitly to TCG functions. This basically means this TCG
 global will be loaded to host register as soon as it is used, but also
 regularly, as globals are saved back to their canonical location before
 an helper or a load/store.

 So it seems that this patch series will just allowing the env register
 to change over time, though it will not spare one more register for the
 TCG code, and it will emit longer TCG code to regularly reload the env
 global into a host register.

But there will be one more register available in some cases. In other
cases, the number of registers used does not change. Moving the
registers around is what worries me too.

But there are other effects too, the helpers are now compiled so that
the global env register is not used. Especially on hosts with low
number of registers this is not optimal.

 In any case at then end benchmarks is what are need to decided, TCG has
 always shown that performance improvements doesn't match the improvement
 analysis.

If this turns out to be a bad idea, it means that the reverse
conversion will be beneficial and we should convert helper.c code to
op_helper.c and take advantage of the global env in a register. This
was actually my standpoint when the discussion started, but I'm
interested to see if this approach would work better.

There are other opportunities as well, for example while refactoring,
qemu_ld/st would be changed. Maybe your constant patches would be
useful there or they could benefit from a rearranged interface.



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-14 Thread Aurelien Jarno
On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
 On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote:
  On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
  Here's a RFC series for eliminating AREG0.
 
  Blue Swirl (11):
    Move user emulator stuff from cpu-exec.c to user-exec.c
    Delete unused tb_invalidate_page_range
 
  The above should be OK to commit.
 
    cpu_loop_exit: avoid using AREG0
    Delegate setup of TCG temporaries to targets
 
  These two are not, unless the overall plan is OK.
 
    TCG: fix negative frame offset calculations
    TCG/x86: use stack for TCG temps
    TCG/Sparc64: use stack for TCG temps
 
  But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts.
 
    Add CONFIG_TARGET_NEEDS_AREG0
    Don't compile legacy qemu_ld/st functions if target doesn't need them
 
  Should be OK, though the latter patch only touches x86.
 
    Add new qemu_ld and qemu_st functions
    sparc: use new qemu_ld and qemu_st functions
 
  The last two compile but QEMU segfaults. I just made a naive
  conversion for getting comments.
 
 
  What is the goal behing removing TCG_AREG0? If it is speed improvement,
  can you please provide some benchmarks?
 
 There was some discussion earlier about why this (or parts of the
 conversion) may be a speed improvement:
 
 http://article.gmane.org/gmane.comp.emulators.qemu/101826
 http://article.gmane.org/gmane.comp.emulators.qemu/102156

Ok, looks like I have missed that.

 There are no benchmarks yet. In fact, it may be difficult to make
 those without performing the removal completely.
 
 For example, patch 3/11 makes cpu_loop_exit take CPUState as a
 parameter instead of using global env, which would be available and
 the register is reserved anyway. This would only decrease performance
 at this stage, unless a complete conversion is done. I suspect the
 same would happen when moving all helpers from op_helper.c to
 helper.c. But after the whole conversion, this would be a neutral (no
 extra registers used) or even beneficial change (the code is free to
 use one more register).
 
  The env register is used very often (basically for every load/store, but
  also a lot of helpers), so it makes sense to reserve a register for it.
 
  For what I understand from your patch series, you prefer to pass this
  register explicitly to TCG functions. This basically means this TCG
  global will be loaded to host register as soon as it is used, but also
  regularly, as globals are saved back to their canonical location before
  an helper or a load/store.
 
  So it seems that this patch series will just allowing the env register
  to change over time, though it will not spare one more register for the
  TCG code, and it will emit longer TCG code to regularly reload the env
  global into a host register.
 
 But there will be one more register available in some cases. In other

Inside the TCG code, it will basically happens very rarely, given
load/store are really the most used instructions, and they need to load
the env register.

 cases, the number of registers used does not change. Moving the
 registers around is what worries me too.

 But there are other effects too, the helpers are now compiled so that
 the global env register is not used. Especially on hosts with low
 number of registers this is not optimal.

Most helpers are very small functions, so I am not sure more registers
will help.

  In any case at then end benchmarks is what are need to decided, TCG has
  always shown that performance improvements doesn't match the improvement
  analysis.
 
 If this turns out to be a bad idea, it means that the reverse
 conversion will be beneficial and we should convert helper.c code to
 op_helper.c and take advantage of the global env in a register. This
 was actually my standpoint when the discussion started, but I'm
 interested to see if this approach would work better.

Does it mean that you plan to do the code changes in git without really
benchmarking, and revert all the changes later if it was a bad idea?

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination

2011-05-14 Thread Aurelien Jarno
On Sat, May 14, 2011 at 11:16:16PM +0200, Aurelien Jarno wrote:
 On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
  Here's a RFC series for eliminating AREG0.
  
  Blue Swirl (11):
Move user emulator stuff from cpu-exec.c to user-exec.c
Delete unused tb_invalidate_page_range
  
  The above should be OK to commit.
  
cpu_loop_exit: avoid using AREG0
Delegate setup of TCG temporaries to targets
  
  These two are not, unless the overall plan is OK.
  
TCG: fix negative frame offset calculations
TCG/x86: use stack for TCG temps
TCG/Sparc64: use stack for TCG temps
  
  But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts.
  
Add CONFIG_TARGET_NEEDS_AREG0
Don't compile legacy qemu_ld/st functions if target doesn't need them
  
  Should be OK, though the latter patch only touches x86.
  
Add new qemu_ld and qemu_st functions
sparc: use new qemu_ld and qemu_st functions
  
  The last two compile but QEMU segfaults. I just made a naive
  conversion for getting comments.
  
 
 What is the goal behing removing TCG_AREG0? If it is speed improvement,
 can you please provide some benchmarks?
 
 The env register is used very often (basically for every load/store, but
 also a lot of helpers), so it makes sense to reserve a register for it.
 
 For what I understand from your patch series, you prefer to pass this
 register explicitly to TCG functions. This basically means this TCG
 global will be loaded to host register as soon as it is used, but also
 regularly, as globals are saved back to their canonical location before
 an helper or a load/store.
 
 So it seems that this patch series will just allowing the env register
 to change over time, though it will not spare one more register for the 
 TCG code, and it will emit longer TCG code to regularly reload the env
 global into a host register.
 

One way to solve that would be to use a env register only at the TCG
level, but not at the GCC level. That means loading the value of env
into TCG_AREG0 in the prologue.

That way, all the TCG code will have direct access to env, and the GCC
generated code (which includes the helper) don't need to have this
register reserved. I doubt the latter will increase the performance, but
I understand the cleanliness argument.

This also means minimal code changes, especially as most (if not all) 
TCG targets seems to have TCG_AREG0 as a callee saved arguments.
Basically the path to do the changes could be:
- Make sure all targets have TCG_AREG0 as a callee saved register
- Load env into TCG_AREG0 (and rename it as a TCG_ENV?) in the prologue 
  (could be by passing it as argument to the prologue code)
- Change all helpers to not use env directly
- Change softmmu code to not use env directly
- Remove GCC hack to reserve a register for env.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net