Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 But most other ops probably don't need any additional registers. It could still be that with the extra register, some values could be kept there instead of flushing to storage. cases, the number of registers used does not change. Moving the registers around is what worries me too. But there are other effects too, the helpers are now compiled so that the global env register is not used. Especially on hosts with low number of registers this is not optimal. Most helpers are very small functions, so I am not sure more registers will help. $ nm --print-size --defined-only --size-sort --reverse-sort obj-amd64/sparc-softmmu/op_helper.o |head -20 3a70 0c75 T helper_st_asi 46f0 086d T helper_ld_asi 2550 041b T __stw_mmu 12c0 040c T __stq_mmu 1d10 03c5 T __stl_mmu 1a90 0277 T __ldq_mmu 22f0 025f T __ldl_mmu 2b50 024f T
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 On which target is that? It looks like that some variables could be mapped directly as a global to avoid so many load/stores. Anyway, even this simple load instruction, needs a base register, and it mosts cases this register is TCG_AREG0. I agree that it can be replaced by another base register, but in anycase, a register has to be used for that. It means after such an instruction is encountered, one register is reserved for this global up to the next helper/memory load store, so there isn't any register spared, but more instructions generated to explicitly load this register. I also just realized all globals registered through tcg_global_mem_new() needs a base register and currently it is TCG_AREG0, that means even a mov_i32 might need access to TCG_AREG0. How do you plan to solve that part? 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 But most other ops probably don't
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). Laurent
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 11:27:17AM +0200, Laurent Desnogues wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). I also thought about making it the one defined by the ABI as holding the first parameter value, to avoid some register moves, but I am also afraid this is not necessarily a good idea. It is usually a caller saved register (because the caller is anyway overriding its value), which means that it has to be saved before calling every helper, and it probably means more register move in that case. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 On which target is that? It looks like that some variables could be mapped directly as a global to avoid so many load/stores. Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. Anyway, even this simple load instruction, needs a base register, and it mosts cases this register is TCG_AREG0. I agree that it can be replaced by another base register, but in anycase, a register has to be used for that. It means after such an instruction is encountered, one register is reserved for this global up to the next helper/memory load store, so there isn't any register spared, but more instructions generated to explicitly load this register. I also just realized all globals registered through tcg_global_mem_new() needs a base register and currently it is TCG_AREG0, that means even a mov_i32 might need access to TCG_AREG0. How do you plan to solve that part? In most cases the backing storage is CPUState, so cpu_env would be a logical choice. 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 On which target is that? It looks like that some variables could be mapped directly as a global to avoid so many load/stores. Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. Anyway, even this simple load instruction, needs a base register, and it mosts cases this register is TCG_AREG0. I agree that it can be replaced by another base register, but in anycase, a register has to be used for that. It means after such an instruction is encountered, one register is reserved for this global up to the next helper/memory load store, so there isn't any register spared, but more instructions generated to explicitly load this register. I also just realized all globals registered through tcg_global_mem_new() needs a base register and currently it is TCG_AREG0, that means even a mov_i32 might need access to TCG_AREG0. How do you plan to solve that part? In most cases the backing storage is CPUState, so cpu_env would be a logical choice. 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too.
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 On which target is that? It looks like that some variables could be mapped directly as a global to avoid so many load/stores. Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. Anyway, even this simple load instruction, needs a base register, and it mosts cases this register is TCG_AREG0. I agree that it can be replaced by another base register, but in anycase, a register has to be used for that. It means after such an instruction is encountered, one register is reserved for this global up to the next helper/memory load store, so there isn't any register spared, but more instructions generated to explicitly load this register. I also just realized all globals registered through tcg_global_mem_new() needs a base register and currently it is TCG_AREG0, that means even a mov_i32 might need access to TCG_AREG0. How do you plan to solve that part? In most cases the backing storage is CPUState, so cpu_env would be a logical choice. 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too. What other uses? It is needed for almost every TB, so there is not other possible use. Let's take an example, a simple TB from mips on x86_64: | IN: start_kernel | 0x80510958: jal0x8051d50c | 0x8051095c: nop | | OP after liveness analysis: | movi_i32 ra,$0x80510960 | movi_i32 PC,$0x8051d50c | exit_tb $0x0 | end | OUT: [size=28] | 0x40f22960: mov$0x80510960,%ebp | 0x40f22965: mov%ebp,0x7c(%r14) | 0x40f22969: mov$0x8051d50c,%ebp | 0x40f2296e: mov%ebp,0x80(%r14) | 0x40f22975: xor%eax,%eax | 0x40f22977: jmpq 0x115042e x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. Another question, imagine env is not in a register anymore. What kind of code do you plan to use to load the env pointer into a register? This is something we need to know to evaluate the cost of not having a fixed register for env withing the TCG code. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too. What other uses? It is needed for almost every TB, so there is not other possible use. Let's take an example, a simple TB from mips on x86_64: | IN: start_kernel | 0x80510958: jal 0x8051d50c | 0x8051095c: nop | | OP after liveness analysis: | movi_i32 ra,$0x80510960 | movi_i32 PC,$0x8051d50c | exit_tb $0x0 | end | OUT: [size=28] | 0x40f22960: mov $0x80510960,%ebp | 0x40f22965: mov %ebp,0x7c(%r14) | 0x40f22969: mov $0x8051d50c,%ebp | 0x40f2296e: mov %ebp,0x80(%r14) | 0x40f22975: xor %eax,%eax | 0x40f22977: jmpq 0x115042e x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Another question, imagine env is not in a register anymore. What kind of code do you plan to use to load the env pointer into a register? This is something we need to know to evaluate the cost of not having
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code would prevent you to use a register from the %i set for it. On the other hand, I have objections to remove uses of TCG_AREG0 from the TCG part. This register is part of the TCG design and thus used very often. In any case we have to keep it for accesses to the globals, so we can keep it for softmmu load/stores too. Right, any changes would need to be backed by performance figures. If we go for the removal of TCG_AREG0 from the GCC side, it probably means loading it in the prologue instead. Do you mean cpu-exec.c side with this? I think this is a very similar register tradeoff case to helpers. I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the env pointer into the TCG_AREG0 register (which is kept internal to TCG generated code). Maybe it would be easy to test and benchmark this change, it doesn't affect generated code or helpers. Yes it's something that can be benchmarked. From experience it won't make any significant performance change, it's basically two register moves (one in the caller, one in the prologue), for each TB which is not chained. In any case, it will introduce a slight overhead that has to be compensated by allowing the helpers to use an additional register. [1] http://www.sparc.com/standards/psABI3rd.pdf -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too. What other uses? It is needed for almost every TB, so there is not other possible use. Let's take an example, a simple TB from mips on x86_64: | IN: start_kernel | 0x80510958: jal 0x8051d50c | 0x8051095c: nop | | OP after liveness analysis: | movi_i32 ra,$0x80510960 | movi_i32 PC,$0x8051d50c | exit_tb $0x0 | end | OUT: [size=28] | 0x40f22960: mov $0x80510960,%ebp | 0x40f22965: mov %ebp,0x7c(%r14) | 0x40f22969: mov $0x8051d50c,%ebp | 0x40f2296e: mov %ebp,0x80(%r14) | 0x40f22975: xor %eax,%eax | 0x40f22977: jmpq 0x115042e x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Laurent
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92 a0 00 19 subcc %g0, %i1, %o1 2123c: 90 60 00 18 subc %g0, %i0, %o0 21240: 80 a2 a0 00 cmp %o2, 0 21244: 16 4f ff f2 bge %icc, 2120c __divdi3+0x2c 21248: a0 10 3f ff mov -1, %l0 2124c: a0 38 00 10 xnor %g0, %l0, %l0 21250: 96 a0 00 0b subcc %g0, %o3, %o3 21254: 10 6f ff ee b %xcc, 2120c __divdi3+0x2c 21258: 94 60 00 0a subc %g0, %o2, %o2 2125c: 01 00 00 00 nop This is libc from OpenBSD/Sparc64 4.9: $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l 40562 $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l 20384 $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l 10240 $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l 6606 $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l 3811 $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l 4 $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l 20 Not so great there either. Anyway, I don't see why keeping
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 2:37 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 9:15 AM, Blue Swirl blauwir...@gmail.com wrote: On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too. What other uses? It is needed for almost every TB, so there is not other possible use. Let's take an example, a simple TB from mips on x86_64: | IN: start_kernel | 0x80510958: jal 0x8051d50c | 0x8051095c: nop | | OP after liveness analysis: | movi_i32 ra,$0x80510960 | movi_i32 PC,$0x8051d50c | exit_tb $0x0 | end | OUT: [size=28] | 0x40f22960: mov $0x80510960,%ebp | 0x40f22965: mov %ebp,0x7c(%r14) | 0x40f22969: mov $0x8051d50c,%ebp | 0x40f2296e: mov %ebp,0x80(%r14) | 0x40f22975: xor %eax,%eax | 0x40f22977: jmpq 0x115042e x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable.
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92 a0 00 19 subcc %g0, %i1, %o1 2123c: 90 60 00 18 subc %g0, %i0, %o0 21240: 80 a2 a0 00 cmp %o2, 0 21244: 16 4f ff f2 bge %icc, 2120c __divdi3+0x2c 21248: a0 10 3f ff mov -1, %l0 2124c: a0 38 00 10 xnor %g0, %l0, %l0 21250: 96 a0 00 0b subcc %g0, %o3, %o3 21254: 10 6f ff ee b %xcc, 2120c __divdi3+0x2c 21258: 94 60 00 0a subc %g0, %o2, %o2 2125c: 01 00 00 00 nop This is libc from OpenBSD/Sparc64 4.9:
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable. What are the better uses, that justify to flush a register that is going to be used three or four host asm later? In the current generated code, roughly one every four instruction reference TCG_AREG0, so this register is really needed very often. If you think TCG will be faster by having one more register in between I suggest you to first optimize tcg_reg_alloc(), which simply spill a random register, even if they are some allocated register that won't be used until the end of the TB. You should also should check how often TCG spills a register (in which case it would have benefit from one more register). It happens less than 2000 times when booting an emulated mips system on x86_64, while more than 16 TB are generated. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92 a0 00 19 subcc %g0, %i1, %o1 2123c: 90 60 00 18 subc %g0, %i0, %o0 21240: 80 a2 a0 00 cmp %o2, 0 21244: 16 4f ff f2 bge %icc, 2120c __divdi3+0x2c 21248: a0 10 3f ff mov -1, %l0 2124c: a0
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable. What are the better uses, that justify to flush a register that is going to be used three or four host asm later? It would obviously replace something else determined by TCG. In the current generated code, roughly one every four instruction reference TCG_AREG0, so this register is really needed very often. If you think TCG will be faster by having one more register in between I suggest you to first optimize tcg_reg_alloc(), which simply spill a random register, even if they are some allocated register that won't be used until the end of the TB. You should also should check how often TCG spills a register (in which case it would have benefit from one more register). It happens less than 2000 times when booting an emulated mips system on x86_64, while more than 16 TB are generated. Right, on a modern CPU with lots of registers, one additional register won't be helpful, but on i386 the situation should be very different, there are very few registers.
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92 a0 00 19 subcc %g0, %i1, %o1 2123c: 90 60 00 18 subc %g0,
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable. What are the better uses, that justify to flush a register that is going to be used three or four host asm later? It would obviously replace something else determined by TCG. The register will be free only for a few host instructions. Could you please give more concrete example about such a usage? In the current generated code, roughly one every four instruction reference TCG_AREG0, so this register is really needed very often. If you think TCG will be faster by having one more register in between I suggest you to first optimize tcg_reg_alloc(), which simply spill a random register, even if they are some allocated register that won't be used until the end of the TB. You should also should check how often TCG spills a register (in which case it would have benefit from one more register). It happens less than 2000 times when booting an emulated mips system on x86_64, while more than 16 TB are generated. Right, on a modern CPU with lots of registers, one additional register won't be helpful, but on i386 the situation should be very different, there are very few registers. On i386, I indeed get a lot more of spilled registers, that is 34. Still that number is not that high, it's less than two times per TB. If we consider that these register spills are pure loss (which is not always the case, sometime the spilled register is actually never used later, so it's just an anticipated save), that's 4 load/store per TB. It means to compensate, the env register should not be loaded more than 4 times in a TB, which looks like quite difficult to achieve given how often this register is used. Please also note that spilling globals currently need access to the env pointer, which might not be loaded, so another register spill is need to load it. This will make the code a lot more complex than now to avoid a deadlock (probably by spilling local temps first). -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228:
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 5:03 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues laurent.desnog...@gmail.com wrote: On Sun, May 15, 2011 at 1:33 PM, Blue Swirl blauwir...@gmail.com wrote: [...] x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable. What are the better uses, that justify to flush a register that is going to be used three or four host asm later? It would obviously replace something else determined by TCG. The register will be free only for a few host instructions. Could you please give more concrete example about such a usage? In the current generated code, roughly one every four instruction reference TCG_AREG0, so this register is really needed very often. If you think TCG will be faster by having one more register in between I suggest you to first optimize tcg_reg_alloc(), which simply spill a random register, even if they are some allocated register that won't be used until the end of the TB. You should also should check how often TCG spills a register (in which case it would have benefit from one more register). It happens less than 2000 times when booting an emulated mips system on x86_64, while more than 16 TB are generated. Right, on a modern CPU with lots of registers, one additional register won't be helpful, but on i386 the situation should be very different, there are very few registers. On i386, I indeed get a lot more of spilled registers, that is 34. Still that number is not that high, it's less than two times per TB. If we consider that these register spills are pure loss (which is not always the case, sometime the spilled register is actually never used later, so it's just an anticipated save), that's 4 load/store per TB. It means to compensate, the env register should not be loaded more than 4 times in a TB, which looks like quite difficult to achieve given how often this register is used. Please also note that spilling globals currently need access to the env pointer, which might not be loaded, so another register spill is need to load it. This will make the code a lot more complex than now to avoid a deadlock (probably by spilling local temps first). OK, this doesn't look so attractive after all.
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 5:06 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno aurel...@aurel32.net wrote: On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno aurel...@aurel32.net wrote: I still don't get where are this list of possible changes? Did I miss something in another thread? I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options 000211e0 __divdi3: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 __divdi3+0x58 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 __divdi3+0x70 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 __ashldi3+0x40 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c __divdi3+0x4c 21220: 86 10 00 09 mov %o1, %g3
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. In any case at then end benchmarks is what are need to decided, TCG has always shown that performance improvements doesn't match the improvement analysis. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other cases, the number of registers used does not change. Moving the registers around is what worries me too. But there are other effects too, the helpers are now compiled so that the global env register is not used. Especially on hosts with low number of registers this is not optimal. In any case at then end benchmarks is what are need to decided, TCG has always shown that performance improvements doesn't match the improvement analysis. If this turns out to be a bad idea, it means that the reverse conversion will be beneficial and we should convert helper.c code to op_helper.c and take advantage of the global env in a register. This was actually my standpoint when the discussion started, but I'm interested to see if this approach would work better. There are other opportunities as well, for example while refactoring, qemu_ld/st would be changed. Maybe your constant patches would be useful there or they could benefit from a rearranged interface.
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno aurel...@aurel32.net wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. cases, the number of registers used does not change. Moving the registers around is what worries me too. But there are other effects too, the helpers are now compiled so that the global env register is not used. Especially on hosts with low number of registers this is not optimal. Most helpers are very small functions, so I am not sure more registers will help. In any case at then end benchmarks is what are need to decided, TCG has always shown that performance improvements doesn't match the improvement analysis. If this turns out to be a bad idea, it means that the reverse conversion will be beneficial and we should convert helper.c code to op_helper.c and take advantage of the global env in a register. This was actually my standpoint when the discussion started, but I'm interested to see if this approach would work better. Does it mean that you plan to do the code changes in git without really benchmarking, and revert all the changes later if it was a bad idea? -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
On Sat, May 14, 2011 at 11:16:16PM +0200, Aurelien Jarno wrote: On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the env register to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. One way to solve that would be to use a env register only at the TCG level, but not at the GCC level. That means loading the value of env into TCG_AREG0 in the prologue. That way, all the TCG code will have direct access to env, and the GCC generated code (which includes the helper) don't need to have this register reserved. I doubt the latter will increase the performance, but I understand the cleanliness argument. This also means minimal code changes, especially as most (if not all) TCG targets seems to have TCG_AREG0 as a callee saved arguments. Basically the path to do the changes could be: - Make sure all targets have TCG_AREG0 as a callee saved register - Load env into TCG_AREG0 (and rename it as a TCG_ENV?) in the prologue (could be by passing it as argument to the prologue code) - Change all helpers to not use env directly - Change softmmu code to not use env directly - Remove GCC hack to reserve a register for env. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net