As can be seen in the mailing list thread that added hardfloat support to QEMU [1], a requirement for it to work is that float_flag_inexact is already set when entering the API in softfloat.c. However, the same thread explains that the PPC target would not work by default with this implementation.

The problem is that PPC has a non-sticky inexact bit (there is a discussion about it in [2]), meaning that we can't just set the flag and call the API in softfloat.c: it would return with the same flag still set to 1, and we wouldn't know whether it is supposed to be updated in FPSCR or not.

Over the last couple of years there have been attempts to enable hardfpu for Power, like [3], but nothing got to master. [5] shows a suggestion by Yonggang Luo, with comments by Richard and Zoltan, about caching the last FP instruction and reexecuting it when necessary.
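To make the problem and the cache-and-reexecute idea from [5] more concrete, here is a minimal standalone sketch. It is a toy model and not QEMU code: ToyEnv and every toy_* function are hypothetical names invented for illustration, only addition is modeled, only the FI bit is tracked, and the "soft" path detects rounding with Knuth's TwoSum (assuming finite operands and no overflow) instead of calling softfloat.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in QEMU. */
typedef struct ToyEnv {
    bool fi;            /* models FPSCR[FI]: non-sticky, set by the last FP insn */
    bool cache_valid;   /* a hard-path insn still has to be reexecuted */
    double cached_a;    /* saved "context" of that insn: its operands */
    double cached_b;
} ToyEnv;

/*
 * "Soft" path: add and report whether the result was rounded.
 * TwoSum recovers the exact rounding error of a + b
 * (assumption: finite inputs, no overflow, no fast-math).
 */
static double toy_soft_add(double a, double b, bool *inexact)
{
    volatile double s = a + b;
    volatile double bv = s - a;
    volatile double err = (a - (s - bv)) + (b - bv);

    *inexact = (err != 0.0);
    return s;
}

/*
 * "Hard" path: the host FPU computes the result, but since the inexact
 * flag was artificially set beforehand, nothing can be learned about FI
 * here. The operands are cached so FI can be computed later.
 */
static double toy_hard_add(ToyEnv *env, double a, double b)
{
    env->cached_a = a;
    env->cached_b = b;
    env->cache_valid = true;
    return a + b;
}

/* Models an mffs-like read: reexecute the cached insn, then report FI. */
static bool toy_read_fi(ToyEnv *env)
{
    if (env->cache_valid) {
        bool inexact;

        (void)toy_soft_add(env->cached_a, env->cached_b, &inexact);
        env->fi = inexact;          /* non-sticky: simply overwritten */
        env->cache_valid = false;
    }
    return env->fi;
}

/* Usage: only the most recent cached insn decides the value of FI. */
static void toy_demo(void)
{
    ToyEnv env = { 0 };

    toy_hard_add(&env, 1.0, 1e-30);     /* rounded, but overwritten below */
    toy_hard_add(&env, 1.0, 2.0);       /* exact */
    assert(toy_read_fi(&env) == false);
}
```

In this model two additions run on the fast path and only one softfloat-style reexecution happens, at the moment FI is actually read. A real implementation would also have to cover the other FPSCR bits and every FP instruction, which this sketch deliberately leaves out.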
This patch set is a proposal built on the idea of caching the last FP insn, to be reexecuted later when the value of FPSCR is to be read by a program. When an instruction is executed in hardfloat, its "context" is saved inside `env`, and it is expected to be reexecuted later, in softfloat, to calculate the correct value of the inexact flag in FPSCR.

The instruction to be cached is the last instruction that changes FI. If an instruction does not change FI, it keeps the cache intact. If it changes FI, it caches itself and tries to execute in hardfpu. It might or might not use hardfloat, but as the inexact flag was artificially set, it will need to be reexecuted later.

'Later' means when FPSCR is to be read, like during a call to MFFS, or when a signal occurs. There are probably other places, e.g. other mffs-like instructions, but this RFC only addresses these two scenarios. This is supposed to be more efficient because programs very seldom read FPSCR, meaning the number of reexecutions will be low.

For now, this was implemented and tested for linux-user only; no softmmu work or analysis was done.

I implemented the base code to keep all instructions working with this new behavior (patch 1), and also implemented some instructions as an example of what would be necessary for every instruction to use hardfpu (patches 2, 3 and 4).

My tests with risu and other manual tests showed the behavior seems to be correct. I mainly tested whether FPSCR is the same after using softfloat or hardfloat.

In v1 of this RFC I reported a performance regression with the implementation. However, the test I crafted [4] was supposed to be a mix of many hardfloats and some softfloat fallbacks (instructions fall back to softfloat in special cases, e.g. a negative argument for sqrt). What was actually happening was a huge number of fallbacks and not many hardfloats.
The expected 'normal scenario' is to have a lot of valid, 'happy path' instructions that can use hardfloat. So, for v2 I created two tests: one that hits hardfloat 100% of the time, and one that falls back to softfloat 100% of the time. I present the results below. The tests are not comparable with each other, nor with the previous one from v1, so each is supposed to be analyzed on its own.

100% hardfloat (1:1 mix of fsqrt and fmadd) [6]
|                 | min [s] | max [s] | avg [s] |
| before (master) | 30.731  | 31.420  | 31.186  |
| after changes   | 20.860  | 21.100  | 20.989  |
(approx. 1.5x speedup)

100% softfloat (1:1 mix of fsqrt and fmadd) [7]
|                 | min [s] | max [s] | avg [s] |
| before (master) | 22.684  | 23.152  | 22.868  |
| after changes   | 25.098  | 25.397  | 25.281  |
(approx. 0.9x of old performance)

This is way better than what I previously reported, and is a result that might justify going forward with this idea. The only problem is the performance impact when hardfloat cannot be used. I expect that most real-life use cases will hit hardfloat almost 100% of the time, so this might not be a big issue. Opinions on this?

You can see that I added a new commit to this RFC, implementing the idea also for add, sub, mul, and div. I ran the old test with this new commit, and the result was not better. So the new patch is not responsible for the performance gain; the old test itself was bad.

As I did not test the code in softmmu or bsd-user (does bsd-user work for PPC?), I added some build-time checks to only enable this RFC for linux-user. I'm pretty confident that making this work for softmmu will need changes in other places in the code, but I'm focusing on linux-user for now.

Thank you very much!

[1] https://patchwork.kernel.org/project/qemu-devel/patch/20181124235553.17371-8-c...@braap.org/
[2] https://lists.nongnu.org/archive/html/qemu-ppc/2022-05/msg00246.html
[3] https://patchwork.kernel.org/project/qemu-devel/patch/20200218171702.979f0746...@zero.eik.bme.hu/
[4] https://gist.github.com/vcoracolombo/6ad884a402f1bba531e2e3da7e196656
[5] https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg00064.html
[6] https://gist.github.com/vcoracolombo/f0d8b7c9f1cb63dac6ff0221209ec4ff
[7] https://gist.github.com/vcoracolombo/4b592644517c0efb3854872a4b30f6cc

Víctor Colombo (5):
  target/ppc: prepare instructions to work with caching last FP insn
  target/ppc: Implement instruction caching for fsqrt
  target/ppc: Implement instruction caching for muladd
  target/ppc: Implement instruction caching for add/sub/mul/div
  target/ppc: Enable hardfpu for Power

 fpu/softfloat.c                    |  10 +-
 target/ppc/cpu.h                   |  37 ++++++
 target/ppc/excp_helper.c           |   2 +
 target/ppc/fpu_helper.c            | 186 +++++++++++++++++++++++++++++
 target/ppc/helper.h                |   1 +
 target/ppc/translate/fp-impl.c.inc |   1 +
 6 files changed, 233 insertions(+), 4 deletions(-)

--
2.25.1