On 07/10/2018 05:09 AM, colinb2 . wrote: > Feel free to copy this email and attachment to anyone who might be interested. > I'm very happy to answer any questions anyone has. > The program can be compiled and run like this on Linux with GNU GCC: > gcc -O2 -o expmed2.exe expmed2.c > ./expmed2.exe > > This email deals with making part of the GNU GCC compiler - integer division > by a constant divisor - faster. (The calculation of the parameters for the > machine code will be faster; compiled programs won't run faster.) > Further down I mention inequality (1) which can be used to make the LLVM > compiler somewhat faster, because that currently uses code based on (2). > I don't know what - if anything - the Java JVM uses for this, or how other > compilers do this, but these ideas may be useful there. > > By significantly faster I mean I have benchmarked alternative versions of > choose_multiplier which on my low specification netbook can take maybe less > than > half the time the current version takes. Time saved in compilation is much > less > important than time saved in running compiled programs, but the code for the > faster versions is about the same length as the code for the current version, > and is only a bit more complicated, so is worth considering? Improvements to compile time behavior are always worth considering.
> > Licencing: in Fractint people's words "Don't want money, want recognition!" > The version choose_multiplier_v2 is based - but improves - on what's in > the GCC choose_multiplier function in file expmed.c, so the GCC licence. > The version choose_multiplier_v4 is based - but improves - on magicu2 in > "Hacker's Delight", so the licence is you needn't acknowledge the source > but it would be nice to credit the code as from > magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org > with improvements by Colin Bartlett <coli...@gmail.com>. > This latter also applies to choose_multiplier_power2_divrem because that > is also an obvious (?) idea from magicu2 in "Hacker's Delight". */ So a key issue with GCC is that for nontrivial code to be included in GCC, the code's author must assign their copyright ownership to the FSF. > > The idea is using "wide int" seems slow compared to "unsigned HOST_WIDE_INT", > so it makes sense to avoid using "wide int" as much as possible. Sure, but the use of wide_int types has certain advantages and if possible we'd like that class to be just as efficient as a HOST_WIDE_INT for common operations on 32 and 64 bit types. In fact, for 32 and 64 bit types, wide_int is supposed to generate the same code as HOST_WIDE_INT. At least that was the state in 2013 when wide_int was introduced. So it may be worth spending some time figuring out why changing the types changes the performance. If you're going to argue to use HOST_WIDE_INT, then you'll have to think about how you're going to support 128bit or wider target data types which is the whole point behind the introduction of wide_int. Anyway, I think these higher level questions/concerns need to be addressed before we can reasonably discuss replacing wide_int with HOST_WIDE_INT. WRT algorithmic changes. ISTM it would be best to address those separately from the wide_int vs HOST_WIDE_INT discussion. An algorithmic change should have the same impact regardless of whether we're using wide_int or HOST_WIDE_INT. We can easily > rewrite choose_multiplier to only use "wide int" to calculate the initial > mlow; > this is choose_multiplier_v2. An alternative for choose_multiplier_v2 > completely > avoids using "wide int" by iterating upwards for the initial mlow, but if that > benchmarks as faster than using "wide int" even once (I suspect it might) then > just iterating upwards may even be a bit faster; this is choose_multiplier_v4. So the most obvious take-away from this is to answer the question of whether or not iterating upwards for the initial mlow is a win or not. If so, that might be able to go forward independent of any other changes. > > The attachment is self-contained, and I have checked that the values produced > agree with a "Hacker's Delight" table of M/2^s for small d and n=precision=32. Rather than attaching new implementations, the generally preferred packaging is diffs against the trunk. Jeff