Oh, sorry, a small mistake. Compacting it to a *single* loop while still being faster than gcc may be harder than it looks. This is because the part that writes a bit of the result (a single "ori" in my code) may explode into a full 32-bit operation (e.g. increment "q" if the bit needs to be set, then shift the whole "q" left). But perhaps it could be done faster by involving memory access (keeping a pointer to the current byte of "q").
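To make the memory-access idea concrete, here is an untested C sketch of the single-loop variant: the quotient bit is written through a byte pointer with an 8-bit mask, so setting a bit stays a byte operation, and the pointer/mask bookkeeping only changes once every 8 iterations. The function name is made up, and whether this actually beats shifting the whole "q" would have to be checked in the generated asm:

    #include <stdint.h>

    /* Untested sketch of the single-loop idea: the quotient bit is written
     * through a byte pointer, so per iteration only an 8-bit OR is needed
     * instead of shifting the whole 32-bit q. AVR is little-endian, so
     * byte 3 of q holds bits 31..24. */
    static uint32_t divide_single_loop (uint32_t n, uint32_t d)
    {
        uint32_t q = 0;
        uint32_t r = 0;                    /* partial remainder */
        uint8_t *qp = (uint8_t *)&q + 3;   /* current quotient byte, MSB first */
        uint8_t mask = 0x80;               /* current quotient bit within that byte */

        for (uint8_t i = 0; i < 32; i++) {
            uint8_t carry = (uint8_t)(r >> 31);   /* bit shifted out of r */
            r = (r << 1) | (n >> 31);             /* shift next dividend bit into r */
            n <<= 1;
            if (carry || r >= d) {                /* 33-bit compare, like cp/cpc with __tmp_reg__ */
                r -= d;
                *qp |= mask;                      /* the byte-sized quotient-bit write */
            }
            mask >>= 1;
            if (mask == 0) {                      /* every 8 bits, move to the next lower byte */
                mask = 0x80;
                qp--;
            }
        }
        return q;
    }
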
On Sat, Jun 8, 2013 at 6:15 PM, Ambroz Bizjak <ambr...@gmail.com> wrote:
> I've done some further optimization; it's now even faster and uses 4
> fewer registers (but needs >=r16 for the result due to the use of
> immediate operands). Even better, there are now only 4 kinds of
> iterations (4 macros). The code is here http://ideone.com/EGWcuI and
> pasted below.
>
> I've calculated that my division takes 384 instructions in the worst
> case. I've also disassembled a compiled program where gcc did the
> division (-O4, the attached program with only the main function
> changed), see here http://ideone.com/jZNxkF . I see that the call to
> __udivmodsi4 really hasn't been inlined, but the extra instructions
> are almost negligible. By my count the gcc division takes 653
> instructions in the worst case (may be a bit off), excluding the call
> entry and exit. Put simply, my code takes about 58% of the time of
> gcc's division. When I said 2/3 previously, I was referring to
> relative times, not probabilities, but that was measured in a
> benchmark, so my code appeared relatively slower than it really is
> due to the testing code.
>
> The sources of speedup in my algorithm seem to be:
> 1. The high-level algorithm. In each iteration, gcc does four 32-bit
>    operations (add, compare, subtract, add) in the worst case. I only
>    do three (shift, compare, subtract).
> 2. Optimization of groups of 8 iterations. This allows removing some
>    instructions where some operands are known to be zero.
> 3. Unrolling individual iterations.
>
> About code size: yes, the algorithm I attached here is indeed much
> larger. However, I believe we can compact it into four loops and
> still be faster than gcc (points 1 and 2 above would still apply). I
> think the result would be about twice the size of gcc's algorithm.
> Even compacting it to one loop should still be faster than gcc due to
> point 1.
>
> I don't really know when a faster but larger algorithm should be used
> by gcc; this is part of the reason why I'm posting here. The -Os flag
> is probably one thing to look for, but we need to consider that
> people may not want giant division code even at other optimization
> levels. Perhaps more specific flags could be used to control the
> division algorithm? On the other hand, if the large algorithm goes
> into an inline function in avr-libc, how do we make sure people can
> find it?
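For reference, the plain restoring-division recurrence behind point 1 looks roughly like this in C: one shift, one compare, and a conditional subtract per bit. This is only an illustration with a made-up function name; it is neither the code gcc emits nor a drop-in replacement for the asm pasted below, which keeps the remainder in __tmp_reg__ and in bytes of [n] freed by the shifts, and sets each quotient bit with a single "ori" because each unrolled step knows which bit of which byte it targets:

    #include <stdint.h>

    /* Plain-C restoring division, reference only: one shift, one compare
     * and a conditional subtract per iteration. */
    static uint32_t divide_ref (uint32_t n, uint32_t d)
    {
        uint32_t q = 0;   /* quotient, built MSB first */
        uint32_t r = 0;   /* partial remainder */

        for (uint8_t i = 0; i < 32; i++) {
            uint8_t carry = (uint8_t)(r >> 31);   /* the "33rd" remainder bit */
            r = (r << 1) | (n >> 31);             /* shift: next dividend bit into r */
            n <<= 1;
            q <<= 1;
            if (carry || r >= d) {                /* compare (cp/cpc chain in the asm) */
                r -= d;                           /* subtract */
                q |= 1;                           /* this quotient bit is 1 */
            }
        }
        return q;                                 /* r holds the remainder here */
    }
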
>
> #include <stdint.h>
>
> #define DIVIDE_ITER_0_7(i) \
> " lsl %D[n]\n" \
> " rol __tmp_reg__\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc __zero_reg__,%B[d]\n" \
> " cpc __zero_reg__,%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " ori %D[q],1<<(7-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_8_15(i) \
> " lsl %C[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc __zero_reg__,%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " ori %C[q],1<<(15-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_16_23(i) \
> " lsl %B[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " rol %C[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc %C[n],%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " sbc %C[n],%C[d]\n" \
> " ori %B[q],1<<(23-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_24_30(i) \
> " lsl %A[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " rol %C[n]\n" \
> " rol %B[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc %C[n],%C[d]\n" \
> " cpc %B[n],%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " sbc %C[n],%C[d]\n" \
> " sbc %B[n],%D[d]\n" \
> " ori %A[q],1<<(31-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> static inline uint32_t divide (uint32_t n, uint32_t d)
> {
>     uint32_t q = 0;
>
>     asm(
>         "clr __tmp_reg__\n"
>         DIVIDE_ITER_0_7(0)
>         DIVIDE_ITER_0_7(1)
>         DIVIDE_ITER_0_7(2)
>         DIVIDE_ITER_0_7(3)
>         DIVIDE_ITER_0_7(4)
>         DIVIDE_ITER_0_7(5)
>         DIVIDE_ITER_0_7(6)
>         DIVIDE_ITER_0_7(7)
>         DIVIDE_ITER_8_15(8)
>         DIVIDE_ITER_8_15(9)
>         DIVIDE_ITER_8_15(10)
>         DIVIDE_ITER_8_15(11)
>         DIVIDE_ITER_8_15(12)
>         DIVIDE_ITER_8_15(13)
>         DIVIDE_ITER_8_15(14)
>         DIVIDE_ITER_8_15(15)
>         DIVIDE_ITER_16_23(16)
>         DIVIDE_ITER_16_23(17)
>         DIVIDE_ITER_16_23(18)
>         DIVIDE_ITER_16_23(19)
>         DIVIDE_ITER_16_23(20)
>         DIVIDE_ITER_16_23(21)
>         DIVIDE_ITER_16_23(22)
>         DIVIDE_ITER_16_23(23)
>         DIVIDE_ITER_24_30(24)
>         DIVIDE_ITER_24_30(25)
>         DIVIDE_ITER_24_30(26)
>         DIVIDE_ITER_24_30(27)
>         DIVIDE_ITER_24_30(28)
>         DIVIDE_ITER_24_30(29)
>         DIVIDE_ITER_24_30(30)
>         " lsl %A[n]\n"
>         " rol __tmp_reg__\n"
>         " rol %D[n]\n"
>         " rol %C[n]\n"
>         " rol %B[n]\n"
>         " cp __tmp_reg__,%A[d]\n"
>         " cpc %D[n],%B[d]\n"
>         " cpc %C[n],%C[d]\n"
>         " cpc %B[n],%D[d]\n"
>         " sbci %A[q],-1\n"
>
>         : [q] "=&a" (q),
>           [n] "=&r" (n)
>         : "[q]" (q),
>           "[n]" (n),
>           [d] "r" (d)
>     );
>
>     return q;
> }
>
> // instructions:
> // 4 (init q) + 1 + 8 * 9 + 8 * 11 + 8 * 13 + 7 * 15 + 10 = 384
>
> volatile uint32_t test1;
> volatile uint32_t test2;
> volatile uint32_t test3;
>
> int main ()
> {
>     test3 = divide(test1, test2);
>     //test3 = test1 / test2;
> }
>
> On Sat, Jun 8, 2013 at 4:31 PM, Weddington, Eric
> <eric.wedding...@atmel.com> wrote:
>>
>>
>>> -----Original Message-----
>>> From: avr-libc-dev-bounces+eric.weddington=atmel....@nongnu.org
>>> [mailto:avr-libc-dev-bounces+eric.weddington=atmel....@nongnu.org] On
>>> Behalf Of Ambroz Bizjak
>>> Sent: Friday, June 07, 2013 8:34 PM
>>> To: avr-libc-dev@nongnu.org
>>> Subject: [avr-libc-dev] Faster integer division
>>>
>>> Hi!
>>>
>>> I've found 32-bit integer division in gcc too slow, and I managed to
>>> implement division in asm that's faster than what gcc 4.8 produces
>>> at -O4, at about 2/3 the time (but I'm not sure if any of this is
>>> due to inlining). The algorithm is restoring division, but unrolled
>>> and heavily optimized. Could this go into gcc or avr-libc, wherever
>>> the right place is?
>>>
>>
>> Hi Ambroz,
>>
>> What's difficult is to balance the different needs of different
>> users. For the most part, and there are always exceptions to this,
>> most AVR GCC users are interested in code size rather than speed.
>> They would rather see the smallest way to do division, even if it is
>> slower. But I also recognize that there are definitely times when
>> speed is the most desired.
>>
>> So given that, where do you think it is best for your algorithm to be
>> generated? Should it be generated at a specific optimization level in
>> gcc? Should it be a specialized inline-assembly function call in
>> avr-libc that is specifically called by the user?
>>
>> You said that it's faster than what gcc produces, at about 2/3 of the
>> time, but you're not sure if that is due to inlining. Can you do some
>> further testing to see if it's due to inlining, or whether it's
>> because your algorithm is better? This is important to know, to see
>> if it's worth the time and effort to get it into gcc or avr-libc.
>>
>> Thanks,
>> Eric
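One way to separate the effect of inlining from the algorithm itself would be to call both divisions through noinline wrappers, so neither benefits from inlining, and count cycles with a free-running timer. This is only a rough sketch: it assumes an ATmega-class part with Timer1 clocked at F_CPU, the divide() routine quoted above being defined in the same file, and made-up wrapper and variable names:

    #include <stdint.h>
    #include <avr/io.h>

    /* Hypothetical harness: both divisions go through noinline wrappers, so
     * any remaining difference comes from the algorithms, not from inlining.
     * Assumes divide() from the code above is defined in this file. */
    static uint32_t __attribute__((noinline)) div_asm (uint32_t n, uint32_t d)
    {
        return divide(n, d);
    }

    static uint32_t __attribute__((noinline)) div_gcc (uint32_t n, uint32_t d)
    {
        return n / d;              /* ends up as a call to __udivmodsi4 */
    }

    volatile uint32_t in_n = 123456789, in_d = 1000, out;
    volatile uint16_t cycles_asm, cycles_gcc;

    int main (void)
    {
        TCCR1B = (1 << CS10);      /* Timer1 free-running at the CPU clock */

        uint16_t t0 = TCNT1;
        out = div_asm(in_n, in_d);
        cycles_asm = (uint16_t)(TCNT1 - t0);   /* both measurements include the same call overhead */

        t0 = TCNT1;
        out = div_gcc(in_n, in_d);
        cycles_gcc = (uint16_t)(TCNT1 - t0);

        for (;;) {}
    }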