This code is now available in a GitHub repo: https://github.com/ambrop72/avr-asm-ops
I've implemented the partially unrolled variants with four loops. The
32-bit by 32-bit division done this way (div_32_32_small) takes 510
cycles in the worst case (div_32_32_large takes 384, gcc takes 653). I
think the _small variants would be appropriate for gcc to use in the
absence of -Os. I don't know how to measure code size properly, but
instruction-wise, div_32_32_small is 62 instructions versus gcc's 24.

On Sat, Jun 8, 2013 at 6:25 PM, Ambroz Bizjak <ambr...@gmail.com> wrote:
> Oh, sorry, a small mistake. Compacting it to a *single* loop while
> still being faster than gcc may be harder than it looks. This is
> because the part where a bit of the result is written (a single "ori"
> in my code) may explode into a full 32-bit operation (e.g. increment
> "q" if the bit needs to be set, then shift the whole "q" left). But
> perhaps it could be done faster by involving memory access (keeping a
> pointer to the current byte in "q").
>
> On Sat, Jun 8, 2013 at 6:15 PM, Ambroz Bizjak <ambr...@gmail.com> wrote:
>> I've done some further optimization; it's now even faster and uses
>> four fewer registers (but needs >=r16 for the result due to the use
>> of immediate operands). Even better, there are now only four kinds
>> of iterations (four macros). The code is here http://ideone.com/EGWcuI
>> and pasted below.
>>
>> I've calculated that my division takes 384 instructions in the worst
>> case. I've also disassembled a compiled program where gcc did the
>> division (-O4, the attached program, just with a changed main
>> function), see here http://ideone.com/jZNxkF . I see that the call to
>> __udivmodsi4 really hasn't been inlined, but the extra instructions
>> are almost negligible. By my calculations the gcc division takes 653
>> instructions in the worst case (this may be a bit off), excluding the
>> call entry and exit. Put simply, my code takes 58% of the time of
>> gcc's division. When I said 2/3 previously, I was referring to
>> relative times, not probabilities, but that was in a benchmark, so it
>> appeared slower due to the testing code.
>>
>> The sources of speedup in my algorithm seem to be:
>> 1. The high-level algorithm. In each iteration, gcc does four 32-bit
>>    operations (add, compare, subtract, add) in the worst case. I only
>>    do three (shift, compare, subtract).
>> 2. Optimization of groups of 8 iterations. This allows removing some
>>    instructions where some operands are known to be zero.
>> 3. Unrolling individual iterations.
>>
>> About code size: yes, the algorithm I attached here is indeed much
>> larger. However, I believe we can compact it into four loops and
>> still be faster than gcc (points 1 and 2 above would still apply). I
>> think the result would be about twice the size of the gcc algorithm.
>> Even compacting it to one loop should still be faster than gcc, due
>> to point 1.
>>
>> I don't really know when a faster but larger algorithm should be used
>> by gcc; this is part of the reason why I'm posting here. The -Os flag
>> is probably one thing to look for, but we need to consider that
>> people may not want giant division code even at other optimization
>> levels. Perhaps more specific flags could be used to control the
>> division algorithm? On the other hand, if the large algorithm goes
>> into an inline function in avr-libc, how do we make sure people can
>> find it?
>>
>>
>> #include <stdint.h>
>>
>> #define DIVIDE_ITER_0_7(i) \
>> " lsl %D[n]\n" \
>> " rol __tmp_reg__\n" \
>> " cp __tmp_reg__,%A[d]\n" \
>> " cpc __zero_reg__,%B[d]\n" \
>> " cpc __zero_reg__,%C[d]\n" \
>> " cpc __zero_reg__,%D[d]\n" \
>> " brcs zero_bit_" #i "__%=\n" \
>> " sub __tmp_reg__,%A[d]\n" \
>> " ori %D[q],1<<(7-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_8_15(i) \
>> " lsl %C[n]\n" \
>> " rol __tmp_reg__\n" \
>> " rol %D[n]\n" \
>> " cp __tmp_reg__,%A[d]\n" \
>> " cpc %D[n],%B[d]\n" \
>> " cpc __zero_reg__,%C[d]\n" \
>> " cpc __zero_reg__,%D[d]\n" \
>> " brcs zero_bit_" #i "__%=\n" \
>> " sub __tmp_reg__,%A[d]\n" \
>> " sbc %D[n],%B[d]\n" \
>> " ori %C[q],1<<(15-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_16_23(i) \
>> " lsl %B[n]\n" \
>> " rol __tmp_reg__\n" \
>> " rol %D[n]\n" \
>> " rol %C[n]\n" \
>> " cp __tmp_reg__,%A[d]\n" \
>> " cpc %D[n],%B[d]\n" \
>> " cpc %C[n],%C[d]\n" \
>> " cpc __zero_reg__,%D[d]\n" \
>> " brcs zero_bit_" #i "__%=\n" \
>> " sub __tmp_reg__,%A[d]\n" \
>> " sbc %D[n],%B[d]\n" \
>> " sbc %C[n],%C[d]\n" \
>> " ori %B[q],1<<(23-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_24_30(i) \
>> " lsl %A[n]\n" \
>> " rol __tmp_reg__\n" \
>> " rol %D[n]\n" \
>> " rol %C[n]\n" \
>> " rol %B[n]\n" \
>> " cp __tmp_reg__,%A[d]\n" \
>> " cpc %D[n],%B[d]\n" \
>> " cpc %C[n],%C[d]\n" \
>> " cpc %B[n],%D[d]\n" \
>> " brcs zero_bit_" #i "__%=\n" \
>> " sub __tmp_reg__,%A[d]\n" \
>> " sbc %D[n],%B[d]\n" \
>> " sbc %C[n],%C[d]\n" \
>> " sbc %B[n],%D[d]\n" \
>> " ori %A[q],1<<(31-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> static inline uint32_t divide (uint32_t n, uint32_t d)
>> {
>>     uint32_t q = 0;
>>
>>     asm(
>>         "clr __tmp_reg__\n"
>>         DIVIDE_ITER_0_7(0)
>>         DIVIDE_ITER_0_7(1)
>>         DIVIDE_ITER_0_7(2)
>>         DIVIDE_ITER_0_7(3)
>>         DIVIDE_ITER_0_7(4)
>>         DIVIDE_ITER_0_7(5)
>>         DIVIDE_ITER_0_7(6)
>>         DIVIDE_ITER_0_7(7)
>>         DIVIDE_ITER_8_15(8)
>>         DIVIDE_ITER_8_15(9)
>>         DIVIDE_ITER_8_15(10)
>>         DIVIDE_ITER_8_15(11)
>>         DIVIDE_ITER_8_15(12)
>>         DIVIDE_ITER_8_15(13)
>>         DIVIDE_ITER_8_15(14)
>>         DIVIDE_ITER_8_15(15)
>>         DIVIDE_ITER_16_23(16)
>>         DIVIDE_ITER_16_23(17)
>>         DIVIDE_ITER_16_23(18)
>>         DIVIDE_ITER_16_23(19)
>>         DIVIDE_ITER_16_23(20)
>>         DIVIDE_ITER_16_23(21)
>>         DIVIDE_ITER_16_23(22)
>>         DIVIDE_ITER_16_23(23)
>>         DIVIDE_ITER_24_30(24)
>>         DIVIDE_ITER_24_30(25)
>>         DIVIDE_ITER_24_30(26)
>>         DIVIDE_ITER_24_30(27)
>>         DIVIDE_ITER_24_30(28)
>>         DIVIDE_ITER_24_30(29)
>>         DIVIDE_ITER_24_30(30)
>>         " lsl %A[n]\n"
>>         " rol __tmp_reg__\n"
>>         " rol %D[n]\n"
>>         " rol %C[n]\n"
>>         " rol %B[n]\n"
>>         " cp __tmp_reg__,%A[d]\n"
>>         " cpc %D[n],%B[d]\n"
>>         " cpc %C[n],%C[d]\n"
>>         " cpc %B[n],%D[d]\n"
>>         " sbci %A[q],-1\n"
>>
>>         : [q] "=&a" (q),
>>           [n] "=&r" (n)
>>         : "[q]" (q),
>>           "[n]" (n),
>>           [d] "r" (d)
>>     );
>>
>>     return q;
>> }
>>
>> // instructions:
>> // 4 (init q) + 1 + 8 * 9 + 8 * 11 + 8 * 13 + 7 * 15 + 10 = 384
>>
>> volatile uint32_t test1;
>> volatile uint32_t test2;
>> volatile uint32_t test3;
>>
>> int main ()
>> {
>>     test3 = divide(test1, test2);
>>     //test3 = test1 / test2;
>> }
>>
>> On Sat, Jun 8, 2013 at 4:31 PM, Weddington, Eric
>> <eric.wedding...@atmel.com> wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: avr-libc-dev-bounces+eric.weddington=atmel....@nongnu.org
>>>> [mailto:avr-libc-dev-bounces+eric.weddington=atmel....@nongnu.org] On
>>>> Behalf Of Ambroz Bizjak
>>>> Sent: Friday, June 07, 2013 8:34 PM
>>>> To: avr-libc-dev@nongnu.org
>>>> Subject: [avr-libc-dev] Faster integer division
>>>>
>>>> Hi!
>>>>
>>>> I've found 32-bit integer division in gcc too slow, and I managed to
>>>> implement division in asm that's faster than what gcc 4.8 produces
>>>> at -O4, about 2/3 the time (but I'm not sure if any of this is due
>>>> to inlining). The algorithm is restoring division, but unrolled and
>>>> heavily optimized. Could this get into gcc or avr-libc, wherever the
>>>> right place is?
>>>>
>>>
>>> Hi Ambroz,
>>>
>>> What's difficult is to balance the different needs of different
>>> users. For the most part, and there are always exceptions to this,
>>> most of the AVR GCC users are interested in code size rather than
>>> speed. They would rather see the smallest way to do division, even
>>> if it is slower. But I also recognize that there are definitely
>>> times when speed is the most desired.
>>>
>>> So given that, where do you think it would be best for your
>>> algorithm to be generated? Should it be generated at a specific
>>> optimization level in gcc? Should it be a specialized
>>> inline-assembly function call in avr-libc that is specifically
>>> called by the user?
>>>
>>> You said that it's faster than what gcc produces about 2/3 of the
>>> time, but you're not sure if that is due to inlining. Can you do
>>> some further testing to see if it's due to inlining, or whether it's
>>> because your algorithm is better? This is important to know, to see
>>> if it's worth the time and effort to get it into gcc or avr-libc.
>>>
>>> Thanks,
>>> Eric

_______________________________________________
AVR-libc-dev mailing list
AVR-libc-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
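
For reference, a plain-C sketch of the restoring division that the
unrolled assembly above specializes. It only illustrates the
per-iteration structure from point 1 of the speedup list (shift,
compare, conditional subtract); it is not code from the repo, and
divide_ref is just an illustrative name. The assembly gets its extra
speed by splitting these 32 iterations into four groups of 8, so that
remainder bytes known to be zero need no cp/cpc/sbc instructions.

#include <stdint.h>

/* Reference restoring division, one bit per iteration, MSB first.
   Requires d != 0. Each iteration: shift the next numerator bit into
   the remainder, compare against the divisor, and if it fits,
   subtract and set the corresponding quotient bit. */
static uint32_t divide_ref (uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    uint32_t rem = 0;

    for (uint8_t i = 0; i < 32; i++) {
        rem = (rem << 1) | (n >> 31);   /* shift */
        n <<= 1;
        if (rem >= d) {                 /* compare */
            rem -= d;                   /* subtract */
            q |= (uint32_t)1 << (31 - i);
        }
    }
    return q;
}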
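
The 6:25 PM message above suggests that a single-loop version could
avoid touching the whole 32-bit quotient each iteration by keeping a
pointer to the current byte of "q". A rough sketch of that idea,
purely illustrative: the byte-pointer and mask bookkeeping here is a
guess at the approach, not code from the repo, and it assumes the
little-endian byte order of the AVR.

#include <stdint.h>

/* Single-loop sketch: write each quotient bit through a byte pointer
   and an 8-bit mask, so setting a bit stays an 8-bit operation
   instead of a 32-bit shift or OR. Assumes little-endian byte order
   (as on AVR). Requires d != 0. */
static uint32_t divide_one_loop (uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    uint32_t rem = 0;
    uint8_t *qp = (uint8_t *)&q + 3;  /* most significant byte of q */
    uint8_t mask = 0x80;

    for (uint8_t i = 0; i < 32; i++) {
        rem = (rem << 1) | (n >> 31);
        n <<= 1;
        if (rem >= d) {
            rem -= d;
            *qp |= mask;              /* 8-bit write of the quotient bit */
        }
        mask >>= 1;
        if (mask == 0) {              /* move to the next lower byte of q */
            mask = 0x80;
            qp--;
        }
    }
    return q;
}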