Georg-Johann Lay wrote:
> George Spelvin schrieb:
>> It is unfortunately one instruction longer than ultoa_invert:
>>
>>    text    data     bss     dec     hex filename
>>     188       0       0     188      bc ultoa_invert.o
>>     190       0       0     190      be itoa_engine.o
>>     208       0       0     208      d0 itoa_engine.o (+OPTIMIZE_SPEED)
>>     212       0       0     212      d4 itoa_engine.o (without HAVE_MUL)
> hmmm, that's quite some increase in code size...

Um... just how serious are you being?  I didn't think two bytes was
"quite some increase".  But compared to your alternatives... sigh, I may
have been wasting my time. :-(

> IMO octal conversion is so remote from any practical purposes that we
> should use an algorithm that works for all radices.

Well, I sort of did that.  I did one that works for all power-of-two
radices, and used it for both hex and octal.  Hex alone can be done much
faster, but it isn't bad with the more generic code (a rough C sketch of
the shift-and-mask idea is below).

...but you don't mean limited to powers of 2.  Just plain old binary
long division.  As I said when I started this, it's hard to escape
certain optimization habits.

> This can be done in less than 70 bytes (variable length / radix
> algorithm that operates on memory, no need for MUL, needs strrev.

H'm... let me have a poke at that.  The actual code changes are at the
end, but I found 90 cycles in one place, and 91 cycles + 1 instruction
in another.  With those changes, here's a timing table, in cycles, for
input values of the form 2^k - 1:

          Decimal            Hex          mod100
Input   mem_toa   itoa   mem_toa   itoa   mem_toa
 8 bit      293    217       194     98       194
16 bit      700    451       527    187       440
24 bit     1342    838      1008    276       748
32 bit     2119   1167      1637    365      1142
40 bit     3143   1649      2414    454      1697
48 bit     4290   2127      3339    543      2301
56 bit     5585   2733      4412    632      2991
64 bit     7115   3420      5633    721      3743

That's really attractive for decimal.  Also, for not much extra code, we
could do the long division in base 100 (the mod100 column above times
just that part) and then split each remainder into 2 digits; there's a
rough C sketch of that idea below, too.  For hex, it's a bit painful.

> So all is boiling down to the question what execution time is
> acceptable or not:
>
> Is 8000 ticks too slow?
>
> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
> because we have an algorithm that performs in 3000 ticks?

The answer, of course, is the same as to "Is 190 bytes too big?":
It Depends On The Application.  And as library writers, we don't *know*
the application.  For many applications, an AVR is too slow and they
need an ARM or a DSP.  "Best" is somewhere on the convex hull of the
space/time plot, but I can't say where.

I was designing a helper function for printf, and printf itself is a
certain size:

   text    data     bss     dec     hex filename
   1648       0       0    1648     670 ./avr/lib/avr4/vfprintf_flt.o
    948       0       0     948     3b4 ./avr/lib/avr4/vfprintf_std.o
    505       0       0     505     1f9 ./avr/lib/avr4/vfprintf_min.o

That, and the initial size of __ultoa_invert, kind of set the size scale
of what I was aiming for.

I have a faster algorithm that uses a 100-byte binary-to-BCD lookup
table to do everything two digits at a time.  Some applications might
want that.

> My strong preference is still to have a one-fits-all algorithm that
> might very well be slower than an optimal one. But hey, an ordinary
> division of a 64-bit value by 10 already costs 2300 cycles, so why
> should we hunt cycles just for printf...?

There is that.  But an intelligent programmer might only use 64 bits for
a final accumulation (of, say, 32x32-bit products), avoiding even 64-bit
multiplies.

> I played around some more; attached is a version that operates on
> memory, variable length and radices, and doesn't need MUL. It's
> called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
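Since the power-of-two path keeps coming up, here is roughly what it
looks like in C: digit extraction for hex, octal or binary is just
shift-and-mask, no division at all.  This is an illustration only -- the
function name and interface are made up, not the actual itoa_engine
code:

/* Sketch only: digits of a power-of-two radix (bits = 4 for hex,
   3 for octal, 1 for binary) by shift-and-mask.  Emits digits in
   reverse order, like __ultoa_invert, so the caller reverses them. */
#include <stdint.h>

static char *utoa_pow2(char *buf, uint32_t n, uint8_t bits)
{
    const uint8_t mask = (1u << bits) - 1;
    char *p = buf;

    do {
        uint8_t d = n & mask;
        *p++ = d < 10 ? '0' + d : 'a' + (d - 10);
        n >>= bits;
    } while (n);
    *p = '\0';
    return p;          /* points at the terminator; digits are reversed */
}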
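And the base-100 idea above, again only as a C illustration (the names
are mine, and the real thing would keep the bit-at-a-time reduction
instead of leaning on C's / and % by 100):

/* Divide a little-endian multi-byte number by 100 repeatedly and split
   each remainder into two decimal digits -- roughly half the division
   passes of dividing by 10. */
#include <stdint.h>

static void mem_to_dec100(char *buf, uint8_t *num, uint8_t nbytes)
{
    char *p = buf;
    uint8_t nonzero;

    do {
        uint16_t rem = 0;
        nonzero = 0;

        /* One long-division pass, most significant byte first:
           num /= 100, rem = old num % 100. */
        for (int i = nbytes - 1; i >= 0; i--) {
            uint16_t cur = (uint16_t)(rem << 8) | num[i];
            num[i] = cur / 100;
            rem = cur % 100;
            nonzero |= num[i];
        }
        /* Two digits per pass, least significant first. */
        *p++ = '0' + rem % 10;
        *p++ = '0' + rem / 10;
    } while (nonzero);

    if (p[-1] == '0' && p - buf > 1)   /* drop a leading zero */
        p--;
    *p = '\0';
    /* digits are reversed; finish with strrev(buf) */
}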
> The signed version adds yet another 40 bytes:

I made the following changes:

Since you already have pointers to the beginning and end of the string,
which is what the first half of strrev finds, a slight rearrangement of
that code (which saves space and 1 cycle per byte, at the expense of one
extra loop iteration for odd-length inputs) would let you jump into the
middle of it.  Copying that code to the end of this to make my
benchmarking simpler, I get 7296 cycles to print 2^64-1.

One speedup that leaps to mind is that you can negate the radix and do
the trial subtraction using an add.  That produces the quotient bit with
the right polarity, and it moves one instruction out of the middle loop
into the setup code.  That saves 90 cycles, taking it to 7206.

Another possibility is a bit-shifting trick.  It's of limited use here
because it messes up your trick of using Num for both the input number
and the output quotient.  But consider the following, with "nBits"
renamed to "Quot".  It takes another cycle out of the second loop,
saving another 91 cycles (7115).  It starts "Quot" out equal to 1, and
detects the completion of 8 bit shifts when that bit ends up in the
carry flag.  So "rol Quot" replaces "dec nBits", and thereby avoids the
need for the second "rol Num" just before the store.

DEFUN mem_toa
    wmov    pBuf, 24
    wmov    pNum, 22
    neg     Radix           ;; negated once, so the trial subtract below is an add

.LnextDigit:
    add     pNum, nBytes
    adc     pNum+1, __zero_reg__
    mov     Count, nBytes
    ;; nBytes < 0  <==>  nBytes is unknown
    ldi     nBytes, -1
    clr     Rem

.LnextByte:
    ld      Num, -X
    ldi     Quot, 1         ;; sentinel bit: pops into C after 8 shifts

.LnextBit:
    ;; Bit-wise construct quotient into Quot.
    ;; Bit-wise reconstruct Num into Rem.
    lsl     Num
    rol     Rem
    ;; Reducing Rem mod Radix;  C = 1 <==> reduction applied
    add     Rem, Radix
    brcs    1f
    sub     Rem, Radix
1:  rol     Quot
    brcc    .LnextBit

    st      X, Quot
    breq    2f
    sbrc    nBytes, 7
    mov     nBytes, Count   ;; first non-zero quotient byte sets the new length
2:  dec     Count
    brne    .LnextByte

    ;; Remainder is the next digit; convert to ASCII.
    subi    Rem, -'0'
    cpi     Rem, '9' + 1
    brlt    3f
    subi    Rem, '9' + 1 - 'a'
3:  st      Z+, Rem

    sbrs    nBytes, 7
    rjmp    .LnextDigit

    st      Z, __zero_reg__

    ;; pBuf survived in R25:R24
    # XJMP  strrev
    ;; Inlined strrev core (for benchmarking): swap bytes while X < Z.
    wmov    pNum, 24
2:  ld      Num, X
    ld      Count, -Z
    st      X+, Count
    st      Z, Num
    /* Loop while X < Z */
3:  cp      XL, ZL
    cpc     XH, ZH
    brlo    2b
    ret
ENDF mem_toa
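P.S.  For anyone who would rather read the inner loop in C, here is a
model of the .LnextByte/.LnextBit section with both tweaks applied.  The
names (divide_byte, neg_radix) are mine, not from the code above; the
point is that the carry out of the add with the pre-negated radix *is*
the quotient bit, and the sentinel 1 planted in Quot ends the loop after
exactly eight shifts:

/* One byte of the restoring division: shifts the 8 bits of num through
   the running remainder, reducing mod radix, and returns the quotient
   byte.  Assumes radix <= 128 so the remainder always fits in 8 bits. */
#include <stdint.h>

static uint8_t divide_byte(uint8_t num, uint8_t *rem, uint8_t neg_radix)
{
    uint8_t  r    = *rem;
    uint16_t quot = 1;                 /* ldi Quot, 1 -- the sentinel bit */

    do {
        /* lsl Num / rol Rem: next bit of num into the remainder */
        uint8_t bit = num >> 7;
        num = (uint8_t)(num << 1);
        r   = (uint8_t)((r << 1) | bit);

        /* add Rem, Radix with Radix pre-negated:
           carry set <==> r >= radix <==> keep the reduced value */
        uint16_t sum   = (uint16_t)(r + neg_radix);
        uint8_t  carry = sum >> 8;
        if (carry)
            r = (uint8_t)sum;
        /* else the "sub Rem, Radix" (still negated) restores r */

        /* rol Quot: quotient bit in at the bottom; after 8 shifts the
           sentinel reaches bit 8 and the loop ends (brcc .LnextBit) */
        quot = (quot << 1) | carry;
    } while (!(quot & 0x100));

    *rem = r;
    return (uint8_t)quot;              /* quotient byte, stored back over num */
}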