tl;dr: Beta code included below; it passes my initial testing. I named it _itoa_engine, following "ftoa_engine", but suggestions are welcome. It is unfortunately one instruction longer than ultoa_invert:
text data bss dec hex filename 188 0 0 188 bc ultoa_invert.o 190 0 0 190 be itoa_engine.o 208 0 0 208 d0 itoa_engine.o (+OPTIMIZE_SPEED) 212 0 0 212 d4 itoa_engine.o (without HAVE_MUL) 230 0 0 230 e6 itoa_engine.o (+OPTIMIZE_SPEED) but I'm reasonably satisfied. Well, it turned out that my great idea from last time had a nasty bug in it. There's no really good way to check for termination after printing each digit. Recall that my bit buffer only has room for 15 bits of the number, so I arrange for it to always hold 8 <= n <= 15 bits. When it's about to fall from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead. The problem is that there are two ways to handle the length, and both of them have problems. One is to count so len=1 after fetching the last byte of input, and len=0 after fetching the first byte of padding. In that case, consider printing one byte of zero in hex. I start with 8 bits in the buffer (msbyte = 0x01), and len = 1. I print a '0', and check if len == 0. It's not, so I keep going and print a second zero before realizing that I should stop. Okay, that doesn't work, so what if I arrange so len counts the bytes remaining? len=0 after fetching the last byte, and it holds there after fetching padding. In that case, there are no extra digits, but there are missing digits. Consider printing 2^15 = 0x8000. I start out with len=1 and the last byte in the bit buffer. I print the final 0, then have to fetch another byte to start shifting bits into the lsbyte. So the bit buffer holds 800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when I print the second 0. At this point, len == 0 and lsbyte == 0, so I should stop printing. But there are more bits in the msbyte. There are various tricks you can try in hex, but they break down in octal. Checking for extra bits in the msbyte makes for messy (= large) code. So I went back to my original idea of checing for termination when fetching padding. It's a couple of instructions longer, but it works. Anyway, here's the current state of the function I intend to rewrite vfprintf() around. Any comments? I know about a few mechanical things: I need to integrate it into avr-libc and #include the appropriate headers rather than #define things myself. Likewise X_movw and so on. /* * bum * * 1. vt. (obs.) To make highly efficient, either in time or space, * often at the expense of clarity. * * The core integer-to-ASCII code. This takes as input arbitrary- * precision byte arrays, and outputs ASCII in bases 10 or 2^k. * (With mostly separate code.) * * This code is small and fast, but difficult to understand. It uses * tricky number theory (dividing by multiplying by the 2-adic inverse) * and low-level bit hacks to implement it. Sorry about that. * * The arguments are: * char *out: * Buffer in which the ASCII digits are placed, in little-endian * order. Must be reversed by the caller for printing. The caller is * also responsible for ensuring that it is big enough. * uint8_t *bin: * Buffer holding the input value. This is NOT const; it is used as * working storage by the algorithm. More specifically, negation * is done in place if requested, after which decimal conversion * leaves the buffer filled with zero, and binary-base conversion * leaves the buffer unmodified. * uint8_t len: * The length of the data at "bin". Must be between 1 and 255; * if you pass in 0, Bad Things will happen. * uint8_t flags: * A variable whise bits determine the form of the output. * flags bit 0: If set, the input is negated before printing. * This is a convenience to the caller for printing signed * numbers. * flags bit 1: If set, digits greater than 10 (a, b, c, ...) * are printed in lower case. If clear, they are printed in * upper case. This has no effect if the base is 10 or less. * flags bits 2..7: A contiguous bit mask (of the form (1 << k) - 1) * selecting the base to print. If zero, decimal is output. * If non-zero, a mask of the binary base to print (1 for * binary, 7 for octal, 15 for hex). No error-checking * is done; if you pass in something else, you'll get * peculiar output. * * The return value is a pointer to just past the end of the output. * * An overview of the code is as follows: * - The conditional negate is fairly straightforward. It's convenient * to do here because it leaves the pointer at the msbyte for the * next step. * - Then we strip off most significant zero bytes until we get to a * non-zero byte or only one byte is left. (The latter ensures we * print "0" for zero, rather than the null string.) * - Then the code splits depending on whether the remaining flags are * zero (decimal) or not. * * - The decimal code alternates two phases, finding the last digit (the * input mod 5), and then dividing the input by 10. Each involves a * pass over the input, although conveniently in opposite directions. * - Actually, the first pass (msbyte-to-lsbyte) divides the input by 2, * and stores the lsbit. It also computes the quotient mod 255, * which is then reduced mod 5. Combining (x % 2) and (x/2 % 5) * gives the digit. * - The second pass (lsbyte-to-msbyte) then divides the input by 5. * This is done by subtracting the remainder mod 5 to produce an exact * multiple of 5, then multiplying the result by the 2-adic expansion of * -1/5 (which has a particularly simple form 0x...33333) and negating. * - If the second pass produces an msbyte of 0, the length is reduced. * When the length reaches zero, conversion is done. * * - The binary code does one lsbit-to-msbit pass over the input. * - It mostly counts loop iterations with right shifts rather than * counters. When the shift produces a zero result (and a set carry, * since it must have been non-zero before), the count has expired. * - The input value is kept in a 2-byte shift register which holds * 8..15 bits of the input, with a 1 bit at the high end delimiting * the valid data. * - The main loop shifts the 16-bit buffer right one bit at a time until * it's time to print a digit, or the buffer is about to underflow to * 7 bits, meaning it's time to load another byte into the high half. * - The "digit" value is a mask of "bits already printed". We shift * it down until it's zero, then print the digit (lsb & mask), then * reload it with the mask. * - When the right shift of the msbyte of the buffer produces zero * (the delimiter bit was just shifted out), we load a new msbyte and * retry the shift. * - When we run out of input bytes, there's some tricky code to add * enough padding msbits to ensure remaining bits in the buffer * are printed, for any remaining bits in the buffer to be printed, * stopping when there are no more. * * The code often sets the carry bit several instructions before using it. * When reading, it's important to know that ld, st, mov, inc, dec, * and logical operations do not modify the carry bit. */ #ifndef __tmp_reg__ # define __tmp_reg__ r0 #endif #ifndef __zero_reg__ # define __zero_reg__ r1 #endif //#undef __AVR_HAVE_MUL__ #define __AVR_HAVE_MUL__ 1 #define OPTIMIZE_SPEED 1 /* Arguments */ #define out X /* Arrives in r24:r25, but we move it immediately */ #define out_lo r26 #define out_hi r27 #define bin Z /* Arrives in r22:r23, but we move it immediately */ #define bin_lo r30 #define bin_hi r31 #define len r20 #define flags r18 /* Mask, after removing two lsbits */ /* Local variables. The first four overlap with arguments. */ #define msb r25 /* 16-bit accumulator */ #define lsb r24 #define rem r23 /* Remainder mod 255, or mod 5 */ #define digit r22 /* Current digit */ #define tlen r21 /* Loop counter, copy of len or flags */ #define k r19 /* Constant value, usually the multiplier 0x33 */ #if __AVR_HAVE_MUL__ #define zero r18 /* Used instead of r1 to free up multiplier */ #else #define zero __zero_reg__ #endif .text .global _itoa_engine .type _itoa_engine, @function _itoa_engine: movw out_lo, r24 movw bin_lo, r22 lsr flags ; Lsbit is negate flag #if OPTIMIZE_SPEED ; 4 more instructions, but avoids a pass over the input (saving ; 9*len - 4 cycles) in the common case when no negation is desired. brcs 1f add bin_lo, len adc bin_hi, __zero_reg__ rjmp 3f #endif ;; Conditional negate, depending on carry. The standard identity ;; -x = ~x + 1 can be extended to do a conditional negate. ;; Given a mask of 0 or -1, (x^mask) - mask returns x or -x. ;; However, we would need the carry bit clear to start this, ;; and forming "mask" from the carry bit in one instruction ;; preserves the carry bit. So instead do the conditional ;; increment using add zero with carry. ;; ;; It turns out there's no faster way to do an unconditional ;; negate. Multi-precision negate is awkward on AVR, which lacks ;; negate-with-carry or any form of dest = src-dest instruction. ;; So it's two instructions minimum (mov+sbc or invert+adc), ;; and either way we need one instruction of setup, to clear ;; the carry or create a constant of 0xff. The latter because ;; the COM instruction overwrites carry, and there's no EORI, ;; so we need to load a constant 0xff into a register. 1: mov tlen, len sbc k, k ; Set to 0 or -1, carry preserved 2: ld __tmp_reg__, bin eor __tmp_reg__, k adc __tmp_reg__, __zero_reg__ st bin+, __tmp_reg__ dec tlen brne 2b ; Strip redundant trailing (most-significant) zeros from bin. 3: ld __tmp_reg__, -bin or __tmp_reg__, __tmp_reg__ brne 4f dec len brne 3b inc len ; Split into two branches based on msbits of flag. 4: lsr flags ; lower/upper case flag to carry brne .L_binary ;; The longest form of the decimal code (OPTIMIZE_SPEED && !HAVE_MUL) ;; is exactly 63 instructions (126 bytes) long, which is the extreme ;; limit of this conditional branch. If it gets any longer, you'll ;; have add an instruction. I suggest swapping the cases and dealing ;; with the fact that the .L_epilogue branch is now out of range by ;; placing it at the end of the binary code and having the decimal ;; code end in a rjmp to it. .L_decimal: adiw bin_lo, 1 #if __AVR_HAVE_MUL__ clr zero ldi k, 0x33 #endif ; The main loop, repeated while len > 0 .L_decimal_loop: #if OPTIMIZE_SPEED ldi digit, '0'+10 ; Adjust for later offset on rem #else ldi digit, '0' ; Sneakily preload the ASCII offset #endif ser rem ; Sum of all bytes, mod 255 mov tlen, len ;; Pass 1, msb-to-lsb: finding the input mod 10. ;; ;; We do two things here: divide by 2 (saving the lsbit), and sum ;; the result mod 255. This is then used to compute the result ;; mod 5, which combined with the digit gives the decimal digit ;; we want. The awkward thing is that we want *two* carry bits ;; in this loop (one for the shift and the other for summing the ;; bytes), so we use the lsbit of digit for one of them. .L_mod10: ld __tmp_reg__, -bin lsr digit ; Lsbit to carry bit ror __tmp_reg__ st bin, __tmp_reg__ rol digit ; Carry bit to lsbit add rem, __tmp_reg__ adc rem, zero ; End-around carry for mod 255 dec tlen brne .L_mod10 #if OPTIMIZE_SPEED ; Reduce rem mod 15 (from 1 <= rem <= 255 to 1 <= rem <= 15) mov __tmp_reg__, rem swap __tmp_reg__ andi rem, 0xf0 add rem, __tmp_reg__ ; Add high halves to get carry bit andi rem, 0xf0 swap rem adc rem, zero ; End-around carry ; Reduce rem mod 5 (average of 2.2 subtracts) 0: subi rem, 5 brcc 0b ; We are left with rem-5, which can be compensated for elsewhere. #else ;; Reduce rem mod 35, using a loop. ;; ;; This is a bit slower (22.82 cycles average to the end of the ;; mod-5 loop, vs. 12.6 cycles using the above), but saves 5 ;; instructions. Any multiple of 5 will work, but 35 is optimal. 0: subi rem, 35 brcc 0b ; Now add back mod 5 until we go positive again. 0: subi rem, -5 brcs 0b #endif ; Form and store ASCII digit (2*rem + digit) add digit, rem add digit, rem st out+, digit ;; Pass 2, lsb-to-msb: dividing by 5. ;; ;; Rather than do a general divide by 5, we can subtract rem ;; to produce a multiple of 5, and then do an exact division by ;; multiplying by the 2-adic expansion of 1/5, 0xCCC...CCD. ;; ;; To get this into an even simpler form, we multiply by -1/5 = ;; 0x333...333 and negate. Each byte is multiplied by 0x33 and ;; added to an accumulator to be used for each higher byte. ;; ;; The accumulator has to be 16 bits wide, but after storing ;; each output byte, we can fold the msbyte into the lsbyte. ;; ;; Negating the output can be "complement and add one", but ;; we do it as "subtract one and complement", initializing the ;; accumulator to 0xff, then complementing before storing. ;; ;; To subtract rem without an separate carry propagation pass, ;; subtract 0x33 times rem from the accumulator to start. ;; (Since 0 <= rem <= 4, this is very easy.) ; msb:lsb = 255 - (rem * 0x33). Guaranteed to fit into 8 bits. #if !OPTIMIZE_SPEED ldi lsb, 255 #elif __AVR_HAVE_MUL__ ldi lsb, 0 ; Adjust for pre-decrement of rem by 5 #else ldi lsb, 45 ; Adjust for pre-decrement of rem by 5 #endif #if __AVR_HAVE_MUL__ mul rem, k sub lsb, r0 #else ; Multiply by 0x11 mov __tmp_reg__, rem swap __tmp_reg__ ; rem < 16, so this is rem << 4 add __tmp_reg__, rem ; Multiply by 3 sub lsb, __tmp_reg__ sub lsb, __tmp_reg__ sub lsb, __tmp_reg__ #endif clr msb ; Here is the actual divide loop mov tlen, len .L_divide: ld __tmp_reg__, bin ; acc += 0x33 * r0. This product could be up to 14 bits long. #if __AVR_HAVE_MUL__ mul __tmp_reg__, k add lsb, r0 adc msb, r1 #else ; let rem:digit = __tmp_reg__ << 4 = __tmp_reg__ * 0x10 mov digit, __tmp_reg__ swap digit mov rem, digit andi rem, 15 ; Mask off high 4 bits eor digit, rem ; Mask off low 4 bits ; Add __tmp_reg__ to get __tmp_reg__ * 0x11 add digit, __tmp_reg__ adc rem, zero ; Now add it to the accumulator 3 times (there is no faster way) add lsb, digit adc msb, rem add lsb, digit adc msb, rem add lsb, digit adc msb, rem #endif ; Store the complemented accumulator (*bin++ = ~lsb) mov __tmp_reg__, lsb com __tmp_reg__ st bin+, __tmp_reg__ ; Fold the accumulator: msb:lsb = msb + lsb add lsb, msb clr msb adc msb, zero dec tlen brne .L_divide ;; End of main loop: check if the new msbyte was zero. ;; If so, drop it (decrement len), and stop when len == 0. cpse r0, zero rjmp .L_decimal_loop sbiw bin_lo, 1 dec len brne .L_decimal_loop #if __AVR_HAVE_MUL__ clr __zero_reg__ #endif ; Finally copy the out pointer to r24 and return it. .L_epilogue: movw r24, out_lo ret .L_binary: movw bin_lo, r22 ; Reset bin to lsbyte ; Done with args in r22-r25; now allowed to use digit, rem, lsb, msb ; Load startup values. ldi rem, 'A'-'0'-10 ; ASCII adjustment for hex digits past 9 brcc 0f ldi rem, 'a'-'0'-10 0: ldi msb, 1 ld lsb, bin+ .L_digit_out: ; Spit out a digit mov digit, flags and digit, lsb cpi digit, 10 brcs 0f add digit, rem ; Hex digit > 9 0: subi digit, -'0' st X+, digit mov digit, flags .L_bitloop: lsr msb brne 6f ; msb is empty; fetch another byte dec len ; Preserves carry breq 7f ; No more input ld msb, Z+ 5: ror msb ; Shift carry=1 into msbit 6: ror lsb lsr digit brne .L_bitloop ; if ((tlen >>= 1)== 0) { rjmp .L_digit_out ;; We've run out of bytes. If all 1 bits in lsb have been printed ;; (i.e. lsb <= digit), stop. If not, pad with enough zeros to print ;; one more digit, and so we'll check again after printing it. 7: cp digit, lsb ; if (lsb <= digit) return X brcc .L_epilogue inc len adc msb, digit ; msb = digit + 1, and clear carry rjmp 5b .size _itoa_engine, .-_itoa_engine _______________________________________________ AVR-libc-dev mailing list AVR-libc-dev@nongnu.org https://lists.nongnu.org/mailman/listinfo/avr-libc-dev