Re: [avr-libc-dev] Printing octal (brain dump)

George Spelvin Sun, 18 Dec 2016 23:04:25 -0800

tl;dr: Beta code included below; it passes my initial testing.  I named
it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
It is unfortunately one instruction longer than ultoa_invert:


   text    data     bss     dec     hex filename
    188       0       0     188      bc ultoa_invert.o
    190       0       0     190      be itoa_engine.o
    208       0       0     208      d0 itoa_engine.o (+OPTIMIZE_SPEED)
    212       0       0     212      d4 itoa_engine.o (without HAVE_MUL)
    230       0       0     230      e6 itoa_engine.o (+OPTIMIZE_SPEED)

but I'm reasonably satisfied.


Well, it turned out that my great idea from last time had a nasty
bug in it.  There's no really good way to check for termination after
printing each digit.

Recall that my bit buffer only has room for 15 bits of the number, so I
arrange for it to always hold 8 <= n <= 15 bits.  When it's about to fall
from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead.

The problem is that there are two ways to handle the length, and
both of them have problems.

One is to count so len=1 after fetching the last byte of input, and
len=0 after fetching the first byte of padding.

In that case, consider printing one byte of zero in hex.

I start with 8 bits in the buffer (msbyte = 0x01), and len = 1.  I print a
'0', and check if len == 0.  It's not, so I keep going and print a second
zero before realizing that I should stop.


Okay, that doesn't work, so what if I arrange so len counts the bytes
remaining?  len=0 after fetching the last byte, and it holds there
after fetching padding.

In that case, there are no extra digits, but there are missing
digits.  Consider printing 2^15 = 0x8000.  I start out with
len=1 and the last byte in the bit buffer.  I print the final 0, then
have to fetch another byte to start shifting bits into the lsbyte.

So the bit buffer holds 800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when
I print the second 0.  At this point, len == 0 and lsbyte == 0, so
I should stop printing.  But there are more bits in the msbyte.

There are various tricks you can try in hex, but they break down in
octal.  Checking for extra bits in the msbyte makes for messy
(= large) code.


So I went back to my original idea of checing for termination when
fetching padding.  It's a couple of instructions longer, but it works.

Anyway, here's the current state of the function I intend to rewrite
vfprintf() around.  Any comments?  I know about a few mechanical things:
I need to integrate it into avr-libc and #include the appropriate headers
rather than #define things myself.  Likewise X_movw and so on.




/*
 *      bum
 *
 *      1. vt. (obs.) To make highly efficient, either in time or space,
 *         often at the expense of clarity.
 *
 * The core integer-to-ASCII code.  This takes as input arbitrary-
 * precision byte arrays, and outputs ASCII in bases 10 or 2^k.
 * (With mostly separate code.)
 *
 * This code is small and fast, but difficult to understand.  It uses
 * tricky number theory (dividing by multiplying by the 2-adic inverse)
 * and low-level bit hacks to implement it.  Sorry about that.
 *
 * The arguments are:
 * char *out:
 *      Buffer in which the ASCII digits are placed, in little-endian
 *      order.  Must be reversed by the caller for printing.  The caller is
 *      also responsible for ensuring that it is big enough.
 * uint8_t *bin:
 *      Buffer holding the input value.  This is NOT const; it is used as
 *      working storage by the algorithm.  More specifically, negation
 *      is done in place if requested, after which decimal conversion
 *      leaves the buffer filled with zero, and binary-base conversion
 *      leaves the buffer unmodified.
 * uint8_t len:
 *      The length of the data at "bin".  Must be between 1 and 255;
 *      if you pass in 0, Bad Things will happen.
 * uint8_t flags:
 *      A variable whise bits determine the form of the output.
 *      flags bit 0: If set, the input is negated before printing.
 *              This is a convenience to the caller for printing signed
 *              numbers.
 *      flags bit 1: If set, digits greater than 10 (a, b, c, ...)
 *              are printed in lower case.  If clear, they are printed in
 *              upper case.  This has no effect if the base is 10 or less.
 *      flags bits 2..7: A contiguous bit mask (of the form (1 << k) - 1)
 *              selecting the base to print.  If zero, decimal is output.
 *              If non-zero, a mask of the binary base to print (1 for
 *              binary, 7 for octal, 15 for hex).  No error-checking
 *              is done; if you pass in something else, you'll get
 *              peculiar output.
 *
 * The return value is a pointer to just past the end of the output.
 *
 * An overview of the code is as follows:
 * - The conditional negate is fairly straightforward.  It's convenient
 *   to do here because it leaves the pointer at the msbyte for the
 *   next step.
 * - Then we strip off most significant zero bytes until we get to a
 *   non-zero byte or only one byte is left.  (The latter ensures we
 *   print "0" for zero, rather than the null string.)
 * - Then the code splits depending on whether the remaining flags are
 *   zero (decimal) or not.
 *
 * - The decimal code alternates two phases, finding the last digit (the
 *   input mod 5), and then dividing the input by 10.  Each involves a
 *   pass over the input, although conveniently in opposite directions.
 * - Actually, the first pass (msbyte-to-lsbyte) divides the input by 2,
 *   and stores the lsbit.  It also computes the quotient mod 255,
 *   which is then reduced mod 5.  Combining (x % 2) and (x/2 % 5)
 *   gives the digit.
 * - The second pass (lsbyte-to-msbyte) then divides the input by 5.
 *   This is done by subtracting the remainder mod 5 to produce an exact
 *   multiple of 5, then multiplying the result by the 2-adic expansion of
 *   -1/5 (which has a particularly simple form 0x...33333) and negating.
 * - If the second pass produces an msbyte of 0, the length is reduced.
 *   When the length reaches zero, conversion is done.
 *
 * - The binary code does one lsbit-to-msbit pass over the input.
 * - It mostly counts loop iterations with right shifts rather than
 *   counters.  When the shift produces a zero result (and a set carry,
 *   since it must have been non-zero before), the count has expired.
 * - The input value is kept in a 2-byte shift register which holds
 *   8..15 bits of the input, with a 1 bit at the high end delimiting
 *   the valid data.
 * - The main loop shifts the 16-bit buffer right one bit at a time until
 *   it's time to print a digit, or the buffer is about to underflow to
 *   7 bits, meaning it's time to load another byte into the high half.
 * - The "digit" value is a mask of "bits already printed".  We shift
 *   it down until it's zero, then print the digit (lsb & mask), then
 *   reload it with the mask.
 * - When the right shift of the msbyte of the buffer produces zero
 *   (the delimiter bit was just shifted out), we load a new msbyte and
 *   retry the shift.
 * - When we run out of input bytes, there's some tricky code to add
 *   enough padding msbits to ensure remaining bits in the buffer
 *   are printed, for any remaining bits in the buffer to be printed,
 *   stopping when there are no more.
 *
 * The code often sets the carry bit several instructions before using it.
 * When reading, it's important to know that ld, st, mov, inc, dec,
 * and logical operations do not modify the carry bit.
 */

#ifndef __tmp_reg__
# define __tmp_reg__ r0
#endif
#ifndef __zero_reg__
# define __zero_reg__ r1
#endif

//#undef __AVR_HAVE_MUL__
#define __AVR_HAVE_MUL__ 1
#define OPTIMIZE_SPEED 1

/* Arguments */
#define out     X       /* Arrives in r24:r25, but we move it immediately */
#define out_lo  r26
#define out_hi  r27
#define bin     Z       /* Arrives in r22:r23, but we move it immediately */
#define bin_lo  r30
#define bin_hi  r31
#define len     r20
#define flags   r18     /* Mask, after removing two lsbits */

/* Local variables.  The first four overlap with arguments. */
#define msb     r25     /* 16-bit accumulator */
#define lsb     r24
#define rem     r23     /* Remainder mod 255, or mod 5 */
#define digit   r22     /* Current digit */
#define tlen    r21     /* Loop counter, copy of len or flags */
#define k       r19     /* Constant value, usually the multiplier 0x33 */

#if __AVR_HAVE_MUL__
#define zero    r18     /* Used instead of r1 to free up multiplier */
#else
#define zero    __zero_reg__
#endif

        .text
        .global _itoa_engine
        .type   _itoa_engine, @function
_itoa_engine:
        movw    out_lo, r24
        movw    bin_lo, r22

        lsr     flags           ; Lsbit is negate flag
#if OPTIMIZE_SPEED
        ; 4 more instructions, but avoids a pass over the input (saving
        ; 9*len - 4 cycles) in the common case when no negation is desired.
        brcs    1f
        add     bin_lo, len
        adc     bin_hi, __zero_reg__
        rjmp    3f
#endif
        ;; Conditional negate, depending on carry.  The standard identity
        ;; -x = ~x + 1 can be extended to do a conditional negate.
        ;; Given a mask of 0 or -1, (x^mask) - mask returns x or -x.
        ;; However, we would need the carry bit clear to start this,
        ;; and forming "mask" from the carry bit in one instruction
        ;; preserves the carry bit.  So instead do the conditional
        ;; increment using add zero with carry.
        ;;
        ;; It turns out there's no faster way to do an unconditional
        ;; negate.  Multi-precision negate is awkward on AVR, which lacks
        ;; negate-with-carry or any form of dest = src-dest instruction.
        ;; So it's two instructions minimum (mov+sbc or invert+adc),
        ;; and either way we need one instruction of setup, to clear
        ;; the carry or create a constant of 0xff.  The latter because
        ;; the COM instruction overwrites carry, and there's no EORI,
        ;; so we need to load a constant 0xff into a register.

1:      mov     tlen, len
        sbc     k, k            ; Set to 0 or -1, carry preserved

2:      ld      __tmp_reg__, bin
        eor     __tmp_reg__, k
        adc     __tmp_reg__, __zero_reg__
        st      bin+, __tmp_reg__
        dec     tlen
        brne    2b

        ; Strip redundant trailing (most-significant) zeros from bin.
3:      ld      __tmp_reg__, -bin
        or      __tmp_reg__, __tmp_reg__
        brne    4f      
        dec     len
        brne    3b
        inc     len

        ; Split into two branches based on msbits of flag.
4:      lsr     flags           ; lower/upper case flag to carry
        brne    .L_binary
        ;; The longest form of the decimal code (OPTIMIZE_SPEED && !HAVE_MUL)
        ;; is exactly 63 instructions (126 bytes) long, which is the extreme
        ;; limit of this conditional branch.  If it gets any longer, you'll
        ;; have add an instruction.  I suggest swapping the cases and dealing
        ;; with the fact that the .L_epilogue branch is now out of range by
        ;; placing it at the end of the binary code and having the decimal
        ;; code end in a rjmp to it.

.L_decimal:
        adiw    bin_lo, 1
#if __AVR_HAVE_MUL__
        clr     zero
        ldi     k, 0x33
#endif

        ; The main loop, repeated while len > 0
.L_decimal_loop:
#if OPTIMIZE_SPEED
        ldi     digit, '0'+10   ; Adjust for later offset on rem
#else
        ldi     digit, '0'      ; Sneakily preload the ASCII offset
#endif
        ser     rem             ; Sum of all bytes, mod 255
        mov     tlen, len

        ;; Pass 1, msb-to-lsb: finding the input mod 10.
        ;;
        ;; We do two things here: divide by 2 (saving the lsbit), and sum
        ;; the result mod 255.  This is then used to compute the result
        ;; mod 5, which combined with the digit gives the decimal digit
        ;; we want.  The awkward thing is that we want *two* carry bits
        ;; in this loop (one for the shift and the other for summing the
        ;; bytes), so we use the lsbit of digit for one of them.
.L_mod10:
        ld      __tmp_reg__, -bin
        lsr     digit           ; Lsbit to carry bit
        ror     __tmp_reg__
        st      bin, __tmp_reg__
        rol     digit           ; Carry bit to lsbit
        add     rem, __tmp_reg__
        adc     rem, zero       ; End-around carry for mod 255

        dec     tlen
        brne    .L_mod10

#if OPTIMIZE_SPEED
        ; Reduce rem mod 15  (from 1 <= rem <= 255 to 1 <= rem <= 15)
        mov     __tmp_reg__, rem
        swap    __tmp_reg__
        andi    rem, 0xf0
        add     rem, __tmp_reg__        ; Add high halves to get carry bit
        andi    rem, 0xf0
        swap    rem
        adc     rem, zero               ; End-around carry
        ; Reduce rem mod 5 (average of 2.2 subtracts)
0:      subi    rem, 5
        brcc    0b
        ; We are left with rem-5, which can be compensated for elsewhere.
#else
        ;; Reduce rem mod 35, using a loop.
        ;;
        ;; This is a bit slower (22.82 cycles average to the end of the
        ;; mod-5 loop, vs. 12.6 cycles using the above), but saves 5
        ;; instructions.  Any multiple of 5 will work, but 35 is optimal.
0:      subi    rem, 35
        brcc    0b
        ; Now add back mod 5 until we go positive again.
0:      subi    rem, -5
        brcs    0b
#endif

        ; Form and store ASCII digit (2*rem + digit)
        add     digit, rem
        add     digit, rem
        st      out+, digit

        ;; Pass 2, lsb-to-msb: dividing by 5.
        ;;
        ;; Rather than do a general divide by 5, we can subtract rem
        ;; to produce a multiple of 5, and then do an exact division by
        ;; multiplying by the 2-adic expansion of 1/5, 0xCCC...CCD.
        ;;
        ;; To get this into an even simpler form, we multiply by -1/5 =
        ;; 0x333...333 and negate.  Each byte is multiplied by 0x33 and
        ;; added to an accumulator to be used for each higher byte.
        ;;
        ;; The accumulator has to be 16 bits wide, but after storing
        ;; each output byte, we can fold the msbyte into the lsbyte.
        ;;
        ;; Negating the output can be "complement and add one", but
        ;; we do it as "subtract one and complement", initializing the
        ;; accumulator to 0xff, then complementing before storing.
        ;;
        ;; To subtract rem without an separate carry propagation pass,
        ;; subtract 0x33 times rem from the accumulator to start.
        ;; (Since 0 <= rem <= 4, this is very easy.)

        ; msb:lsb = 255 - (rem * 0x33).  Guaranteed to fit into 8 bits.
#if !OPTIMIZE_SPEED
        ldi     lsb, 255
#elif __AVR_HAVE_MUL__
        ldi     lsb, 0          ; Adjust for pre-decrement of rem by 5
#else
        ldi     lsb, 45         ; Adjust for pre-decrement of rem by 5
#endif
#if __AVR_HAVE_MUL__
        mul     rem, k
        sub     lsb, r0
#else
        ; Multiply by 0x11
        mov     __tmp_reg__, rem
        swap    __tmp_reg__     ; rem < 16, so this is rem << 4
        add     __tmp_reg__, rem
        ; Multiply by 3
        sub     lsb, __tmp_reg__
        sub     lsb, __tmp_reg__
        sub     lsb, __tmp_reg__
#endif
        clr     msb

        ; Here is the actual divide loop
        mov     tlen, len
.L_divide:
        ld      __tmp_reg__, bin
        ; acc += 0x33 * r0.  This product could be up to 14 bits long.
#if __AVR_HAVE_MUL__
        mul     __tmp_reg__, k
        add     lsb, r0
        adc     msb, r1
#else
        ; let rem:digit = __tmp_reg__ << 4 = __tmp_reg__ * 0x10
        mov     digit, __tmp_reg__
        swap    digit
        mov     rem, digit
        andi    rem, 15         ; Mask off high 4 bits
        eor     digit, rem      ; Mask off low 4 bits

        ; Add __tmp_reg__ to get __tmp_reg__ * 0x11
        add     digit, __tmp_reg__
        adc     rem, zero

        ; Now add it to the accumulator 3 times (there is no faster way)
        add     lsb, digit
        adc     msb, rem
        add     lsb, digit
        adc     msb, rem
        add     lsb, digit
        adc     msb, rem
#endif
        ; Store the complemented accumulator (*bin++ = ~lsb)
        mov     __tmp_reg__, lsb
        com     __tmp_reg__
        st      bin+, __tmp_reg__

        ; Fold the accumulator: msb:lsb = msb + lsb
        add     lsb, msb
        clr     msb
        adc     msb, zero

        dec     tlen
        brne    .L_divide

        ;; End of main loop: check if the new msbyte was zero.
        ;; If so, drop it (decrement len), and stop when len == 0.
        cpse    r0, zero
         rjmp   .L_decimal_loop
        sbiw    bin_lo, 1
        dec     len
        brne    .L_decimal_loop

#if __AVR_HAVE_MUL__
        clr     __zero_reg__
#endif
        ; Finally copy the out pointer to r24 and return it.
.L_epilogue:
        movw    r24, out_lo
        ret


.L_binary:
        movw    bin_lo, r22     ; Reset bin to lsbyte
        ; Done with args in r22-r25; now allowed to use digit, rem, lsb, msb

        ; Load startup values.
        ldi     rem, 'A'-'0'-10 ; ASCII adjustment for hex digits past 9
        brcc    0f
         ldi    rem, 'a'-'0'-10
0:      ldi     msb, 1
        ld      lsb, bin+

.L_digit_out:                   ; Spit out a digit
        mov     digit, flags
        and     digit, lsb
        cpi     digit, 10
        brcs    0f
         add    digit, rem      ; Hex digit > 9
0:      subi    digit, -'0'
        st      X+, digit
        mov     digit, flags
.L_bitloop:
        lsr     msb
        brne    6f
         ; msb is empty; fetch another byte
         dec    len             ; Preserves carry
         breq   7f              ; No more input
         ld     msb, Z+
5:       ror    msb             ; Shift carry=1 into msbit
6:      ror     lsb
        lsr     digit
        brne    .L_bitloop      ; if ((tlen >>= 1)== 0) {
        rjmp    .L_digit_out

        ;; We've run out of bytes.  If all 1 bits in lsb have been printed
        ;; (i.e. lsb <= digit), stop.  If not, pad with enough zeros to print
        ;; one more digit, and so we'll check again after printing it.
7:      cp      digit, lsb      ; if (lsb <= digit) return X
        brcc    .L_epilogue
        inc     len
        adc     msb, digit      ; msb = digit + 1, and clear carry
        rjmp    5b
        .size   _itoa_engine, .-_itoa_engine

_______________________________________________
AVR-libc-dev mailing list
AVR-libc-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev

Re: [avr-libc-dev] Printing octal (brain dump)

Reply via email to