Hello,

it seems that GNU awk counts the length of multibyte strings in bytes
instead of in characters. So e.g. length(string) and
printf("%-20.20s<- end", string) give unexpected results (at least for
me) for multibyte strings.
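
To illustrate the difference between the two counts, here is a minimal
sketch in C. It assumes the input is valid UTF-8; the function name
utf8_char_count is hypothetical and is not taken from the gawk source.
It only shows why a byte count and a character count disagree for
multibyte strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count characters (code points) in a UTF-8 string by skipping
 * continuation bytes, which all have the bit pattern 10xxxxxx.
 * Assumption: s is valid UTF-8. Illustration only, not gawk code.
 */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Every byte that is not a continuation byte starts a character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 encoding of "Nürnberg", strlen() reports 9 bytes while
utf8_char_count() reports 8 characters, because "ü" occupies two bytes.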

POSIX defines the AWK functions index, length, match and substr to deal
with characters, while the definition of AWK's printf is a bit vague
("The printf statement shall produce output based on a notation similar
[sic] to the File Format Notation used to describe file formats [...]").
POSIX defines the field width and the precision for the 's' string
conversion in terms of bytes in both the "File Format Notation" and in
the definition of printf in the C library (stdio). So the definition for
AWK seems to be in bytes as well.
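
The byte-based precision of the C library can be seen with a small
sketch like the following. The wrapper name format_with_precision is
hypothetical; it only forwards to snprintf() with a "%.*s" conversion.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format src into out using "%.*s", i.e. the 's' conversion with an
 * explicit precision. Returns what snprintf() returns: the number of
 * bytes that the complete output needs, excluding the terminating NUL.
 */
int format_with_precision(char *out, size_t outsz,
                          const char *src, int precision)
{
    assert(out != NULL && src != NULL);
    return snprintf(out, outsz, "%.*s", precision, src);
}
```

With the UTF-8 string "äöü" (six bytes) and a precision of 3, the result
holds three bytes: the complete "ä" followed by only the first byte of
"ö". So the precision cuts a multibyte character in half.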

So the behaviour of GNU awk's printf may be correct, while the behaviour
of length(), index(), match() and substr() appears to be incorrect.

I have taken a quick look at the source to find a way to fix this. But
it seems that the string length is still used as a byte-count
everywhere.

The places where the code needs to be checked and/or changed seem to
include at least the length of regex matches, the use of the stlen
member of tree nodes and also the uses of strlen(). So this will involve
some work to get done.

E.g. in io.c: rs1scan():
---------------------------------------------------------------------------
                /* Check that newline found isn't the sentinel. */
                if (found && (bp - mbclen) < iop->dataend) {
                        /*
                         * set len to what we have so far, in case this
                         * is all there is
                         */
                        recm->len = bp - recm->start - mbclen;
                        recm->rt_start = bp - mbclen;
                        recm->rt_len = mbclen;
                        *state = NOSTATE;
                        return REC_OK;
                } else {
                        /* also set len */
                        recm->len = bp - recm->start;
                        *state = INDATA;
                        iop->scanoff = bp - iop->off;
                        return NOTERM;
                }
        }
#endif
        while (*bp != rs)
                bp++;

        /* set len to what we have so far, in case this is all there is */
        recm->len = bp - recm->start;
---------------------------------------------------------------------------

Is this a known problem and is someone working on this?

Further questions:

 - Are there plans to change the definition of the field width and
   precision of the printf string conversion to deal with characters?

 - How will/should the presence of combining characters influence
   character counting?
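
To make the second question concrete, here is a rough sketch in C that
counts both code points and "base" characters. It makes two assumptions:
the input is valid UTF-8, and only the Combining Diacritical Marks block
(U+0300 to U+036F) is treated as combining. A real implementation would
need complete Unicode tables. The names utf8_decode and utf8_count are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one code point from valid UTF-8 and return the number of
 * bytes consumed. No validation is done (assumption: valid input).
 */
static size_t utf8_decode(const unsigned char *p, unsigned long *cp)
{
    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3FUL);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((p[0] & 0x0FUL) << 12) | ((p[1] & 0x3FUL) << 6)
            | (p[2] & 0x3FUL);
        return 3;
    }
    *cp = ((p[0] & 0x07UL) << 18) | ((p[1] & 0x3FUL) << 12)
        | ((p[2] & 0x3FUL) << 6) | (p[3] & 0x3FUL);
    return 4;
}

/*
 * Count all code points and, separately, the code points that are not
 * in U+0300..U+036F (a deliberately incomplete notion of "combining").
 */
void utf8_count(const char *s, size_t *code_points, size_t *base_chars)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL && code_points != NULL && base_chars != NULL);
    *code_points = 0;
    *base_chars = 0;
    while (*p != '\0') {
        unsigned long cp;

        p += utf8_decode(p, &cp);
        (*code_points)++;
        if (cp < 0x0300 || cp > 0x036F)
            (*base_chars)++;
    }
}
```

For "u" followed by U+0308 (combining diaeresis) this gives two code
points but one base character, while the precomposed "ü" gives one of
each. Which of the two counts length() should return is exactly the
open question.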

Regards,

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux AG, Nürnberg


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
