bug#17196: UTF-8 printf string formating problem

Pádraig Brady Mon, 07 Apr 2014 17:12:23 -0700

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
> 
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow
>>>> leverage the existing multibyte character aware alignment/truncation
>>>> code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>>   Dan Douglas wrote:
>>>   > ksh93 already has this feature using the "L" modifier:
>>>   > 
>>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>>   > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★
>>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>>   Ａ
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah'  # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C printf '%3Ls' 'blah' # count bytes
>>   LANG=C printf '%3s' 'blah'  # count bytes
> 
> Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined.  But I would LOVE to have a
> standardized spelling for counting characters instead of bytes.  The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.


Note that ksh seems to count cells, not characters, with %Ls
(see the ksh examples quoted above).
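
To make the three counting units concrete, here is a rough Python
sketch of truncation by bytes, chars and cells. It is not how ksh or
zsh implement this; the function names are made up for illustration,
and the width rule is a simplification I am assuming (combining and
format characters take 0 cells, East Asian wide/fullwidth take 2,
everything else, including "ambiguous" width, takes 1).

```python
import unicodedata


def cell_width(ch):
    """Approximate terminal cells for one character (simplified, assumed rule)."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters occupy no cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two cells
    return 1      # everything else, including ambiguous width


def truncate_bytes(s, limit):
    """Keep at most `limit` UTF-8 bytes; may cut a character in half."""
    return s.encode("utf-8")[:limit]


def truncate_chars(s, limit):
    """Keep at most `limit` characters (code points)."""
    return s[:limit]


def truncate_cells(s, limit):
    """Keep the longest prefix that fits in `limit` cells."""
    out, used = [], 0
    for ch in s:
        w = cell_width(ch)
        if used + w > limit:
            break
        out.append(ch)
        used += w
    return "".join(out)
```

With a limit of 3, truncate_cells() gives the same results as the ksh
%.3Ls examples quoted above for those three inputs, and
truncate_chars() gives the same results as the zsh ones.
truncate_bytes("\u2605\u2605", 4) ends in the middle of the second
star, which is the case Eric wants to avoid.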

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes.  Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).

But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

   printf '%3s' 'blah'  # count bytes
   printf '%3Ls' 'blah' # count cells
   LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
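
For the field width case, the difference between counting bytes and
counting cells can be sketched in Python as follows. This is only an
illustration under assumptions: pad() is a hypothetical helper, not
any shell's code, and the cell width rule is the same simplified one
(0 for combining/format characters, 2 for wide/fullwidth, 1 otherwise).

```python
import unicodedata


def cell_width(ch):
    """Approximate terminal cells for one character (simplified, assumed rule)."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def pad(s, width, unit):
    """Right-justify `s` in `width` units, like a printf '%Ns' field.

    unit is "bytes", "chars" or "cells" (hypothetical selector).
    """
    if unit == "bytes":
        used = len(s.encode("utf-8"))
    elif unit == "chars":
        used = len(s)
    elif unit == "cells":
        used = sum(cell_width(c) for c in s)
    else:
        raise ValueError("unknown unit: " + unit)
    # As with printf, a string already wider than the field is not padded.
    return " " * max(0, width - used) + s
```

For two stars (6 bytes, 2 chars, 2 cells) and a width of 8, counting
bytes adds 2 spaces while counting cells adds 6, so only the latter
lines up in a terminal column.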

thanks,
Pádraig.
