On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow
>>>> leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>>   A★
>>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>>   A
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>>   A★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah' # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C '%3Ls' 'blah' # count bytes
>>   LANG=C '%3s' 'blah' # count bytes
>
> Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined. But I would LOVE to have a
> standardized spelling for counting characters instead of bytes. The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note that ksh seems to count cells, not characters, with %Ls.

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedence, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes. Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority of uses
would prefer to specify cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).
But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

  printf '%3s' 'blah' # count bytes
  printf '%3Ls' 'blah' # count cells
  LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash, for example, where %Ls is rejected.

thanks,
Pádraig.
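
P.S. To make the three counting notions discussed above concrete, here is a
rough Python sketch. It is only an illustration under stated assumptions, not
how ksh, zsh, or coreutils implement anything. The function names are
hypothetical, and cell_width() is a crude stand-in for wcwidth() that treats
combining marks as zero cells and East Asian wide/fullwidth characters as two.

```python
import unicodedata


def cell_width(ch):
    """Approximate terminal columns for one code point.

    Assumption: this is a simplification, not a real wcwidth().
    """
    if unicodedata.combining(ch):
        return 0  # combining marks share the previous character's cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two cells
    return 1


def truncate_bytes(s, n, encoding="utf-8"):
    """Keep at most n bytes; this can cut a multibyte character in half."""
    return s.encode(encoding)[:n]


def truncate_chars(s, n):
    """Keep at most n characters (code points)."""
    return s[:n]


def truncate_cells(s, n):
    """Keep as many leading characters as fit in n terminal cells."""
    out = []
    used = 0
    for ch in s:
        w = cell_width(ch)
        if used + w > n:
            break
        out.append(ch)
        used += w
    return "".join(out)


if __name__ == "__main__":
    s = "a\u0301\u2605\u2605\u2605"  # 'a' + combining acute + three stars
    print(truncate_bytes(s, 3))      # raw bytes, may not be valid text
    print(truncate_chars(s, 3))
    print(truncate_cells(s, 3))
```

With the same input as the ksh and zsh tests quoted above, truncate_chars(s, 3)
keeps one star and truncate_cells(s, 3) keeps two, while truncate_bytes(s, 2)
ends in the middle of the combining accent, which is the case Eric warns about.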
