On 06/02/2021 23:38, Robert Elz via austin-group-l at The Open Group wrote:
> Date: Sat, 06 Feb 2021 21:55:19 +0100
> From: Steffen Nurpmeso <stef...@sdaoden.eu>
> Message-ID: <20210206205519.43rln%stef...@sdaoden.eu>
>
>   | Fiddling with bytes is something completely different.
>
> But how is the shell supposed to know?
>
> Consider
>
> U1=$'\u021c'
> U2=$'\u0a47'
> X1=$'\310\234'
> X2=$'\340\251\207'
>
> [...]
>
> Ignoring the bit about converting to other replacement chars, here,
> since I'm concerned with valid codepoints only, I don't think the
> shell should be converting this kind of thing via iconv() ... utilities
> might (including built-ins in sh, like echo or printf) but not the
> shell itself. In the above (assuming I did the conversions correctly)
> it should always be the case that $U1 = $X1 and $U2 = $X2, regardless
> of any locale settings. If I cannot assume that when writing a script
> then I have no idea how I would ever do anything with non-ASCII chars
> reliably.
bash, ksh and zsh, all of which support $'\u....', do convert the
Unicode code point to the current locale's encoding; I support this
behaviour and have implemented the same in my shell. For \u sequences
that ask for a
Unicode code point that is not representable in the current locale, the
\u sequence is left unconverted (bash, ksh, my shell) or causes the
shell to report an error (zsh).
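This can be checked directly (a sketch, not part of the original exchange; it assumes a shell with $'\u....' support, such as bash 4.2+, ksh93 or zsh):

```shell
# U+0041 is LATIN CAPITAL LETTER A in every locale, so its encoding
# never depends on the locale settings:
a=$'\u0041'
[ "$a" = A ] && echo 'U+0041 always encodes as A'

# U+20AC (EURO SIGN) is locale-dependent: it becomes the bytes
# e2 82 ac in a UTF-8 locale and the single byte a4 in ISO-8859-15.
# Where it cannot be encoded at all, bash and ksh leave the \u
# sequence unconverted and zsh reports an error.
printf '%s' $'\u20AC' | od -An -tx1
```

Piping through od -An -tx1 shows the raw bytes, which makes the locale-dependent encoding visible.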
This is useful for scripts that aim to work in a limited selection of
locales and know that certain characters are valid in all the supported
locales, but are not encoded the same way in all of them. If they want
to print a Euro symbol, for instance, they can write
echo $'\u20AC'
and be assured it works everywhere the Euro symbol is supported.
If they instead write
echo '€'
where the script is saved as UTF-8, the script will needlessly break
when it is run in an ISO-8859-15 environment.
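The encoding difference can be shown with iconv, independent of any shell (a sketch, not part of the original mail; it assumes an iconv that knows the ISO-8859-15 charset, as glibc's does):

```shell
# The Euro sign as stored in a script saved as UTF-8: three bytes.
printf '\342\202\254' | od -An -tx1
# prints: e2 82 ac

# The same character in ISO-8859-15 is the single byte a4. A shell
# reading the UTF-8 bytes in an ISO-8859-15 locale sees three bogus
# characters, whereas $'\u20AC' lets the shell produce the right
# byte for the locale itself.
printf '\342\202\254' | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1
# prints: a4
```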
Cheers,
Harald van Dijk