Dixi quod…
>Note that XBD 8.2 only specifies the "POSIX" and "C" locales,
>and that support for GNU-style language[_territory][.codeset]
>locales is XSI-optional. For the purpose of POSIX, mksh is an
>operating environment with 32-bit integers and support for only
>the C (= POSIX) locale.
>
>| If the locale value is not recognized by the implementation, the
>| behavior is unspecified.
>accordingly. (Admittedly, this means we should probably track the
>locale-related variables and imply set +U when LC_ALL=C is set…
>I’ll put that on my TODO, though low-priority.)
I have now thought about implementing locale tracking in the
mksh codebase and reached a decision.
Current behaviour is:
– busybox-style builtin calls get set ±U dependent
  on the LC_ALL, LC_CTYPE, LANG environment variables
– script and interactive shells get set -U or set +U
  if one of them is set on the command line, otherwise:
  – scripts always get set +U ("C" locale)
  – if the shell was compiled with -DMKSH_ASSUME_UTF8=0,
    interactive sessions get set +U ("C" locale)
  – if the shell was compiled with -DMKSH_ASSUME_UTF8,
    interactive sessions get set -U ("C.UTF-8" locale)
  – if 「setlocale(LC_CTYPE, "")」 is supported and
    returns something that matches /utf-?8/i, or if it is
    supported but its result does not match, yet a subsequent
    call of 「nl_langinfo(CODESET)」 is supported and returns
    something matching /utf-?8/i, an interactive session
    gets set -U
  – if ${LC_ALL:-${LC_CTYPE:-$LANG}} matches /utf-?8/i, same
  – otherwise, the interactive session gets set +U
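For illustration, the decision order above can be modelled roughly in
POSIX shell. This is only a sketch: the real logic lives in mksh's C
source and is not reproduced here, and the function name and all four
arguments are hypothetical (kind of shell, command-line flag, the
compile-time MKSH_ASSUME_UTF8 value with 1 standing in for "defined
without a value", and whatever setlocale or nl_langinfo reported). The
busybox-style builtin case is left out.

```shell
# Hypothetical model, not mksh code: prints -U or +U.
# $1 = script | interactive
# $2 = -U, +U, or empty (nothing given on the command line)
# $3 = 0, 1, or empty (MKSH_ASSUME_UTF8 not defined at build time)
# $4 = string from setlocale/nl_langinfo, empty if unsupported
utf8_mode_at_startup() {
	kind=$1 flag=$2 assume=$3 codeset=$4

	# an explicit -U or +U on the command line always wins
	case $flag in
	-U|+U) printf '%s\n' "$flag"; return ;;
	esac

	# scripts always start in the "C" locale
	[ "$kind" = script ] && { printf '%s\n' +U; return; }

	# compile-time default for interactive sessions
	case $assume in
	0) printf '%s\n' +U; return ;;
	1) printf '%s\n' -U; return ;;
	esac

	# result of setlocale(LC_CTYPE, "") or nl_langinfo(CODESET)
	case $codeset in
	*[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) printf '%s\n' -U; return ;;
	esac

	# the locale environment variables
	case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
	*[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) printf '%s\n' -U; return ;;
	esac

	printf '%s\n' +U
}
```

For example, 「utf8_mode_at_startup script "" "" UTF-8」 prints +U,
because scripts never consult the locale in this model.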
Recapitulating some constraints and users’ wishes:
– the locale is usually set dependent on the same environment
  variables mksh uses; UTF-8 locales usually have UTF-8 or
  utf8 in their name; legacy encoding locales usually don’t
– POSIX requires locale tracking (so e.g. “LC_ALL=C” inside
  a running shell session must turn off “set -U”, i.e. imply
  “set +U”)
– POSIX only requires support for the "C" locale, though
– mksh only supports UTF-8 and 8-bit modes, not full locales
– users on some systems, e.g. glibc-using GNU/Linux, wish
  for the shell to operate according to system locales for
  mksh; some may wish it for sh (but see below for lksh);
  most may not wish it for mksh-static (MKSH_SMALL) but
  some may; some users may not wish it (e.g. libc5/Linux)
– users on some systems (those without any UTF-8 locale,
  or other traditional ones, specifically MirBSD), as well
  as legacy scripts to be run on other systems (making this
  relevant for lksh) specifically require set +U to be the
  default for scripts, because those scripts typically do
  not, in contrast to many (but by far not all; you see them
  fail often enough in contemporary Debian still) scripts on
  modern GNU systems, begin with 'export LC_ALL=C'
– the user can always run “mksh -Uc 'string'” to force UTF-8 mode
– the builtin rules allow locales to be recognised at start:
    $ ln -s $(whence -p mksh) eval
    $ LC_ALL=C ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
    é 2
    $ LC_ALL=C.UTF-8 ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
    é 1
– it is easy to set the shell into the mode of the currently
  active locale using its own rules (set -u safe):
    set -U; [[ ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} = *[Uu][Tt][Ff]?(-)8* ]] || set +U
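To make the variable precedence in that one-liner explicit, here is a
small POSIX sketch that splits it into two steps. Both function names
are hypothetical helpers for illustration, not anything mksh provides:
the first prints the locale name that would be inspected (LC_ALL wins
over LC_CTYPE, which wins over LANG; empty counts as unset), the second
tests whether a name looks like a UTF-8 locale.

```shell
# Hypothetical helper: print the locale name the one-liner inspects.
effective_ctype() {
	printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

# Hypothetical helper: succeed if the given name contains
# "utf8" or "utf-8" in any letter case.
looks_utf8() {
	case $1 in
	*[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) return 0 ;;
	*) return 1 ;;
	esac
}

# Usage: the result depends on the caller's environment.
if looks_utf8 "$(effective_ctype)"; then
	echo "environment asks for a UTF-8 locale"
else
	echo "environment asks for an 8-bit locale"
fi
```

Note that this only looks at the locale's name, as the one-liner
does; it does not check whether the system actually has that locale.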
My ruling on the issue is therefore:
① POSIX locale tracking requires switching to set +U mode
   if e.g. “LC_ALL=C” is set, but “set -U” can only ever be
   enabled in a script if the (nōn-POSIX) command “set -U”
   is run, making operation unspecified. Therefore, there’s
   no concern wrt. POSIX.
② I have weighed the various user requirements and requests
   and come to the conclusion that implementing locale tracking
   in a manner that fits all users is impossible, and thus would
   require adding compile-time or run-time options, leading to
   the same mess (people having to select that option somehow);
   I have shown a one-liner to make set ±U match the locale
   from the environment, which can already now be used for this
   very purpose.
③ Implementing locale tracking only when “set -o posix” is
   active is declined, because POSIX only requires support for
   the "C" locale, and mksh does not implement anything other
   than UTF-8 anyway, so doing it would raise false hopes (“why
   does LANG=de_DE.UTF-8 work but LANG=de_DE@euro doesn’t?”).
④ Operations like splitting a string along multibyte character
   boundaries have no place in /bin/sh scripts, due to portability
   concerns, so those scripts need to be tailored for the various
   existing shells anyway; the one-liner from above can easily be
   added to the “if mksh (or lksh)” case, POSIXly:
    case ${KSH_VERSION:-} in
    *MIRBSD\ KSH*|*LEGACY\ KSH*)
        case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
        *[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*)
            set -U
            ;;
        *)
            set +U
            ;;
        esac
        # the next line implies: set +o braceexpand
        set -o posix
        ;;
    esac
   This code parses fine (mksh part ignored) with Heirloom sh.
⑤ Users are advised not to rely on the environment locale
   settings for shell behaviour anyway, as these can change, e.g.
   when using ssh without a shell (batch mode), from cronjobs, etc.
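As a sketch of what not relying on the environment could look like, a
script that works on bytes can pin its mode explicitly at the top. The
variable name 「mode」 is purely illustrative and only records what
happened; the guard is the same KSH_VERSION test as in ④, so other
shells skip the mksh-specific command.

```shell
# Illustrative only: pin 8-bit mode regardless of the inherited locale.
mode=unchanged
case ${KSH_VERSION:-} in
*MIRBSD\ KSH*|*LEGACY\ KSH*)
	# this script handles bytes, not UTF-8 characters
	set +U
	mode=bytes
	;;
esac
```

A script that needs UTF-8 mode instead would use “set -U” in the
same place.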
That concludes this issue.
Vincent: on a less “ruling” note, sorry for the bunch of recent
disagreements over shell behaviour we had. There are reasons for
the details of mksh’s behaviour: some good, some not so good,
some legacy, and sometimes it’s just that nobody had yet put any
thought into it. I have thought over this issue for a while and
in deep detail – I hope you can see that from this eMail – and
wish to thank you, never mind the outcome, for re-raising this
issue (locales stuff occasionally pops up).
Feel free to ask upstream (IRC, mailing list or eMail) next time.
I’m sorry for the initial harsh “closed” response, but debbugs
is not a discussion forum. I promise to try and listen to your
issues when you bring them up e.g. via IRC to me.
bye,
//mirabilos
--
Yay for having to rewrite other people's Bash scripts because bash
suddenly stopped supporting the bash extensions they make use of
-- Tonnerre Lombard in #nosec