Dixi quod…

>Note that XBD 8.2 only specifies the "POSIX" and "C" locales,
>and that support for GNU-style language[_territory][.codeset]
>locales is XSI-optional. For the purpose of POSIX, mksh is an
>operating environment with 32-bit integers and support for only
>the C (= POSIX) locale.
>
>| If the locale value is not recognized by the implementation, the
>| behavior is unspecified.
>accordingly. (Admittedly, this means we should probably track the
>locale-related variables and imply set +U when LC_ALL=C is set…
>I’ll put that on my TODO, though low-priority.)

I have now thought about implementing locale tracking in the mksh
codebase and reached a decision.

Current behaviour is:

– busybox-style builtin calls get set ±U dependent on the
  LC_ALL, LC_CTYPE, LANG environment variables
– script and interactive shells get set -U or set +U if one
  of these flags is given on the command line, otherwise:
  – scripts always get set +U ("C" locale)
  – if the shell was compiled with -DMKSH_ASSUME_UTF8=0,
    interactive sessions get set +U ("C" locale)
  – if the shell was compiled with -DMKSH_ASSUME_UTF8,
    interactive sessions get set -U ("C.UTF-8" locale)
  – if 「setlocale(LC_CTYPE, "")」 is supported and returns
    something that matches /utf-?8/i, or if it is supported
    but its result does not match, yet a subsequent call of
    「nl_langinfo(CODESET)」 is supported and returns something
    matching /utf-?8/i, an interactive session gets set -U
  – if ${LC_ALL:-${LC_CTYPE:-$LANG}} matches /utf-?8/i, same
  – otherwise, the interactive session gets set +U

Recapitulating some constraints and users’ wishes:

– the locale is usually set dependent on the same environment
  variables mksh uses; UTF-8 locales usually have UTF-8 or utf8
  in their name; legacy encoding locales usually don’t
– POSIX requires locale tracking (so e.g. “LC_ALL=C” inside a
  running shell session must turn off “set -U”, i.e. imply
  “set +U”)
– POSIX only requires support for the "C" locale, though
– mksh only supports UTF-8 and 8-bit modes, not full locales
– users on some systems, e.g. glibc-using GNU/Linux, wish for
  the shell to operate according to system locales for mksh;
  some may wish it for sh (but see below for lksh); most may
  not wish it for mksh-static (MKSH_SMALL) but some may; some
  users may not wish it at all (e.g. libc5/Linux)
– users on some systems (those without any UTF-8 locale, or
  other traditional ones, specifically MirBSD), as well as
  legacy scripts to be run on other systems (making this
  relevant for lksh), specifically require set +U to be the
  default for scripts, because those scripts typically do not
  begin with 'export LC_ALL=C', in contrast to many scripts on
  modern GNU systems (but by far not all; you see them fail
  often enough in contemporary Debian still)
– the user can always run “mksh -Uc 'string'” to force UTF-8 mode
– the builtin rules allow locales to be recognised at start:

    $ ln -s $(whence -p mksh) eval
    $ LC_ALL=C ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
    é 2
    $ LC_ALL=C.UTF-8 ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
    é 1

– it is easy to set the shell into the mode of the currently
  active locale using its own rules (set -u safe):

    set -U; [[ ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} = *[Uu][Tt][Ff]?(-)8* ]] || set +U

My ruling on the issue is therefore:

① POSIX locale tracking requires switching to set +U mode if
  e.g. “LC_ALL=C” is set, but “set -U” can only ever be enabled
  in a script if the (nōn-POSIX) command “set -U” is run, which
  makes operation unspecified. Therefore, there’s no concern
  wrt. POSIX.

② I have weighed the various user requirements and requests
  and come to the conclusion that implementing locale tracking
  in a manner that fits all users is impossible; it would thus
  require adding compile-time or run-time options, leading to
  the same mess (people having to select that option somehow).
  I have shown a one-liner to make set ±U match the locale from
  the environment, which can already be used for this very
  purpose now.

③ Implementing locale tracking for the “set -o posix” case is
  declined, because POSIX only requires support for the "C"
  locale, and mksh does not implement anything other than UTF-8
  anyway, so doing it would raise false hopes (“why does
  LANG=de_DE.UTF-8 work but LANG=de_DE@euro doesn’t?”).

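For illustration only, the startup defaults listed under “Current behaviour” can be modelled in POSIX sh roughly as follows. This is a sketch, not mksh source: the function name default_utf8_mode and its three arguments (shell kind, command-line flag, compile-time MKSH_ASSUME_UTF8 value) are made up here, the busybox-style builtin case is not covered, and the setlocale/nl_langinfo probe is left out, so only the environment-variable fallback is modelled.

```shell
# Hypothetical model of the startup rules described in this mail.
# $1: "script" or "interactive"
# $2: flag given on the command line: "-U", "+U" or "" (none)
# $3: compile-time MKSH_ASSUME_UTF8: "0", "1" or "" (not defined)
# Prints "-U" (UTF-8 mode) or "+U" (8-bit mode, "C" locale).
default_utf8_mode() {
    kind=$1 flag=$2 assume=$3

    # an explicit -U or +U on the command line always wins
    case $flag in
    -U|+U) printf '%s\n' "$flag"; return 0 ;;
    esac

    # scripts always start in the "C" locale
    if [ "$kind" = script ]; then
        printf '%s\n' +U
        return 0
    fi

    # compile-time choice for interactive sessions
    case $assume in
    0) printf '%s\n' +U; return 0 ;;
    1) printf '%s\n' -U; return 0 ;;
    esac

    # environment fallback (setlocale/nl_langinfo probe omitted here)
    case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
    *[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) printf '%s\n' -U ;;
    *) printf '%s\n' +U ;;
    esac
}
```

Called as “default_utf8_mode interactive '' ''” with LANG=de_DE.UTF-8 in the environment and LC_ALL and LC_CTYPE unset or empty, it prints -U; for a script without a command-line flag it prints +U regardless of the environment.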
④ Operations like splitting a string along multibyte character
  boundaries have no place in /bin/sh scripts, due to portability
  concerns, so those scripts need to be tailored for the various
  existing shells anyway; the one-liner from above can easily be
  added to the “if mksh (or lksh)” case, POSIXly:

    case ${KSH_VERSION:-} in
    *MIRBSD\ KSH*|*LEGACY\ KSH*)
        case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
        *[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) set -U ;;
        *) set +U ;;
        esac
        # the next line implies: set +o braceexpand
        set -o posix
        ;;
    esac

  This code parses fine (mksh part ignored) with Heirloom sh.

⑤ Users are advised not to rely on the environment locale
  settings for shell behaviour anyway, as those can change, e.g.
  when using ssh without a shell (batch mode), from cronjobs, etc.

That concludes this issue.

Vincent: on a less “ruling” note, sorry for the bunch of recent
disagreements over shell behaviour we had. There are reasons for
the details of mksh’s behaviour: some good, some not so good, some
legacy, and sometimes it’s just because nobody had yet put any
thought into it. I have thought over this issue for a while and in
deep detail – I hope you can see that from this eMail – and wish
to, never mind the outcome, thank you for re-raising this issue
(locales stuff occasionally pops up).

Feel free to ask upstream (IRC, mailing list or eMail) next time.
I’m sorry for the initial harsh “closed” response, but debbugs is
not a discussion forum. I promise to try and listen to your issues
when you bring them up to me, e.g. via IRC.

bye,
//mirabilos
-- 
Yay for having to rewrite other people's Bash scripts because bash
suddenly stopped supporting the bash extensions they make use of
    -- Tonnerre Lombard in #nosec