Hey Geoff.
Thanks and funny you came up with that...
On Tue, 2022-01-25 at 09:25 +0000, Geoff Clare via austin-group-l at
The Open Group wrote:
> You are correct, and a common method of preserving trailing newlines
> is to append a non-newline character and then strip it, e.g.:
>
> output=$(some_command && printf x)
> output=${output%x}
... which is exactly where I was starting from.
But it seems to be trickier than that:
Several sources[0] explain that this alone may not work, namely when
some_command outputs bytes that are an invalid character encoding in
the current locale but that, together with the 'x', do in fact form a
valid character, so that stripping off the 'x' subsequently fails.
A common example given is a lone \x88 followed by the sentinel 'x'
(\x78) in zh_HK.big5hkscs.
The workaround given for that is to set LC_ALL=C after the sentinel has
been appended but before it is stripped off again, and to restore the
previous state of LC_ALL (old value or unset) afterwards.
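For reference, a minimal sketch of that workaround as I understand it
from [0]. The some_command function is only a stand-in, and the
variable names old_lc_all and lc_all_was_set are made up for this
example:

```shell
# Stand-in for the real command; its output ends in two newlines.
some_command() { printf 'foo\n\n'; }

# Append the sentinel inside the command substitution.
output=$(some_command && printf x)

# Remember whether LC_ALL was set at all, and if so its value.
if [ "${LC_ALL+set}" = set ]; then
    old_lc_all=$LC_ALL
    lc_all_was_set=1
else
    lc_all_was_set=0
fi

# Strip the sentinel bytewise, i.e. while in the C locale.
LC_ALL=C
output=${output%x}

# Restore the previous state of LC_ALL (old value or unset).
if [ "$lc_all_was_set" -eq 1 ]; then
    LC_ALL=$old_lc_all
else
    unset LC_ALL
fi
```

Whether a shell is required to honour the changed LC_ALL immediately
for the pattern matching is an assumption of this sketch.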
In [1] I was pointed to the fact that POSIX requires:
- “The encoded values associated with <period>, <slash>, <newline>, and
  <carriage-return> shall be invariant across all locales supported by
  the implementation.”
  => which means AFAIU that these will have the same binary
  representation in any locale/encoding.
- “Likewise, the byte values used to encode <period>, <slash>,
  <newline>, and <carriage-return> shall not occur as part of any
  other character in any locale.”
  => which means AFAIU that it cannot happen that an invalidly
  encoded character plus the sentinel together form a valid character
  (so that the sentinel could not be stripped off), as no partial byte
  sequence can be completed by these bytes/characters to a valid
  character in any locale/encoding.
(see 6.1 Portable Character Set [1])
So my first thought was, that with either . or / as sentinel value,
none of the LC_* stuff would be necessary and it would be still
guaranteed that it works as expected (in any conforming shell).
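In other words, the variant I had in mind would simply be the
following (a sketch only; some_command is again a stand-in, and
whether this is really guaranteed to work is exactly what I'm asking
about):

```shell
# Stand-in for the real command; its output ends in three newlines.
some_command() { printf 'line\n\n\n'; }

# Use <period> as the sentinel and do not touch any LC_* variable.
output=$(some_command && printf .)
output=${output%.}
```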
However, that idea may have been shot down again by [2] and [3].
Koichi Murase’s point seems to be that the following could happen:
'<some invalid character encoding>.'
with '.' being the sentinel value.
While that sequence doesn't make up a valid character (because of
POSIX's requirements), it may still cause the decoder of the encoding
to fail to strip off the sentinel, e.g. if it simply stops working
once the invalid encoding is encountered.
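A small probe for that case could look like the sketch below. It uses
a lone byte \x88 (octal 210) as the invalid encoding, assuming the
current locale is one where that byte is not a complete character
(e.g. UTF-8 or zh_HK.big5hkscs); the names stripped and hex are made
up. I deliberately don't claim what a given shell prints:

```shell
# Invalid lone byte 0x88, followed by the '.' sentinel.
output=$(printf '\210' && printf .)
stripped=${output%.}

# Dump the remaining bytes as hex: "88" means the sentinel was removed,
# "882e" means it was left in place.
hex=$(printf '%s' "$stripped" | od -An -tx1 | tr -d ' \n')
printf 'bytes after stripping: %s\n' "$hex"
```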
So my questions would be:
1) What's the opinion on that from the POSIX side?
Are locales/encodings required to still handle the above case
gracefully?
Or is the POSIX point of view, that one really has to do the LC_ALL
trick to be 100% sure?
2) Koichi also pointed out earlier[4]:
> In theory, ISO/IEC 2022 encoding allows to change the meaning of
> C0 (\x00-\x1F), GL (\x21-\x7E), C1 (\x80-\x9F), and GR
> (\xA0-\xAF) by locking shift escape sequences. In particular, all
> the bit combinations (i.e. bytes) in GL which contain ASCII "."
> and "x" can be used for trailing bytes of 94^n character sets
> (such as LC_CTYPE=ja_JP.ISO-2022-JP). The only two bit-
> combinations that are unaffected by the ISO/IEC 2022 shifts
> are SP (space \x20) and DEL (^? or \x7F). But actually, the
> encodings that are fully ISO/IEC 2022 have hardly used as user
> locales because most utilities have problems in dealing with such
> context-dependent encoding schemes.
I'd assume that POSIX's provisions simply forbid such shifting then
(at least with respect to NUL . / LF and CR)?
3) Does POSIX define anywhere which values a shell variable is required
to be able to store?
I only found that NUL is excluded, but that alone doesn't mean that
any other byte value is required to work.
Thanks for your help,
Chris.
[0] https://unix.stackexchange.com/questions/383217
[1] https://lists.zytor.com/archives/klibc/2022-January/004659.html
[2] https://lists.gnu.org/archive/html/help-bash/2022-01/msg00052.html
[3] https://lists.gnu.org/archive/html/help-bash/2022-01/msg00059.html
[4] https://lists.gnu.org/archive/html/help-bash/2021-05/msg00180.html